- Big improvements:
  - Autocast / mixed precision: `bfloat16` instead of `float16`. We can now train larger models on larger batches using 16-bit float ops without the loss overflowing to infinity (a minimal autocast sketch follows this list).
    - WARNING: requires PyTorch 1.10 or newer. Please upgrade!
  - Validation BLEU scores are computed without teacher forcing, i.e., similar to inference, so BLEU is a more realistic estimate of test-time BLEU.
    - WARNING: validation can be slower. Don't use too big a validation set.
  - Schedules:
    - `inverse_sqrt` supports a scaler multiplier term, similar to `noam`
    - `inverse_root` schedule added, a generalization of `inverse_sqrt`
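A minimal sketch of what such a schedule can look like; the exact parameterization in rtg may differ, and the names `warmup`, `power`, and `scaler` are illustrative assumptions, not rtg config keys. Setting `power=0.5` recovers the inverse-sqrt case.

```python
# Sketch only (not rtg's implementation): inverse-root LR schedule with
# linear warmup and a scaler multiplier; power=0.5 is the inverse_sqrt case.
def inverse_root_lr(step: int, warmup: int = 4000, power: float = 0.5,
                    scaler: float = 1.0) -> float:
    step = max(step, 1)
    if step <= warmup:
        return scaler * step / warmup          # linear warmup
    return scaler * (warmup / step) ** power   # inverse-root decay after warmup

# inverse_sqrt is the power=0.5 special case
assert abs(inverse_root_lr(16000, warmup=4000) - 0.5) < 1e-9
```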
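The autocast sketch referenced above, in plain PyTorch (torch >= 1.10); `model`, `optimizer`, `batch`, and `loss_fn` are placeholders, not rtg internals.

```python
import torch

# Minimal sketch of bfloat16 autocast training; placeholder objects, not rtg code.
def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        out = model(batch.src, batch.tgt_in)
        loss = loss_fn(out, batch.tgt_out)
    # bfloat16 keeps float32's exponent range, so no GradScaler is needed
    # (unlike float16, which can overflow to inf without loss scaling)
    loss.backward()
    optimizer.step()
    return loss.item()
```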
- fixes
  - `rtg.prep` CLI arguments work now
  - Optimizer state loading now works when resuming training
  - Parent model will be recreated if missing, even when the _PREPARED flag exists
  - `rtg.fork` accepts multiple `to_dir` arguments, thus supports cloning multiple times at once
  - Bug fix: early stopping in distributed parallel training
- `rtg.tool.augment` added to support data augmentations
- Attention visualization added in `rtg.serve`, powered by plotly
- `rtg.pipeline` and `rtg.fork` use relative symlinks instead of absolute paths
- `rtg.decode` shows decoding speed (segs, src_toks, hyp_toks)
- `batch_size` is auto-adjusted based on the number of workers and gradient_accum (huh! finally); see the effective-batch sketch after these items
- `batch_size` normalizer in the distributed training setting (fix! faster convergence now)
- Support for `byte` encoding added
- Validation metrics: previously BLEU was teacher-forced, similar to validation loss; now BLEU is computed from autoregressive output (resembling test time)
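The effective-batch sketch referenced above. This assumes `batch_size` is the desired effective size and each forward pass uses a proportionally smaller micro-batch; the function name and exact semantics are illustrative assumptions, not rtg's code.

```python
# Illustrative arithmetic only (assumed semantics, not rtg's implementation):
# workers x gradient_accum micro-batches should add up to the requested size.
def micro_batch_size(batch_size: int, num_workers: int, gradient_accum: int) -> int:
    assert batch_size % (num_workers * gradient_accum) == 0, "should divide evenly"
    return batch_size // (num_workers * gradient_accum)

# e.g. 4096 effective = 8 workers x 2 accumulation steps x 256 each
assert micro_batch_size(4096, num_workers=8, gradient_accum=2) == 256
```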
- Use bfloat16 for mixed precision training, requires torch 1.10+
- Redesign of the registry: decorators are now used to register all modules
- The `optim` block is split into `optimizer`, `schedule`, and `criterion`; as a result, this version is not backward compatible with prior versions. Refer to the migration guide.
- `NoamOpt` replaced with `ScheduledOptimizer`, which takes scheduler and optimizer objects that are independently configurable from conf.yml
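An illustrative sketch of the split configuration, written as a Python dict for readability. The specific names and values (`adam`, `noam`, `smooth_kld`, warmup settings) are assumptions for illustration; check the migration guide for the actual keys.

```python
# Sketch of the new layout: optimizer, schedule, and criterion configured
# independently instead of a single `optim` block (values are illustrative).
conf_fragment = {
    'optimizer': {'name': 'adam',
                  'args': {'lr': 0.1, 'betas': [0.9, 0.98], 'eps': 1e-9}},
    'schedule':  {'name': 'noam',
                  'args': {'warmup': 8000, 'constant': 2, 'model_dim': 512}},
    'criterion': {'name': 'smooth_kld',
                  'args': {'label_smoothing': 0.1}},
}
```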
- Add transformer sequence classification model `tfmcls`; supports initialization from a pretrained NMT model (picks encoder layers, source embeddings, and source vocab from the NMT experiment)
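A minimal sketch of the initialization idea in plain PyTorch; the class and attribute names (`src_embed`, `encoder`) are placeholders, not rtg's `tfmcls` implementation.

```python
import torch.nn as nn

# Sketch only: reuse a pretrained NMT encoder and source embeddings,
# add a fresh classification head on top.
class SeqClassifier(nn.Module):
    def __init__(self, nmt_model, n_classes: int, model_dim: int = 512):
        super().__init__()
        self.src_embed = nmt_model.src_embed         # reuse source embeddings
        self.encoder = nmt_model.encoder             # reuse encoder layers
        self.head = nn.Linear(model_dim, n_classes)  # new, randomly initialized

    def forward(self, src, src_mask=None):
        enc = self.encoder(self.src_embed(src), src_mask)
        return self.head(enc.mean(dim=1))            # pool over time, then classify
```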
- Fix `rtg.decode` bug (partial migration to the new API)
  - Test case added for the `decode` API so we can catch such errors in the future
- Add `rtg-params` command that shows trainable parameters in the model (layer-wise as well as total); a counting sketch follows these items
- TextTransform moved inside Experiment
  - Integrated into rtg.pipeline as well as into validation metrics
  - Validation on detokenized BLEU, chrF, etc. is now supported
  - `valid_tgt_raw` is now required
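The counting sketch referenced above: layer-wise and total trainable parameter counts for any PyTorch module. This illustrates what `rtg-params` reports; it is not the command's implementation.

```python
from collections import Counter
import torch.nn as nn

# Sketch: count trainable parameters per top-level layer and in total.
def count_params(model: nn.Module):
    per_layer = Counter()
    for name, p in model.named_parameters():
        if p.requires_grad:
            per_layer[name.split('.')[0]] += p.numel()
    return dict(per_layer), sum(per_layer.values())

# usage: layers, total = count_params(my_model)
```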
- Criterion:
  - Sparse and Dense CrossEntropy
  - Weighted Cross Entropy, with label smoothing (a minimal torch-based example follows this list)
  - Dice Loss (WIP)
  - Squared Error
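The example referenced above: weighted, label-smoothed cross-entropy using stock PyTorch (torch >= 1.10). It illustrates the criterion concept only; it is not rtg's implementation, and the sizes and values are arbitrary.

```python
import torch
import torch.nn as nn

# Illustration of weighted cross-entropy with label smoothing (plain PyTorch).
vocab_size, pad_idx = 8000, 0
weights = torch.ones(vocab_size)
weights[pad_idx] = 0.0                       # zero weight for the padding class
criterion = nn.CrossEntropyLoss(weight=weights, label_smoothing=0.1,
                                ignore_index=pad_idx)

logits = torch.randn(32, vocab_size)         # [batch, vocab]
target = torch.randint(1, vocab_size, (32,)) # [batch]
loss = criterion(logits, target)
```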
- Vocab management:
  - `prep.pieces` originally took a string; now it can take either a string (i.e., the same scheme for both source and target) or `[string, string]`, a separate scheme for source and target when `shared=false`.
  - Example: `pieces: [char, bpe]` with `shared: false` makes `char` pieces on the source side and `bpe` pieces on the target side.
- `rtg.serve` supports flexible transformations on source (pre-processing) and target (post-processing)
- Travis build configured to auto-run tests
- Sequence classification is now supported via the `tfmcls` model
- DDP: multi-node training; see `scripts/slurm-multinode-launch.sh`
- FP16 and mixed precision (upgrade from APEX to torch's built-in AMP)
- NLCodec & NLDb integration for scaling to large datasets using pyspark backend
- Web UI rtg-serve
- Cache ensemble state for rtg-decode
- Docker images for 500-eng model
- Parent child transfer: Shrink parent model vocab and embeddings to child datasets
- Fix packaging of flask app: now templates and static files are also included in PyPI package
- Fix issue with dec_bos_cut complaining that tensors are not on contiguous storage
- REST API
- Docker for deployment