Commit
merge with master
XapaJIaMnu committed Aug 17, 2023
2 parents 5d7d080 + 3f93e65 commit e80b6bb
Showing 113 changed files with 7,456 additions and 720 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -61,5 +61,4 @@ examples/mnist/*ubyte
/vs/MarianDll.VC.VC.opendb

.vs
.vscode

.vscode
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -6,7 +6,33 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Added `--no-spm-encode` option, allowing the model to use vocabulary IDs directly for training and decoding.
- Added `--custom-fallbacks` option that allows specifying a list of option sets that are traversed for subsequent fallbacks upon divergence.
- Added `--overwrite-checkpoint` option that (when set to `false`) can be used to dump checkpoints with iteration numbers.
- Implementations of COMET-20 (reference-based) and BLEURT-20 for inference, with conversion scripts.
- `./marian evaluate` sub-command for evaluation with COMET-QE-20, COMET-20, and BLEURT-20 (see the usage sketch after this list).
- A bunch of scripts for metrics use and early MBR experiments.
- LSH vocab filtering for GPU. Speed is not competitive with non-LSH; checking in for completeness and possible future use of LSH on GPU for non-filtering purposes.
- Added `--throw-on-divergence` and `--fp16-fallback-to-fp32` options to detect divergence (in fp16 and fp32) and recover from it (fp16 only). If a run is not recoverable, the exception is rethrown and left unhandled to force a fatal error and shutdown.
- Re-implementation of COMET-QE for inference and training; conversion scripts from Unbabel-Comet to Marian.
- Validator that generates embeddings and can be used during COMET training with an external script.
- New experimental layer framework for Transformer-like models.
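
A minimal usage sketch for two of the entries above, assuming Marian's usual flags for model (`-m`) and vocabularies (`-v`); the file names here are hypothetical and the exact argument layout of `evaluate` may differ:

    # score source/hypothesis pairs with a converted COMET-QE-20 checkpoint (hypothetical paths)
    ./marian evaluate -m comet-qe-20.npz -v vocab.spm vocab.spm -t source.txt hypothesis.txt

    # decode directly from vocabulary IDs, skipping SentencePiece encoding
    ./marian-decoder -m model.npz -v vocab.spm vocab.spm --no-spm-encode < ids.txt > out.txt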

### Fixed
- Fixed wrong parameter name for norm in the new layer framework.
- Fixed unit test for LayerNorm.
- Only collect batch statistics during mini-batch-fit up to the actual max-length.
- Implemented a fully correct version of GELU instead of a bad approximation via Swish.
- Handle copying from fp32 or fp16 embeddings in embedder mode correctly.
- Corrected defaults for factored embeddings so that shared-library use works (moved out of config.h/cpp).

### Changed
- Removed `--num-devices N` option that wasn't really used by anyone (I assume).


## [1.12.0] - 2023-02-20

### Added
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
v1.12.0
v1.12.12
9 changes: 6 additions & 3 deletions azure-pipelines.yml
@@ -13,8 +13,11 @@ parameters:
type: boolean
default: true

# The pipeline CI trigger is set on the branch master only and PR trigger on a
# (non-draft) pull request to any branch
# Warning: the current branch policies disable the automatic triggering to
# minimize VM usage!
# The configuration below specifies that the pipeline CI trigger is set on the
# branch master only and a PR trigger is on a (non-draft) pull request to any
# branch.
trigger:
# This minimizes the number of parallel pipeline runs. When a pipeline is
# running, the CI waits until it is completed before starting another one.
@@ -368,7 +371,7 @@ stages:
-DCOMPILE_CPU=on \
-DCOMPILE_CUDA=off \
-DCOMPILE_EXAMPLES=on \
-DCOMPILE_SERVER=on \
-DCOMPILE_SERVER=off \
-DCOMPILE_TESTS=on \
-DUSE_FBGEMM=on \
-DUSE_SENTENCEPIECE=on \
29 changes: 18 additions & 11 deletions azure-regression-tests.yml
@@ -64,6 +64,24 @@ stages:
displayName: Collect system info
workingDirectory: regression-tests
# Always run regression tests from the master branch
# The current SAS token will expire on 12/31/2023 and a new one will need to be set in Marian > Pipelines > Library
# This is run at the beginning for easier debugging of the Python environment
- bash: |
set -x
git checkout master
git pull origin master
# Uninstall Cython because the newest 3.0.0 is incompatible with the newest
# available versions of pyyaml and numpy as of July 2023
python3 -m pip uninstall -y cython
python3 -m pip install 'cython<3'
# These modules will be installed via `make install` below, but Cython needs to be installed beforehand
python3 -m pip install 'pyyaml<6.0.1' 'numpy>=1.22,<2' websocket-client
make install
displayName: Prepare regression tests
env:
AZURE_STORAGE_SAS_TOKEN: $(marian-pub-tests-blob-sas-token)
workingDirectory: regression-tests
# https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html
- bash: |
wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB" | sudo apt-key add -
@@ -106,17 +124,6 @@
displayName: Run unit tests
workingDirectory: build

# Always run regression tests from the master branch
# The current SAS token will expire on 12/31/2023 and a new one will need to be set in Marian > Pipelines > Library
- bash: |
git checkout master
git pull origin master
make install
displayName: Prepare regression tests
env:
AZURE_STORAGE_SAS_TOKEN: $(marian-pub-tests-blob-sas-token)
workingDirectory: regression-tests
# Continue on error to be able to collect outputs and publish them as an artifact
- bash: MARIAN=../build ./run_mrt.sh
continueOnError: true
30 changes: 30 additions & 0 deletions cmake/Tarball.cmake
@@ -0,0 +1,30 @@
# marian-YYYY-MM-DD-revision.tgz
# This combines marian, marian-decoder, and related tools in a single TAR file
# for execution in the MSFT-internal tools FLO and Singularity.

execute_process(
COMMAND bash -c "TZ=America/Los_Angeles date +%Y-%m-%d"
OUTPUT_VARIABLE TGZ_DATE
OUTPUT_STRIP_TRAILING_WHITESPACE)

execute_process(
COMMAND git rev-parse --short=7 HEAD
OUTPUT_VARIABLE TGZ_REV
OUTPUT_STRIP_TRAILING_WHITESPACE)

message("Generating ${CWD}/marian-${TGZ_DATE}-${TGZ_REV}.tgz")

# check if pigz is available for faster compression
execute_process(
COMMAND bash -c "which pigz || which gzip"
OUTPUT_VARIABLE COMPRESS
OUTPUT_STRIP_TRAILING_WHITESPACE)

execute_process(
COMMAND tar -I ${COMPRESS} -cvvf "${CWD}/marian-${TGZ_DATE}-${TGZ_REV}.tgz" -C "${CWD}"
marian
marian-decoder
marian-scorer
marian-vocab
marian-conv
WORKING_DIRECTORY "${CWD}")
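
Tarball.cmake references `CWD` without defining it, so the packaging step presumably passes it in. A sketch of a manual invocation in CMake script mode, where `-DCWD` pointing at the build directory is an assumption:

    # run from the build directory that contains the marian binaries
    cmake -DCWD=$PWD -P ../cmake/Tarball.cmake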
23 changes: 23 additions & 0 deletions scripts/bert/contrib/chpt2pt.py
@@ -0,0 +1,23 @@
#!/usr/bin/env python3
"""
This script converts *.chpt files to *.pt files, which can be useful for extracting only the weights from larger training checkpoints.
"""

import torch
import argparse

# Create a parser for command line arguments
parser = argparse.ArgumentParser()

# Add arguments for the source and target files
parser.add_argument("--source", type=str, required=True, help="Path to the source *.chpt file")
parser.add_argument("--target", type=str, required=True, help="Path to the target *.pt file")

# Parse the command line arguments
args = parser.parse_args()

# Load the model from the source file
model = torch.load(args.source)

# Save the model to the target file
torch.save(model, args.target)
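
A usage sketch with hypothetical file names, matching the two required arguments defined above:

    # re-save a full training checkpoint as a plain PyTorch weight file
    python3 scripts/bert/contrib/chpt2pt.py --source model.chpt --target model.pt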
153 changes: 153 additions & 0 deletions scripts/bert/contrib/hugging2marian.py
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""
This script converts a Huggingface BERT (XLM-RoBERTa) model to a Marian weight file.
"""

import argparse
import numpy as np
import sys
import yaml

from transformers import XLMRobertaModel

parser = argparse.ArgumentParser(description='Convert Huggingface Bert model to Marian weight file.')
parser.add_argument('--bert', help='Path to Huggingface Bert PyTorch model', required=True)
parser.add_argument('--marian', help='Output path for Marian weight file', required=True)
args = parser.parse_args()

huggingface = XLMRobertaModel.from_pretrained(args.bert)
huggingface.eval()

print(huggingface.config)

config = dict()
config["type"] = "bert-classifier"
config["input-types"] = ["sequence"]
config["tied-embeddings-all"] = True
config["tied-embeddings-src"] = False

config["transformer-ffn-depth"] = 2
config["transformer-train-position-embeddings"] = True
config["transformer-preprocess"] = ""
config["transformer-postprocess"] = "dan"
config["transformer-postprocess-emb"] = "nd"
config["bert-train-type-embeddings"] = False
# @TODO: figure out if it's worth adding `cometModel.name_or_path` to the end of this version string.
config["version"] = "huggingface2marian.py conversion"

config["enc-depth"] = 0
config["transformer-dim-ffn"] = huggingface.config.intermediate_size
config["transformer-heads"] = huggingface.config.num_attention_heads
config["transformer-ffn-activation"] = huggingface.config.hidden_act

config["bert-sep-symbol"] = "</s>"
config["bert-class-symbol"] = "</s>"

marianModel = dict()

def transposeOrder(mat):
    matT = np.transpose(mat)  # just a view with changed row order
    return matT.flatten(order="C").reshape(matT.shape)  # force row order change and reshape


def convert(pd, srcs, trg, transpose=True, bias=False):
    if len(srcs) == 1:
        for src in srcs:
            num = pd[src].detach().numpy()
            if bias:
                marianModel[trg] = np.atleast_2d(num)
            else:
                if transpose:
                    marianModel[trg] = transposeOrder(num)  # transpose with row order change
                else:
                    marianModel[trg] = num
    else:  # path that joins matrices together for fused self-attention
        nums = [pd[src].detach().numpy() for src in srcs]
        if bias:
            nums = [np.transpose(np.atleast_2d(num)) for num in nums]
        marianModel[trg] = np.stack(nums, axis=0)


def extract(layer, nth, level):
    name = type(layer).__name__
    print(" " * level, nth, name)
    if name == "BertLayer":
        pd = dict(layer.named_parameters())
        for n in pd:
            print(" " * (level + 1), n, pd[n].shape)

        convert(pd, ["attention.self.query.weight"], f"encoder_l{nth + 1}_self_Wq", transpose=True)
        convert(pd, ["attention.self.key.weight"], f"encoder_l{nth + 1}_self_Wk")
        convert(pd, ["attention.self.value.weight"], f"encoder_l{nth + 1}_self_Wv")

        convert(pd, ["attention.self.query.bias"], f"encoder_l{nth + 1}_self_bq", bias=True)
        convert(pd, ["attention.self.key.bias"], f"encoder_l{nth + 1}_self_bk", bias=True)
        convert(pd, ["attention.self.value.bias"], f"encoder_l{nth + 1}_self_bv", bias=True)

        convert(pd, ["attention.output.dense.weight"], f"encoder_l{nth + 1}_self_Wo")
        convert(pd, ["attention.output.dense.bias"], f"encoder_l{nth + 1}_self_bo", bias=True)

        convert(pd, ["attention.output.LayerNorm.weight"], f"encoder_l{nth + 1}_self_Wo_ln_scale", bias=True)
        convert(pd, ["attention.output.LayerNorm.bias"], f"encoder_l{nth + 1}_self_Wo_ln_bias", bias=True)

        convert(pd, ["intermediate.dense.weight"], f"encoder_l{nth + 1}_ffn_W1")
        convert(pd, ["intermediate.dense.bias"], f"encoder_l{nth + 1}_ffn_b1", bias=True)
        convert(pd, ["output.dense.weight"], f"encoder_l{nth + 1}_ffn_W2")
        convert(pd, ["output.dense.bias"], f"encoder_l{nth + 1}_ffn_b2", bias=True)

        convert(pd, ["output.LayerNorm.weight"], f"encoder_l{nth + 1}_ffn_ffn_ln_scale", bias=True)
        convert(pd, ["output.LayerNorm.bias"], f"encoder_l{nth + 1}_ffn_ffn_ln_bias", bias=True)

        config["enc-depth"] += 1

    elif name == "BertEmbeddings":
        for n, p in layer.named_parameters():
            print(" " * (level + 1), n, p.shape)
        pd = dict(layer.named_parameters())
        convert(pd, ["word_embeddings.weight"], "Wemb", transpose=False)
        convert(pd, ["position_embeddings.weight"], "Wpos", transpose=False)

        config["bert-type-vocab-size"] = 0
        if hasattr(layer, "token_type_embeddings"):
            convert(pd, ["token_type_embeddings.weight"], "Wtype", transpose=False)
            config["bert-type-vocab-size"] = pd["token_type_embeddings.weight"].shape[0]
            config["bert-train-type-embeddings"] = True

        convert(pd, ["LayerNorm.weight"], "encoder_emb_ln_scale_pre", bias=True)
        convert(pd, ["LayerNorm.bias"], "encoder_emb_ln_bias_pre", bias=True)

        config["dim-emb"] = pd["word_embeddings.weight"].shape[1]
        config["dim-vocabs"] = [pd["word_embeddings.weight"].shape[0]]
        config["max-length"] = pd["position_embeddings.weight"].shape[0]

    elif name == "BertPooler":
        for n, p in layer.named_parameters():
            print(" " * (level + 1), n, p.shape)

        pd = dict(layer.named_parameters())
        convert(pd, ["dense.weight"], "classifier_ff_logit_l1_W")
        convert(pd, ["dense.bias"], "classifier_ff_logit_l1_b", bias=True)

    else:
        recurse(layer, level + 1)

def recurse(parent, level=0):
    for i, child in enumerate(parent.children()):
        extract(child, i, level)

recurse(huggingface)

for m in marianModel:
print(m, marianModel[m].shape)

# Serialize the config as YAML and pack it into the weight file as an int8
# array under the special key from which Marian reads model options.
configYamlStr = yaml.dump(config, default_flow_style=False)
desc = list(configYamlStr)
npDesc = np.chararray((len(desc),))
npDesc[:] = desc
npDesc.dtype = np.int8
marianModel["special:model.yml"] = npDesc

print("\nMarian config:")
print(configYamlStr)
print("Saving Marian model to %s" % (args.marian,))
np.savez(args.marian, **marianModel)
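
A usage sketch; the Huggingface model name and output path are hypothetical:

    # convert an XLM-RoBERTa checkpoint to a Marian .npz weight file
    python3 scripts/bert/contrib/hugging2marian.py --bert xlm-roberta-base --marian roberta-base.npz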