Commit
merge with master
XapaJIaMnu committed Aug 17, 2023
2 parents 5d7d080 + 3f93e65 commit e80b6bb
Showing 113 changed files with 7,456 additions and 720 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -61,5 +61,4 @@ examples/mnist/*ubyte
/vs/MarianDll.VC.VC.opendb

.vs
.vscode

.vscode
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -6,7 +6,33 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Added `--no-spm-encode` option, allowing the model to use vocabulary IDs directly for training and decoding.
- Added `--custom-fallbacks` option that allows specifying a list of option sets that are traversed for subsequent fallbacks upon divergence.
- Added `--overwrite-checkpoint` option that (when set to `false`) can be used to dump checkpoints with iteration numbers.
- Implementations of COMET-20 (reference-based) and BLEURT-20 for inference, with conversion scripts.
- `./marian evaluate` sub-command for evaluation with COMET-QE-20, COMET-20, and BLEURT-20 (see the usage sketch after this list).
- A bunch of scripts for metrics use and early MBR experiments.
- LSH vocab filtering for GPU. Speed is not competitive with non-LSH; checking in for completeness and possible future use of LSH on GPU for non-filtering purposes.
- Added `--throw-on-divergence` and `--fp16-fallback-to-fp32` options to detect divergence (in fp16 and fp32) and recover from it (fp16 only). If a run is not recoverable, the exception is rethrown and left unhandled to force a fatal error and shutdown.
- Re-implementation of COMET-QE for inference and training; conversion scripts from Unbabel-Comet to Marian.
- Validator that generates embeddings and can be used during COMET training with an external script.
- New experimental layer framework for Transformer-like models.
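
A minimal usage sketch for two of the entries above, assuming Marian's usual flags for model (`-m`) and vocabularies (`-v`); the file names here are hypothetical and the exact argument layout of `evaluate` may differ:

    # score source/hypothesis pairs with a converted COMET-QE-20 checkpoint (hypothetical paths)
    ./marian evaluate -m comet-qe-20.npz -v vocab.spm vocab.spm -t source.txt hypothesis.txt

    # decode directly from vocabulary IDs, skipping SentencePiece encoding
    ./marian-decoder -m model.npz -v vocab.spm vocab.spm --no-spm-encode < ids.txt > out.txt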

### Fixed
- Fixed wrong parameter name for norm in the new layer framework.
- Fixed unit test for LayerNorm.
- Only collect batch statistics during mini-batch-fit up to the actual max-length.
- Implemented a fully correct version of GELU instead of a bad approximation via Swish.
- Handle copying from fp32 or fp16 embeddings in embedder mode correctly.
- Corrected defaults for factored embeddings so that shared-library use works (moved out of config.h/cpp).

### Changed
- Removed `--num-devices N` option that wasn't really used by anyone (I assume).


## [1.12.0] - 2023-02-20

### Added
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
v1.12.0
v1.12.12
9 changes: 6 additions & 3 deletions azure-pipelines.yml
@@ -13,8 +13,11 @@ parameters:
type: boolean
default: true

# The pipeline CI trigger is set on the branch master only and PR trigger on a
# (non-draft) pull request to any branch
# Warning: the current branch policies disable the automatic triggering to
# minimize VM usage!
# The configuration below specifies that the pipeline CI trigger is set on the
# branch master only and a PR trigger is on a (non-draft) pull request to any
# branch.
trigger:
# This minimizes the number of parallel pipeline runs. When a pipeline is
# running, the CI waits until it is completed before starting another one.
@@ -368,7 +371,7 @@ stages:
-DCOMPILE_CPU=on \
-DCOMPILE_CUDA=off \
-DCOMPILE_EXAMPLES=on \
-DCOMPILE_SERVER=on \
-DCOMPILE_SERVER=off \
-DCOMPILE_TESTS=on \
-DUSE_FBGEMM=on \
-DUSE_SENTENCEPIECE=on \
29 changes: 18 additions & 11 deletions azure-regression-tests.yml
@@ -64,6 +64,24 @@ stages:
displayName: Collect system info
workingDirectory: regression-tests
# Always run regression tests from the master branch
# The current SAS token will expire on 12/31/2023 and a new one will need to be set in Marian > Pipelines > Library
# This is run at the beginning for easier debugging of the Python environment
- bash: |
set -x
git checkout master
git pull origin master
# Uninstall Cython because the newest 3.0.0 is incompatible with the newest
# available versions of pyyaml and numpy as of July 2023
python3 -m pip uninstall -y cython
python3 -m pip install 'cython<3'
# These modules will be installed via `make install` below, but Cython needs to be installed beforehand
python3 -m pip install 'pyyaml<6.0.1' 'numpy>=1.22,<2' websocket-client
make install
displayName: Prepare regression tests
env:
AZURE_STORAGE_SAS_TOKEN: $(marian-pub-tests-blob-sas-token)
workingDirectory: regression-tests
# https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html
- bash: |
wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB" | sudo apt-key add -
@@ -106,17 +124,6 @@
displayName: Run unit tests
workingDirectory: build

# Always run regression tests from the master branch
# The current SAS token will expire on 12/31/2023 and a new one will need to be set in Marian > Pipelines > Library
- bash: |
git checkout master
git pull origin master
make install
displayName: Prepare regression tests
env:
AZURE_STORAGE_SAS_TOKEN: $(marian-pub-tests-blob-sas-token)
workingDirectory: regression-tests
# Continue on error to be able to collect outputs and publish them as an artifact
- bash: MARIAN=../build ./run_mrt.sh
continueOnError: true
30 changes: 30 additions & 0 deletions cmake/Tarball.cmake
@@ -0,0 +1,30 @@
# marian-YYYY-MM-DD-revision.tgz
# This combines marian, marian-decoder, and related tools in a single TAR file
# for execution in the MSFT-internal tools FLO and Singularity.

execute_process(
COMMAND bash -c "TZ=America/Los_Angeles date +%Y-%m-%d"
OUTPUT_VARIABLE TGZ_DATE
OUTPUT_STRIP_TRAILING_WHITESPACE)

execute_process(
COMMAND git rev-parse --short=7 HEAD
OUTPUT_VARIABLE TGZ_REV
OUTPUT_STRIP_TRAILING_WHITESPACE)

message("Generating ${CWD}/marian-${TGZ_DATE}-${TGZ_REV}.tgz")

# check if pigz is available for faster compression
execute_process(
COMMAND bash -c "which pigz || which gzip"
OUTPUT_VARIABLE COMPRESS
OUTPUT_STRIP_TRAILING_WHITESPACE)

execute_process(
COMMAND tar -I ${COMPRESS} -cvvf "${CWD}/marian-${TGZ_DATE}-${TGZ_REV}.tgz" -C "${CWD}"
marian
marian-decoder
marian-scorer
marian-vocab
marian-conv
WORKING_DIRECTORY "${CWD}")
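
Tarball.cmake references `CWD` without defining it, so the packaging step presumably passes it in. A sketch of a manual invocation in CMake script mode, where `-DCWD` pointing at the build directory is an assumption:

    # run from the build directory that contains the marian binaries
    cmake -DCWD=$PWD -P ../cmake/Tarball.cmake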
23 changes: 23 additions & 0 deletions scripts/bert/contrib/chpt2pt.py
@@ -0,0 +1,23 @@
#!/usr/bin/env python3
"""
This script converts *.chpt files to *.pt files, which can be useful for extracting only the weights from larger training checkpoints.
"""

import torch
import argparse

# Create a parser for command line arguments
parser = argparse.ArgumentParser()

# Add arguments for the source and target files
parser.add_argument("--source", type=str, required=True, help="Path to the source *.chpt file")
parser.add_argument("--target", type=str, required=True, help="Path to the target *.pt file")

# Parse the command line arguments
args = parser.parse_args()

# Load the model from the source file
model = torch.load(args.source)

# Save the model to the target file
torch.save(model, args.target)
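
A usage sketch with hypothetical file names, matching the two required arguments defined above:

    # re-save a full training checkpoint as a plain PyTorch weight file
    python3 scripts/bert/contrib/chpt2pt.py --source model.chpt --target model.pt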
153 changes: 153 additions & 0 deletions scripts/bert/contrib/hugging2marian.py
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""
This script converts a Huggingface BERT (XLM-RoBERTa) model to a Marian weight file.
"""

import argparse
import numpy as np
import sys
import yaml

from transformers import XLMRobertaModel

parser = argparse.ArgumentParser(description='Convert Huggingface Bert model to Marian weight file.')
parser.add_argument('--bert', help='Path to Huggingface Bert PyTorch model', required=True)
parser.add_argument('--marian', help='Output path for Marian weight file', required=True)
args = parser.parse_args()

huggingface = XLMRobertaModel.from_pretrained(args.bert)
huggingface.eval()

print(huggingface.config)

config = dict()
config["type"] = "bert-classifier"
config["input-types"] = ["sequence"]
config["tied-embeddings-all"] = True
config["tied-embeddings-src"] = False

config["transformer-ffn-depth"] = 2
config["transformer-train-position-embeddings"] = True
config["transformer-preprocess"] = ""
config["transformer-postprocess"] = "dan"
config["transformer-postprocess-emb"] = "nd"
config["bert-train-type-embeddings"] = False
# @TODO: figure out if it's worth adding `cometModel.name_or_path` to the end of this version string.
config["version"] = "huggingface2marian.py conversion"

config["enc-depth"] = 0
config["transformer-dim-ffn"] = huggingface.config.intermediate_size
config["transformer-heads"] = huggingface.config.num_attention_heads
config["transformer-ffn-activation"] = huggingface.config.hidden_act

config["bert-sep-symbol"] = "</s>"
config["bert-class-symbol"] = "</s>"

marianModel = dict()

def transposeOrder(mat):
    matT = np.transpose(mat)  # just a view with changed row order
    return matT.flatten(order="C").reshape(matT.shape)  # force row order change and reshape


def convert(pd, srcs, trg, transpose=True, bias=False):
    if len(srcs) == 1:
        for src in srcs:
            num = pd[src].detach().numpy()
            if bias:
                marianModel[trg] = np.atleast_2d(num)
            else:
                if transpose:
                    marianModel[trg] = transposeOrder(num)  # transpose with row order change
                else:
                    marianModel[trg] = num
    else:  # path that joins matrices together for fused self-attention
        nums = [pd[src].detach().numpy() for src in srcs]
        if bias:
            nums = [np.transpose(np.atleast_2d(num)) for num in nums]
        marianModel[trg] = np.stack(nums, axis=0)


def extract(layer, nth, level):
    name = type(layer).__name__
    print(" " * level, nth, name)
    if name == "BertLayer":
        pd = dict(layer.named_parameters())
        for n in pd:
            print(" " * (level + 1), n, pd[n].shape)

        convert(pd, ["attention.self.query.weight"], f"encoder_l{nth + 1}_self_Wq", transpose=True)
        convert(pd, ["attention.self.key.weight"], f"encoder_l{nth + 1}_self_Wk")
        convert(pd, ["attention.self.value.weight"], f"encoder_l{nth + 1}_self_Wv")

        convert(pd, ["attention.self.query.bias"], f"encoder_l{nth + 1}_self_bq", bias=True)
        convert(pd, ["attention.self.key.bias"], f"encoder_l{nth + 1}_self_bk", bias=True)
        convert(pd, ["attention.self.value.bias"], f"encoder_l{nth + 1}_self_bv", bias=True)

        convert(pd, ["attention.output.dense.weight"], f"encoder_l{nth + 1}_self_Wo")
        convert(pd, ["attention.output.dense.bias"], f"encoder_l{nth + 1}_self_bo", bias=True)

        convert(pd, ["attention.output.LayerNorm.weight"], f"encoder_l{nth + 1}_self_Wo_ln_scale", bias=True)
        convert(pd, ["attention.output.LayerNorm.bias"], f"encoder_l{nth + 1}_self_Wo_ln_bias", bias=True)

        convert(pd, ["intermediate.dense.weight"], f"encoder_l{nth + 1}_ffn_W1")
        convert(pd, ["intermediate.dense.bias"], f"encoder_l{nth + 1}_ffn_b1", bias=True)
        convert(pd, ["output.dense.weight"], f"encoder_l{nth + 1}_ffn_W2")
        convert(pd, ["output.dense.bias"], f"encoder_l{nth + 1}_ffn_b2", bias=True)

        convert(pd, ["output.LayerNorm.weight"], f"encoder_l{nth + 1}_ffn_ffn_ln_scale", bias=True)
        convert(pd, ["output.LayerNorm.bias"], f"encoder_l{nth + 1}_ffn_ffn_ln_bias", bias=True)

        config["enc-depth"] += 1

    elif name == "BertEmbeddings":
        for n, p in layer.named_parameters():
            print(" " * (level + 1), n, p.shape)
        pd = dict(layer.named_parameters())
        convert(pd, ["word_embeddings.weight"], "Wemb", transpose=False)
        convert(pd, ["position_embeddings.weight"], "Wpos", transpose=False)

        config["bert-type-vocab-size"] = 0
        if hasattr(layer, "token_type_embeddings"):
            convert(pd, ["token_type_embeddings.weight"], "Wtype", transpose=False)
            config["bert-type-vocab-size"] = pd["token_type_embeddings.weight"].shape[0]
            config["bert-train-type-embeddings"] = True

        convert(pd, ["LayerNorm.weight"], "encoder_emb_ln_scale_pre", bias=True)
        convert(pd, ["LayerNorm.bias"], "encoder_emb_ln_bias_pre", bias=True)

        config["dim-emb"] = pd["word_embeddings.weight"].shape[1]
        config["dim-vocabs"] = [pd["word_embeddings.weight"].shape[0]]
        config["max-length"] = pd["position_embeddings.weight"].shape[0]

    elif name == "BertPooler":
        for n, p in layer.named_parameters():
            print(" " * (level + 1), n, p.shape)

        pd = dict(layer.named_parameters())
        convert(pd, ["dense.weight"], "classifier_ff_logit_l1_W")
        convert(pd, ["dense.bias"], "classifier_ff_logit_l1_b", bias=True)

    else:
        recurse(layer, level + 1)

def recurse(parent, level=0):
    for i, child in enumerate(parent.children()):
        extract(child, i, level)

recurse(huggingface)

for m in marianModel:
print(m, marianModel[m].shape)

# Serialize the config as YAML and pack it into the weight file as an int8
# array under the special key from which Marian reads model options.
configYamlStr = yaml.dump(config, default_flow_style=False)
desc = list(configYamlStr)
npDesc = np.chararray((len(desc),))
npDesc[:] = desc
npDesc.dtype = np.int8
marianModel["special:model.yml"] = npDesc

print("\nMarian config:")
print(configYamlStr)
print("Saving Marian model to %s" % (args.marian,))
np.savez(args.marian, **marianModel)
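
A usage sketch; the Huggingface model name and output path are hypothetical:

    # convert an XLM-RoBERTa checkpoint to a Marian .npz weight file
    python3 scripts/bert/contrib/hugging2marian.py --bert xlm-roberta-base --marian roberta-base.npz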