cd ./data
make
Changes in preprocess_data.py:
- the preprocess_data.py script moved to the megatron folder; it now supports tokenizers from HuggingFace Transformers
- input can be a folder containing multiple json/jsonl files
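A minimal sketch of what such an input folder can look like, assuming each .jsonl file holds one JSON object per line with the document text under a "text" key (the key name is an assumption here, not confirmed by these notes):

```python
import json
import os
import tempfile

# Assumed input layout: a folder of .jsonl files, one JSON document per line,
# with the raw text stored under a "text" key.
records = [
    {"text": "First document. It has two sentences."},
    {"text": "Second document."},
]

input_dir = tempfile.mkdtemp()
path = os.path.join(input_dir, "part-000.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back the way a line-oriented preprocessor would.
with open(path, encoding="utf-8") as f:
    docs = [json.loads(line)["text"] for line in f]
print(len(docs))
```

The folder (rather than a single file) would then be passed via --input.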
Example usage with an HF tokenizer:

python preprocess_data.py \
    --input ./train \
    --output-prefix ./train \
    --dataset-impl mmap \
    --tokenizer-type HFTokenizer \
    --tokenizer-name-or-path bert-base-uncased \
    --split-sentences \
    --workers 8