Megatron-LM

Readme

Original Megatron-LM readme

Installation

cd ./data
make

Data Preprocessing

See the data preprocessing docs from Megatron-LM.

Changes in preprocess_data.py:

  • The preprocess_data.py script has been moved to the megatron folder.
  • It supports tokenizers from Hugging Face Transformers.
  • The input can be a folder containing multiple json/jsonl files.
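
To make the folder input concrete, here is a minimal sketch of preparing such a folder. It assumes this fork keeps the upstream Megatron-LM convention of one JSON object per line with the document stored under a "text" key. The helper names `write_jsonl` and `list_input_files` are hypothetical, written for illustration only; they are not part of preprocess_data.py and may not match how the script discovers files.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, texts, key="text"):
    """Write one JSON object per line, storing each document under `key`.

    The "text" key is an assumption taken from upstream Megatron-LM.
    """
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({key: text}) + "\n")


def list_input_files(folder):
    """Return the .json and .jsonl files directly inside `folder`, sorted by name.

    Hypothetical helper: the real script may discover files differently.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix in {".json", ".jsonl"}
    )


# Build a small input folder inside a temporary directory.
input_dir = Path(tempfile.mkdtemp()) / "train"
input_dir.mkdir()
write_jsonl(input_dir / "part-000.jsonl", ["First document.", "Second document."])
write_jsonl(input_dir / "part-001.jsonl", ["Third document."])

files = list_input_files(input_dir)
print([p.name for p in files])
```

A folder laid out this way is what would be passed to `--input`.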

Example usage with a Hugging Face tokenizer:

python preprocess_data.py \
       --input ./train \
       --output-prefix ./train \
       --dataset-impl mmap \
       --tokenizer-type HFTokenizer \
       --tokenizer-name-or-path bert-base-uncased \
       --split-sentences --workers 8