cd ./data
make
Changes in preprocess_data.py:
- the preprocess_data.py script moved to the megatron folder; it now supports tokenizers from HuggingFace Transformers
- input can be a folder containing multiple json/jsonl files
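A minimal sketch of what such an input folder can look like, assuming each .jsonl file holds one JSON object per line with the document text under a "text" key (the key name is an assumption here, not confirmed by these notes):

```python
import json
import os
import tempfile

# Assumed input layout: a folder of .jsonl files, one JSON document per line,
# with the raw text stored under a "text" key.
records = [
    {"text": "First document. It has two sentences."},
    {"text": "Second document."},
]

input_dir = tempfile.mkdtemp()
path = os.path.join(input_dir, "part-000.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back the way a line-oriented preprocessor would.
with open(path, encoding="utf-8") as f:
    docs = [json.loads(line)["text"] for line in f]
print(len(docs))
```

The folder (rather than a single file) would then be passed via --input.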
Example usage with an HF tokenizer:

python preprocess_data.py \
    --input ./train \
    --output-prefix ./train \
    --dataset-impl mmap \
    --tokenizer-type HFTokenizer \
    --tokenizer-name-or-path bert-base-uncased \
    --split-sentences \
    --workers 8