DAPT playbooks - with NeMo 2.0 (NVIDIA#12067)
* DAPT with NeMo 2.0
* Apply isort and black reformatting
* Deleting file not needed
* Update README.md
* Addressing feedback from PR review for DAPT playbook with NeMo 2.0
* Addressing feedback for DAPT with NeMo 2.0
* Addressing feedback for DAPT with NeMo 2.0 - local executor
* Add Copyright

Signed-off-by: jvamaraju <[email protected]>
Co-authored-by: jvamaraju <[email protected]>
Co-authored-by: aastha <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
1 parent: 40ab2f1
Commit: 9e44d16
Showing 40 changed files with 3,927 additions and 0 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,8 @@
/code/general_data/*
/code/data/*
/code/models/*
/code/nemo_experiments/
./preprocessed_data_text_document.bin
./preprocessed_data_text_document.idx
./llama2_7b.py
./test_convert.py
@@ -0,0 +1,35 @@
# ChipNeMo - Custom Tokenization + Domain-Adaptive Pre-training on Llama 2 7B with NeMo Framework

[ChipNeMo](https://arxiv.org/pdf/2311.00176) is a chip-design domain-adapted Large Language Model (LLM). Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pre-training, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained on more than 20 billion tokens of domain-specific chip design data, including code and documents. They are then fine-tuned with instruction datasets from design data as well as external sources. Evaluations of the resulting domain-adapted ChipNeMo model demonstrate that domain-adaptive pre-training can lead to superior performance on domain-related downstream tasks compared to the base Llama 2 counterparts, without degrading generic capabilities.

Here, we share a tutorial with best practices for custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.
## Requirements

### Software Requirements
* Access to the latest NeMo Framework NGC containers
* This playbook has been tested on nvcr.io/nvidia/nemo:24.07 and is expected to work similarly in other environments.
### Hardware Requirements
* This playbook can run on CPUs or GPUs. For GPUs, it has been tested on a minimum of 2 x A100 80 GB. A quick environment check is sketched below.
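Before launching the notebooks, it can help to confirm that the container and hardware meet the requirements above. The following is a minimal sketch, assuming it is run inside the NeMo Framework container where `torch` and `nemo` are importable; it is not part of the playbook itself.

```python
# Minimal environment sanity check (a sketch, not part of the playbook itself).
# Assumes it runs inside the NeMo Framework container, where `nemo` and `torch`
# are already installed.
import torch
import nemo

print(f"NeMo version: {nemo.__version__}")

if torch.cuda.is_available():
    n_gpus = torch.cuda.device_count()
    print(f"Visible GPUs: {n_gpus}")
    for i in range(n_gpus):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"  GPU {i}: {props.name}, {vram_gb:.0f} GB")
    if n_gpus < 2:
        print("Note: the pretraining step was tested with at least 2 x A100 80 GB.")
else:
    print("No GPUs visible; only CPU-based steps will run.")
```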
### Data Curation

* In this tutorial, we leverage chip-domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. As a prerequisite, the user should curate the domain-specific and general-purpose data with NeMo Curator and place it in the directories listed below (a quick layout check is sketched after this list).

* `./code/data` should contain the curated chip-domain data after processing with NeMo Curator. A playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation).

* `./code/general_data` should contain open-source general-purpose data of the kind Llama 2 was trained on. This data helps identify token/vocabulary differences between general-purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/), etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial).
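The directory layout above can be verified before running the notebooks. The sketch below assumes the curated output is JSON Lines with a `text` field, which is a common NeMo Curator output format but may differ in your setup; the directory names come from the list above.

```python
# A sketch for checking that curated data is in place. The directory names come
# from the list above; the JSONL-with-"text"-field layout is an assumption about
# the curator output, not something this playbook mandates.
import json
from pathlib import Path

EXPECTED_DIRS = ["./code/data", "./code/general_data"]

for d in EXPECTED_DIRS:
    path = Path(d)
    if not path.is_dir():
        print(f"Missing directory: {d}")
        continue
    files = sorted(path.glob("**/*.jsonl"))
    print(f"{d}: {len(files)} .jsonl file(s)")
    # Peek at the first record of the first file, if any, to confirm the schema.
    if files:
        with files[0].open() as f:
            first = json.loads(f.readline())
        print(f"  example keys in {files[0].name}: {sorted(first.keys())}")
```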
## Custom Tokenization for DAPT

After placing the curated data in the directories mentioned above, we can proceed with custom tokenization and DAPT.

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT. A simplified sketch of the core idea follows.
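The notebook covers the full workflow; below is only a minimal sketch of the underlying idea (train a tokenizer on the domain corpus and compare its vocabulary against the base tokenizer) using the `sentencepiece` library. The file paths, vocabulary size, and the assumption that a base Llama 2 `tokenizer.model` is available locally are placeholders, not the notebook's exact code.

```python
# Sketch: find candidate domain-specific tokens by training a SentencePiece
# model on the curated chip-domain text and diffing its vocabulary against the
# base Llama 2 tokenizer. Paths and vocab_size are illustrative placeholders.
import sentencepiece as spm

# Train a tokenizer on the domain corpus (plain-text file, one document per line).
spm.SentencePieceTrainer.train(
    input="./code/data/domain_corpus.txt",   # placeholder path
    model_prefix="domain_tokenizer",
    vocab_size=32000,
    model_type="bpe",
)

# Load the freshly trained domain tokenizer and the base Llama 2 tokenizer.
domain_sp = spm.SentencePieceProcessor(model_file="domain_tokenizer.model")
base_sp = spm.SentencePieceProcessor(model_file="./code/models/tokenizer.model")  # placeholder path

domain_vocab = {domain_sp.id_to_piece(i) for i in range(domain_sp.get_piece_size())}
base_vocab = {base_sp.id_to_piece(i) for i in range(base_sp.get_piece_size())}

# Pieces present in the domain tokenizer but not in the base vocabulary are
# candidates to add to the base tokenizer (embeddings then need to be extended).
new_tokens = sorted(domain_vocab - base_vocab)
print(f"{len(new_tokens)} candidate domain tokens, e.g. {new_tokens[:10]}")
```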
## Pretraining for DAPT

Once we have the domain-adapted custom tokenizer from above, we can proceed with pretraining using the custom tokenizer.

* `./code/domain_adaptive_pretraining.ipynb` walks through the pretraining workflow required for DAPT. A rough sketch of launching a run with the NeMo 2.0 recipe API is shown below.
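For orientation only, the following is a rough sketch of what a NeMo 2.0 pretraining launch can look like with `nemo_run` and the recipe API. The recipe name (`llm.llama2_7b.pretrain_recipe`), the executor settings, and all paths and parameters are assumptions based on the NeMo 2.0 recipe pattern; the notebook is the authoritative reference for restoring the Llama 2 checkpoint and wiring in the custom tokenizer.

```python
# Rough sketch (assumptions, not the notebook's exact code): configure a NeMo 2.0
# pretraining recipe for Llama 2 7B and run it locally with nemo_run.
import nemo_run as run
from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    # The llama2_7b recipe and its arguments follow the generic NeMo 2.0 recipe
    # pattern; consult domain_adaptive_pretraining.ipynb for the exact
    # configuration, including the custom tokenizer and resume settings.
    recipe = llm.llama2_7b.pretrain_recipe(
        name="dapt_llama2_7b",
        dir="./code/nemo_experiments",    # logs and checkpoints land here
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,  # matches the 2 x A100 80 GB minimum
    )
    recipe.trainer.max_steps = 100        # placeholder; set from your token budget
    return recipe


if __name__ == "__main__":
    recipe = configure_recipe()
    executor = run.LocalExecutor(ntasks_per_node=2, launcher="torchrun")
    run.run(recipe, executor=executor)
```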
tutorials/llm/llama/domain-adaptive-pretraining/code/custom_tokenization.ipynb: 1,920 changes (1,920 additions and 0 deletions). Large diffs are not rendered by default.