DAPT playbooks - with NeMo 2.0 (NVIDIA#12067)
* DAPT with NeMo 2.0
* Apply isort and black reformatting
* Deleting file not needed
* Update README.md
* Addressing feedback from PR review for DAPT playbook with NeMo 2.0
* Addressing feedback for DAPT with NeMo 2.0
* Addressing feedback for DAPT with NeMo 2.0 - local executor
* Add Copyright

Signed-off-by: jvamaraju <[email protected]>
Co-authored-by: jvamaraju <[email protected]>
Co-authored-by: aastha <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
1 parent: 40ab2f1
Commit: 9e44d16
Showing 40 changed files with 3,927 additions and 0 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,8 @@
/code/general_data/*
/code/data/*
/code/models/*
/code/nemo_experiments/
./preprocessed_data_text_document.bin
./preprocessed_data_text_document.idx
./llama2_7b.py
./test_convert.py
@@ -0,0 +1,35 @@
# ChipNeMo - Custom Tokenization + Domain-Adaptive Pre-training on Llama 2 7B with NeMo Framework

[ChipNeMo](https://arxiv.org/pdf/2311.00176) is a chip-design domain-adapted Large Language Model (LLM). Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pre-training, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained on more than 20 billion tokens of domain-specific chip design data, including code and documents. They are then fine-tuned with instruction datasets from design data as well as external sources. Evaluations of the resulting domain-adapted ChipNeMo model demonstrate that domain-adaptive pre-training can lead to superior performance on domain-related downstream tasks compared to the base Llama 2 counterparts, without degrading generic capabilities.

Here, we share a tutorial with best practices for custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.
## Requirements

### Software Requirements
* Access to the latest NeMo Framework NGC containers
* This playbook has been tested on nvcr.io/nvidia/nemo:24.07 and is expected to work similarly in other environments.
### Hardware Requirements
* This playbook can run on CPUs or GPUs. For GPUs, it has been tested on a minimum of 2 x A100 80 GB. A quick environment check is sketched below.
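Before launching the notebooks, it can help to confirm that the container and hardware meet the requirements above. The following is a minimal sketch, assuming it is run inside the NeMo Framework container where `torch` and `nemo` are importable; it is not part of the playbook itself.

```python
# Minimal environment sanity check (a sketch, not part of the playbook itself).
# Assumes it runs inside the NeMo Framework container, where `nemo` and `torch`
# are already installed.
import torch
import nemo

print(f"NeMo version: {nemo.__version__}")

if torch.cuda.is_available():
    n_gpus = torch.cuda.device_count()
    print(f"Visible GPUs: {n_gpus}")
    for i in range(n_gpus):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"  GPU {i}: {props.name}, {vram_gb:.0f} GB")
    if n_gpus < 2:
        print("Note: the pretraining step was tested with at least 2 x A100 80 GB.")
else:
    print("No GPUs visible; only CPU-based steps will run.")
```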
### Data Curation

* In this tutorial, we leverage chip-domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. As a prerequisite, the user should curate the domain-specific and general-purpose data with NeMo Curator and place it in the directories listed below (a quick layout check is sketched after this list).

* `./code/data` should contain the curated chip-domain data after processing with NeMo Curator. A playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation).

* `./code/general_data` should contain open-source general-purpose data of the kind Llama 2 was trained on. This data helps identify token/vocabulary differences between general-purpose and domain-specific datasets. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/), etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial).
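The directory layout above can be verified before running the notebooks. The sketch below assumes the curated output is JSON Lines with a `text` field, which is a common NeMo Curator output format but may differ in your setup; the directory names come from the list above.

```python
# A sketch for checking that curated data is in place. The directory names come
# from the list above; the JSONL-with-"text"-field layout is an assumption about
# the curator output, not something this playbook mandates.
import json
from pathlib import Path

EXPECTED_DIRS = ["./code/data", "./code/general_data"]

for d in EXPECTED_DIRS:
    path = Path(d)
    if not path.is_dir():
        print(f"Missing directory: {d}")
        continue
    files = sorted(path.glob("**/*.jsonl"))
    print(f"{d}: {len(files)} .jsonl file(s)")
    # Peek at the first record of the first file, if any, to confirm the schema.
    if files:
        with files[0].open() as f:
            first = json.loads(f.readline())
        print(f"  example keys in {files[0].name}: {sorted(first.keys())}")
```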
## Custom Tokenization for DAPT

After placing the curated data in the directories mentioned above, we can proceed with custom tokenization and DAPT.

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT. A simplified sketch of the core idea follows.
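The notebook covers the full workflow; below is only a minimal sketch of the underlying idea (train a tokenizer on the domain corpus and compare its vocabulary against the base tokenizer) using the `sentencepiece` library. The file paths, vocabulary size, and the assumption that a base Llama 2 `tokenizer.model` is available locally are placeholders, not the notebook's exact code.

```python
# Sketch: find candidate domain-specific tokens by training a SentencePiece
# model on the curated chip-domain text and diffing its vocabulary against the
# base Llama 2 tokenizer. Paths and vocab_size are illustrative placeholders.
import sentencepiece as spm

# Train a tokenizer on the domain corpus (plain-text file, one document per line).
spm.SentencePieceTrainer.train(
    input="./code/data/domain_corpus.txt",   # placeholder path
    model_prefix="domain_tokenizer",
    vocab_size=32000,
    model_type="bpe",
)

# Load the freshly trained domain tokenizer and the base Llama 2 tokenizer.
domain_sp = spm.SentencePieceProcessor(model_file="domain_tokenizer.model")
base_sp = spm.SentencePieceProcessor(model_file="./code/models/tokenizer.model")  # placeholder path

domain_vocab = {domain_sp.id_to_piece(i) for i in range(domain_sp.get_piece_size())}
base_vocab = {base_sp.id_to_piece(i) for i in range(base_sp.get_piece_size())}

# Pieces present in the domain tokenizer but not in the base vocabulary are
# candidates to add to the base tokenizer (embeddings then need to be extended).
new_tokens = sorted(domain_vocab - base_vocab)
print(f"{len(new_tokens)} candidate domain tokens, e.g. {new_tokens[:10]}")
```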
## Pretraining for DAPT

Once we have the domain-adapted custom tokenizer from above, we can proceed with pretraining using the custom tokenizer.

* `./code/domain_adaptive_pretraining.ipynb` walks through the pretraining workflow required for DAPT. A rough sketch of launching a run with the NeMo 2.0 recipe API is shown below.
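For orientation only, the following is a rough sketch of what a NeMo 2.0 pretraining launch can look like with `nemo_run` and the recipe API. The recipe name (`llm.llama2_7b.pretrain_recipe`), the executor settings, and all paths and parameters are assumptions based on the NeMo 2.0 recipe pattern; the notebook is the authoritative reference for restoring the Llama 2 checkpoint and wiring in the custom tokenizer.

```python
# Rough sketch (assumptions, not the notebook's exact code): configure a NeMo 2.0
# pretraining recipe for Llama 2 7B and run it locally with nemo_run.
import nemo_run as run
from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    # The llama2_7b recipe and its arguments follow the generic NeMo 2.0 recipe
    # pattern; consult domain_adaptive_pretraining.ipynb for the exact
    # configuration, including the custom tokenizer and resume settings.
    recipe = llm.llama2_7b.pretrain_recipe(
        name="dapt_llama2_7b",
        dir="./code/nemo_experiments",    # logs and checkpoints land here
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,  # matches the 2 x A100 80 GB minimum
    )
    recipe.trainer.max_steps = 100        # placeholder; set from your token budget
    return recipe


if __name__ == "__main__":
    recipe = configure_recipe()
    executor = run.LocalExecutor(ntasks_per_node=2, launcher="torchrun")
    run.run(recipe, executor=executor)
```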
tutorials/llm/llama/domain-adaptive-pretraining/code/custom_tokenization.ipynb: 1,920 changes (1,920 additions and 0 deletions). Large diffs are not rendered by default.