DAPT playbooks - with NeMo 2.0 (#12067)
* DAPT with NeMo 2.0

* DAPT with NeMo 2.0

* Apply isort and black reformatting

Signed-off-by: jvamaraju <[email protected]>

* Deleting file not needed

* Update README.md

Signed-off-by: jvamaraju <[email protected]>

* Addressing feedback from PR review for DAPT playbook with nemo 2.0

* Addressing feedback for DAPT with nemo 2.0

* Addressing feedback for DAPT with nemo 2.0- local executor

* Add Copyright

---------

Signed-off-by: jvamaraju <[email protected]>
Signed-off-by: jvamaraju <[email protected]>
Co-authored-by: jvamaraju <[email protected]>
Co-authored-by: aastha <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
4 people authored Feb 10, 2025
1 parent b3dcfd0 commit 7d81505
Showing 40 changed files with 3,927 additions and 0 deletions.
File renamed without changes.
8 changes: 8 additions & 0 deletions tutorials/llm/llama/domain-adaptive-pretraining/.gitignore
@@ -0,0 +1,8 @@
/code/general_data/*
/code/data/*
/code/models/*
/code/nemo_experiments/
./preprocessed_data_text_document.bin
./preprocessed_data_text_document.idx
./llama2_7b.py
./test_convert.py
35 changes: 35 additions & 0 deletions tutorials/llm/llama/domain-adaptive-pretraining/README.md
@@ -0,0 +1,35 @@
# ChipNeMo - Custom tokenization + Domain Adaptive Pre-training on Llama 2 7b with NeMo Framework

[ChipNeMo](https://arxiv.org/pdf/2311.00176) is a chip design domain-adapted Large Language Model (LLM). Instead of directly deploying off-the-shelf commercial or open-source LLMs, the paper adopts the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pre-training, model alignment with domain-specific instructions, and domain-adapted retrieval models. Specifically, Llama 2 foundation models are continually pre-trained with more than 20 billion tokens on domain-specific chip design data, including code and documents. They are then fine-tuned with instruction datasets from design data as well as external sources. Evaluations on the resultant domain-adapted ChipNeMo model demonstrate that domain-adaptive pre-training of language models can lead to superior performance in domain-related downstream tasks compared to their base Llama 2 counterparts, without degradations in generic capabilities.

Here, we share a tutorial with best practices on custom tokenization and DAPT (Domain-Adaptive Pre-Training) for a ChipNeMo-like code generation use case.

## Requirements

### Software Requirements
* Access to the latest NeMo Framework NGC containers
* This playbook has been tested on nvcr.io/nvidia/nemo:24.07. It is expected to work similarly in other environments.

### Hardware Requirements
* This playbook can run on CPUs or GPUs. For GPUs, it has been tested on a minimum of 2 x A100 80 GB.

### Data Curation

* In this tutorial, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Therefore, as a prerequisite, the user should curate the domain-specific and general-purpose data using NeMo Curator and place it in the directories mentioned below.

* `./code/data` should contain curated data from the chip domain after processing with NeMo Curator. The playbook for DAPT data curation can be found [here](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/dapt-curation).

* `./code/general_data` should contain open-source general-purpose data that Llama 2 was trained on. This data helps identify token/vocabulary differences between general-purpose and domain-specific datasets; a quick comparison is sketched below. Data can be downloaded from [Wikipedia](https://huggingface.co/datasets/legacy-datasets/wikipedia), [CommonCrawl](https://data.commoncrawl.org/), etc., and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial).
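
To get a quick sense of the vocabulary gap before building a custom tokenizer, you can compare how the base Llama 2 tokenizer fragments the two corpora. The snippet below is an illustrative sketch (not part of the notebooks): it assumes the curated text is available as `.txt` files under the directories above and uses the Hugging Face `transformers` tokenizer.

```python
# Illustrative sketch: compare subword fragmentation of general-purpose vs.
# domain-specific text under the base Llama 2 tokenizer. Paths and the
# tokenizer ID are placeholders; adjust them to your setup.
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")


def avg_tokens_per_word(data_dir: str, max_files: int = 1000) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens, n_words = 0, 0
    for i, path in enumerate(Path(data_dir).rglob("*.txt")):
        if i >= max_files:
            break
        text = path.read_text(errors="ignore")
        n_tokens += len(tokenizer.tokenize(text))
        n_words += len(text.split())
    return n_tokens / max(n_words, 1)


print("general-purpose:", avg_tokens_per_word("./code/general_data"))
print("chip domain    :", avg_tokens_per_word("./code/data"))
# A noticeably higher ratio on the domain data suggests the base vocabulary
# under-represents domain terms, which motivates adding domain-specific tokens.
```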


## Custom Tokenization for DAPT

After placing the curated data in the directories mentioned above, we can proceed with custom tokenization and DAPT.

* `./code/custom_tokenization.ipynb` walks through the custom tokenization workflow required for DAPT
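
For orientation, the sketch below shows the general idea behind domain-adaptive tokenization: train a small SentencePiece model on the domain corpus, add the pieces the base tokenizer does not already cover, and resize the model embeddings so the new tokens can be learned during DAPT. This is a simplified illustration with placeholder paths, not the notebook's exact code; ChipNeMo additionally initializes the new embedding rows from the original subword embeddings rather than leaving them at the default initialization.

```python
# Simplified illustration of extending the Llama 2 tokenizer with domain tokens.
# All paths and sizes are placeholders; see the notebook for the real workflow.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Train a small domain tokenizer on the curated chip-design text.
spm.SentencePieceTrainer.train(
    input="./code/data/domain_corpus.txt",  # placeholder: concatenated domain text
    model_prefix="domain_tok",
    vocab_size=8000,
    model_type="bpe",
)
domain_sp = spm.SentencePieceProcessor(model_file="domain_tok.model")

# 2) Keep only the pieces the base Llama 2 tokenizer does not already know.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base_vocab = set(base_tok.get_vocab())
new_tokens = [
    domain_sp.id_to_piece(i)
    for i in range(domain_sp.get_piece_size())
    if domain_sp.id_to_piece(i) not in base_vocab
]

# 3) Extend the tokenizer and grow the embedding table so the new rows can be
#    learned during domain-adaptive pretraining.
base_tok.add_tokens(new_tokens)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(base_tok))
base_tok.save_pretrained("./code/models/llama2_7b_domain_tokenizer")
```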

## Pretraining for DAPT

Once we have the domain-adapted custom tokenizer from the previous step, we can proceed with pretraining using the custom tokenizer.

* `./code/domain_adaptive_pretraining.ipynb` walks through the pretraining workflow required for DAPT
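
For reference, a DAPT run with NeMo 2.0 can be launched through the recipe API with a local executor, roughly as sketched below. The recipe name (`llm.llama2_7b`), arguments, and paths are assumptions for illustration only; the notebook is the authoritative reference for the actual data module, tokenizer, and trainer configuration.

```python
# Hedged sketch of launching DAPT with the NeMo 2.0 recipe API and nemo_run's
# LocalExecutor. Recipe name, arguments, and paths are illustrative assumptions.
import nemo_run as run
from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    # Start from the Llama 2 7B pretraining recipe; in a real DAPT run you would
    # also point it at the domain data and the custom tokenizer built above.
    recipe = llm.llama2_7b.pretrain_recipe(
        name="dapt_llama2_7b",
        dir="./code/nemo_experiments",  # placeholder output directory
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    recipe.trainer.max_steps = 100  # shorten for a quick smoke test
    return recipe


if __name__ == "__main__":
    recipe = configure_recipe()
    # Run locally with torchrun; match ntasks_per_node to the GPU count.
    executor = run.LocalExecutor(ntasks_per_node=2, launcher="torchrun")
    run.run(recipe, executor=executor)
```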
1,920 changes: 1,920 additions & 0 deletions tutorials/llm/llama/domain-adaptive-pretraining/code/custom_tokenization.ipynb

Large diffs are not rendered by default.

