
add jamba convergence test #1

Open
wants to merge 89 commits into main

Conversation


@yubofredwang yubofredwang commented Sep 5, 2024

Summary

Add convergence test for jamba model monkey patching

Testing Done

  • Hardware Type: A100-80G-PCIe
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

helloworld1 and others added 30 commits August 28, 2024 10:26
## Summary
Make GPU CI optional until it is more stable


## Testing Done
Testing CI


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary

Add gemma lightning example for single L40 GPU


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
## Summary
Aims to fix linkedin#89.

## Details
Casts to float32 at the correct places to match the Gemma and Llama references, in both the forward and backward passes. Also tightened the tolerances in the RMSNorm tests and added fp16 tests.
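
For intuition, a minimal sketch of the upcast pattern in a Llama-style reference RMSNorm (illustrative only, not the Triton kernel):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Upcast to float32 before computing the variance, matching the
    # Llama/Gemma references, then cast back to the input dtype at the end.
    input_dtype = x.dtype
    x = x.to(torch.float32)
    variance = x.pow(2).mean(-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return weight * x.to(input_dtype)
```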

## Testing Done
Ran tests for convergence and RMSNorm.
```
test/convergence/test_mini_models.py ........                            [100%]

========================= 8 passed in 78.70s (0:01:18) =========================

test/transformers/test_rms_norm.py ................................................                                                                                                                                                                            [100%]

========================================================================================================================= 48 passed in 4.62s =========================================================================================================================

```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Yun Dai <[email protected]>
## Summary
Adds an optional bias param for fused linear cross entropy!
Added bias = {True, False} to the testing space.
Also changed weight/bias generation in tests to uniform rand instead of
normal (seems more stable for low-precision bfloat16).
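
For reference, the semantics being fused, with the new optional bias (a hypothetical reference helper, not the kernel itself):

```python
import torch.nn.functional as F

def flce_reference(x, weight, target, bias=None):
    # Linear projection (bias, when provided, is simply added to the
    # logits) followed by cross entropy over the vocabulary.
    logits = F.linear(x, weight, bias)
    return F.cross_entropy(logits, target)
```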

## Testing Done
Tests for convergence + tests for fused linear cross entropy.

**Results**
```
test/transformers/test_fused_linear_cross_entropy.py ............                                                                                                                                                                                              [100%]

======================================================================================================================== 12 passed in 31.61s =========================================================================================================================
```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
## Summary
as title 

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
updated seqlen for rope to be non-constexpr

## Details
Changed the sequence-length argument `sl` from a `tl.constexpr` kernel argument to a regular runtime argument, so Triton does not recompile the kernel for every distinct sequence length.
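
A minimal sketch of the difference in a Triton kernel signature (hypothetical kernel, not the actual rope kernel):

```python
import triton
import triton.language as tl

@triton.jit
def sketch_kernel(
    x_ptr,
    sl,  # now a runtime argument: no recompilation per sequence length
    BLOCK_SIZE: tl.constexpr,  # still constexpr: needed for tl.arange shapes
):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < sl  # a runtime value works fine in a bounds check
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x, mask=mask)
```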

## Testing Done


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
1. bf16 loss rtol was a bit too loose; tighten it by one digit.
2. Slightly loosen the gemma1 atol; it has been failing.
3. Older `transformers` versions don't carry the phi3 source code (tested on 4.40.1); since we claim support for >= 4.40.1, adjust the import a bit so things still work on older HF versions.
4. Rerun all benchmarks to reflect the latest performance, in preparation for the new release.

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: Yun Dai <[email protected]>
## Summary
With Gemma2 support, the import fails on `transformers<4.42.0`.

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Add missing tf_keras to requirements.txt
## Details
Add the missing tf_keras to requirements.txt; otherwise `pip install -r requirements.txt` followed by running the script results in:

```
ValueError: Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
```

## Testing Done
Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: jaszhu <[email protected]>
## Summary
Turn on GPU CI enforcement


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
for release

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
A new release with features should bump the minor version.

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
…flags are mutually exclusive (linkedin#168)

## Summary
Bug fix for gemma: the fused_linear_cross_entropy flag and the cross_entropy flag are mutually exclusive.

## Testing Done
Done; tested model training:

```
{'loss': 2.5808, 'grad_norm': 77.5, 'learning_rate': 3e-06, 'epoch': 0.0, 'num_input_tokens_seen': 18256}

  5%|▌         | 1/20 [00:05<01:45,  5.55s/it]
 10%|█         | 2/20 [00:07<01:04,  3.59s/it]
                                              
{'loss': 2.652, 'grad_norm': 80.0, 'learning_rate': 6e-06, 'epoch': 0.0, 'num_input_tokens_seen': 33376, 'step': 2, 'step_time_sec': 2.18, 'avg_step_time_sec': 2.18, 'time_to_completion_sec': 39.29, 'estimated_total_time_sec': 43.66, 'step_peak_memory_allocated_MB': 21965.16, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21965.16, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6926.16, 'avg_tokens_per_second': 6926.16}

 10%|█         | 2/20 [00:07<01:04,  3.59s/it]
 15%|█▌        | 3/20 [00:09<00:49,  2.93s/it]
                                              
{'loss': 2.1275, 'grad_norm': 46.0, 'learning_rate': 5.954423259036625e-06, 'epoch': 0.0, 'num_input_tokens_seen': 47504, 'step': 3, 'step_time_sec': 2.08, 'avg_step_time_sec': 2.13, 'time_to_completion_sec': 36.2, 'estimated_total_time_sec': 42.59, 'step_peak_memory_allocated_MB': 21998.28, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6804.83, 'avg_tokens_per_second': 6867.02}

 15%|█▌        | 3/20 [00:09<00:49,  2.93s/it]
 20%|██        | 4/20 [00:12<00:47,  2.94s/it]
                                              
{'loss': 1.7238, 'grad_norm': 15.75, 'learning_rate': 5.819077862357725e-06, 'epoch': 0.01, 'num_input_tokens_seen': 64176, 'step': 4, 'step_time_sec': 2.89, 'avg_step_time_sec': 2.38, 'time_to_completion_sec': 38.14, 'estimated_total_time_sec': 47.68, 'step_peak_memory_allocated_MB': 21734.73, 'step_peak_memory_reserved_MB': 35628.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 35628.0, 'step_tokens_per_second': 5763.26, 'avg_tokens_per_second': 6420.58}

```

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: jaszhu <[email protected]>
## Summary
Add gemma 7b it benchmark
## Testing Done
N/A

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: jaszhu <[email protected]>
## Summary
to catch linkedin#168

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary

Closes linkedin#87

Skipped tests for `bfloat16` on GPUs with compute capability below
Ampere architecture (`sm_80`).
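
A minimal sketch of the gating pattern (illustrative; the helper name is an assumption, not necessarily the repo's):

```python
import pytest
import torch

def supports_bfloat16() -> bool:
    # bfloat16 compute requires Ampere (sm_80, compute capability 8.0) or newer.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8

@pytest.mark.skipif(
    not supports_bfloat16(),
    reason="bfloat16 requires compute capability >= 8.0 (Ampere)",
)
def test_some_bf16_case():
    ...
```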


## Testing Done


- Hardware Type: NVIDIA **T4** (should skip most cases)
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

```
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss                                                   [ 36%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py .....s.....s....                                                                                    [ 43%]
test/transformers/test_geglu.py .s....ssss                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py ........ssssssss...............ssssssss........                                                                  [ 80%]
test/transformers/test_rope.py ......ssssss                                                                                                          [ 88%]
test/transformers/test_swiglu.py ....ssss.s....ssss                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 71 passed, 92 skipped in 136.69s (0:02:16) ========================================================
.s.s.s                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py .s.s.s.s.s.s.s                                                                                        [100%]

======================================================== 14 passed, 14 skipped in 353.27s (0:05:53) ========================================================
```

- Hardware Type: NVIDIA **L4** (should skip a few cases)
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

```
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py ........................................................ss                                                   [ 36%]
test/transformers/test_fused_linear_cross_entropy.py ...............                                                                                    [ 43%]
test/transformers/test_geglu.py .........                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py .................................................                                                                  [ 80%]
test/transformers/test_rope.py ............                                                                                                          [ 88%]
test/transformers/test_swiglu.py ..................                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 161 passed, 2 skipped in 90.45s (0:01:30) =========================================================
.......                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py ..............                                                                                        [100%]

============================================================== 28 passed in 290.65s (0:04:50) ==============================================================
```

## Additional Context
For your reference, here's a list of NVIDIA architecture names and their compute capabilities:

<img width="1268" alt="Screenshot 2024-08-29 at 6 04 56 PM"
src="https://github.com/user-attachments/assets/6675ae9e-9137-4adb-8af7-ee1226733353">

---------

Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
Integrated custom LayerNorm kernels + a LigerLayerNorm module.
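
A sketch of the intended drop-in usage (the import path and constructor signature here are assumptions):

```python
import torch
from liger_kernel.transformers import LigerLayerNorm  # path assumed

norm = LigerLayerNorm(4096).cuda()  # same semantics as torch.nn.LayerNorm(4096)
x = torch.randn(8, 128, 4096, device="cuda")
y = norm(x)
```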


## Testing Done
tested layernorm kernels for correctness


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
…inkedin#170)

## Summary

Fixes example in README to make it functional

## Testing Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
- Updated the banner animated code snippet to use the
AutoLigerKernelForCausalLM
- Added wave snippet to acknowledgements

## Testing Done
- Preview markdown

- Hardware Type: N/A
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
Added LayerNorm description to README


## Testing Done
N/A


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
- Removed torch.compile from the benchmark scripts for consistency and re-ran the benchmarks on A100

## Testing Done
Ran benchmark

- Hardware Type: A100
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Write down some learnings from this release :D

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
…rad context for easy reuse (linkedin#178)

## Summary
Extract the forward/backward core computation bits outside of the torch autograd context for easy reuse. This is beneficial for Lightning Thunder integration and for reusing the kernels in other contexts.

Double-checked the speed and memory usage: within variance range, no degradation.
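
A sketch of the refactor pattern, using plain RMSNorm math as a stand-in for the Triton kernels (illustrative names, not the repo's exact functions):

```python
import torch

# Core computation as plain functions, callable outside autograd
# (e.g. from a Lightning Thunder executor).
def rms_norm_forward(x, weight, eps):
    rstd = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * rstd * weight, rstd

def rms_norm_backward(dy, x, weight, rstd):
    dw = (dy * x * rstd).sum(dim=tuple(range(x.dim() - 1)))
    dx = dy * weight * rstd - x * rstd.pow(3) * (dy * weight * x).mean(-1, keepdim=True)
    return dx, dw

# Thin autograd wrapper that just delegates to the functions above.
class RMSNormFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, eps):
        y, rstd = rms_norm_forward(x, weight, eps)
        ctx.save_for_backward(x, weight, rstd)
        return y

    @staticmethod
    def backward(ctx, dy):
        x, weight, rstd = ctx.saved_tensors
        dx, dw = rms_norm_backward(dy, x, weight, rstd)
        return dx, dw, None
```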


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
- Added Embedding forward/backward kernels + a LigerEmbedding class which maps to nn.Embedding (usage sketch below)
- nn.Embedding is useful for encoder-only models such as BERT
- ref: linkedin#131
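
A sketch of the intended drop-in usage (the import path here is an assumption):

```python
import torch
from liger_kernel.transformers.experimental.embedding import LigerEmbedding  # path assumed

emb = LigerEmbedding(num_embeddings=30522, embedding_dim=768).cuda()
ids = torch.randint(0, 30522, (8, 128), device="cuda")
out = emb(ids)  # same shape and semantics as torch.nn.Embedding
```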


## Testing Done
- tested against nn.Embedding for correctness on various inputs
- tested with and without padding_idx


- Hardware Type: RTX 3090 + RTX 4090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
yubofredwang and others added 30 commits September 6, 2024 18:39
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
Reference Unsloth in header section


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Aim to solve linkedin#81.

## Details

### For loss:
Label smoothing regularization (LSR) replaces the label distribution $q(k) = \delta_{k,y}$ with
```math
q'(k) = (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}
```
Cross entropy with LSR is

```math
\begin{align}
L' = H(q', p) &= -\sum^K_{k=1} \log p(k)\, q'(k) = -\sum^K_{k=1} \log p(k) \left( (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K} \right)\\
              &= -(1 - \epsilon)\sum^K_{k=1} \log p(k)\, q(k) - \frac{\epsilon}{K}\sum^K_{k=1} \log p(k)\\
              &= (1 - \epsilon)H(q, p) - \frac{\epsilon}{K} \sum^K_{k=1} \log\, \mathrm{softmax}(x_k)\\
              &= (1 - \epsilon)L + \frac{\epsilon}{K}\, \mathrm{SmoothLoss},
\end{align}
```
where $L = H(q,p)$ is the original loss and $\mathrm{SmoothLoss} = -\sum^K_{k=1} \log\, \mathrm{softmax}(x_k)$ is the smooth loss.

### For gradients:
The original:
```math
\begin{align}
\frac{\partial L}{\partial x_i} &= p(i) - q(i)\\
                                &= \begin{cases}
                                       \mathrm{softmax}(x_i), & i \neq y \\
                                       \mathrm{softmax}(x_i) - 1, & i = y
                                   \end{cases}
\end{align}
```
With LSR:
```math
\begin{align}
\frac{\partial L'}{\partial x_i} &= p(i) - q'(i)\\
                                 &= \mathrm{softmax}(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K}\\
                                 &= \begin{cases}
                                        \mathrm{softmax}(x_i) - \frac{\epsilon}{K}, & i \neq y \\
                                        \mathrm{softmax}(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y
                                    \end{cases}
\end{align}
```

We can handle the $i = y$ case by simply adding $-(1-\epsilon)$ after
computing all $i$.
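
A minimal PyTorch sketch of these reference semantics, for intuition (the actual implementation is a Triton kernel):

```python
import torch

def label_smoothed_ce(logits, target, epsilon):
    # L' = (1 - eps) * L + (eps / K) * SmoothLoss, where
    # SmoothLoss = -sum_k log softmax(x)_k
    K = logits.size(-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    ce = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.sum(dim=-1)
    return ((1 - epsilon) * ce + (epsilon / K) * smooth).mean()
```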


Reference:
[Rethinking the Inception Architecture for Computer
Vision](https://arxiv.org/abs/1512.00567)

## Testing Done
Add a unit test for label smoothing.

- Hardware Type: RTX-3080
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
```bash
❯ python3 -m pytest test/transformers/test_cross_entropy.py
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 94 items

test/transformers/test_cross_entropy.py .............................................................. [ 65%]
...............................F                                                                       [100%]

================================================== FAILURES ==================================================
__________________________________ test_large_no_exception[8-16384-128256] ___________________________________

B = 8, T = 16384, V = 128256

    @pytest.mark.parametrize(
        "B, T, V",
        [
            (
                8,
                8192,
                128256,
            ),  # _input = 16GB, total = ~32GB, 8405385216 > 2,147,483,647, so we need int64
            (8, 16384, 128256),  # _input = 32GB, total = ~64GB
        ],
    )
    # @pytest.mark.skipif(
    #     torch.cuda.get_device_properties(0).total_memory < 64 * 1000 * 1000 * 1000,
    #     reason="Needs 64GB+ GPU memory.",
    # )
    def test_large_no_exception(B, T, V):
        # The large inputs were hitting cuda illegal memory access because of
        # triton-lang/triton#1058
>       _full_pass_once(B, T, V)

test/transformers/test_cross_entropy.py:401:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

B = 8, T = 16384, V = 128256

    def _full_pass_once(B, T, V):
        torch.manual_seed(0)
        liger_ce = LigerCrossEntropyLoss()

>       _input = torch.randn(
            B * T, V, requires_grad=True, device="cuda", dtype=torch.bfloat16
        )
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10.00 GiB of which 8.84 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

test/transformers/test_cross_entropy.py:374: OutOfMemoryError
========================================== short test summary info ===========================================
FAILED test/transformers/test_cross_entropy.py::test_large_no_exception[8-16384-128256] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10...
================================== 1 failed, 93 passed in 130.88s (0:02:10) ==================================
```
```bash
❯ make test
python -m pytest --disable-warnings test/ --ignore=test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 256 items

test/transformers/test_auto_model.py .                                                                 [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssss............ssssssssssssssssssssssssss [ 24%]
ssssssssssssssssssssssssssssssss                                                                       [ 37%]
test/transformers/test_embedding.py ...........                                                        [ 41%]
test/transformers/test_fused_linear_cross_entropy.py ................                                  [ 47%]
test/transformers/test_geglu.py ............                                                           [ 52%]
test/transformers/test_layer_norm.py ................                                                  [ 58%]
test/transformers/test_monkey_patch.py .....                                                           [ 60%]
test/transformers/test_rms_norm.py ............................................................        [ 83%]
test/transformers/test_rope.py ..................                                                      [ 91%]
test/transformers/test_swiglu.py ....................                                                  [ 98%]
test/transformers/test_trainer_integration.py .                                                        [ 99%]
test/triton/test_triton_monkey_patch.py ..                                                             [100%]

================================ 174 passed, 82 skipped in 123.06s (0:02:03) =================================
```
```bash
❯ make checkstyle
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
Skipped 2 files
All done! ✨ 🍰 ✨
68 files left unchanged.
```
```bash
❯ make test-convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 30 items

test/convergence/test_mini_models.py ..............                                                    [ 46%]
test/convergence/test_mini_models_no_logits.py ................                                        [100%]

======================================= 30 passed in 223.18s (0:03:43) =======================================
```
## Summary
- Added Hugging Face training benchmarking script used for tech report
- Writes files to
`/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log`

## Testing Done
- Ran benchmarking script

- Hardware Type: A100
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Fix the `tool.setuptools.packages.find` field in pyproject.toml. Otherwise, in local build mode with `pip install .`, Python fails to locate liger_kernel.

Co-authored-by: Byron Hsu <[email protected]>
## Summary
This PR improves the performance of swiglu and geglu forward by
replacing `zeros_like` with `empty_like`. The difference is that
`empty_like` doesn't require a separate kernel launch.
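
The pattern, sketched (using a stand-in op instead of the actual Triton launch):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, device=device)

# zeros_like launches an extra fill kernel; empty_like only allocates.
# This is safe whenever the subsequent kernel writes every output element.
out = torch.empty_like(x)
torch.sigmoid(x, out=out)  # stand-in for the kernel that fills all of `out`
```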


## Testing Done
Testing is covered by existing `test_geglu.py` and `test_swiglu.py`.


- Hardware Type: A100-80G-PCIe
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
Add repr information to the LayerNorm and RMSNorm classes so that useful layer information is displayed when the model is printed. Other classes are not modified because they inherit from the related torch.nn classes, or contain torch.nn sub-modules.
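
A sketch of the mechanism (illustrative class, not the repo's exact code): `nn.Module` calls `extra_repr()` when the module is printed.

```python
import torch.nn as nn

class RMSNormSketch(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.hidden_size = hidden_size
        self.eps = eps

    def extra_repr(self) -> str:
        # Rendered inside the parentheses of print(model) output.
        return f"{self.hidden_size}, eps={self.eps}"
```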


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
In linkedin#218, I fixed the `tool.setuptools.packages.find` field but tested it only in editable mode with `pip install -e .`. However, in production mode with `pip install .`, only the env_report.py file is copied to the Python site-packages directory. To fix this, adding "liger_kernel.*" to the include list ensures that setuptools correctly includes all subpackages within liger_kernel.


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
## Summary
Implements a new script, `benchmark/benchmarks_visualizer.py`, that substitutes for the functionality provided by the current `benchmark/benchmarks_visualizer.ipynb`. Resolves linkedin#211.

## Details
```console
$ python3 benchmarks_visualizer.py --help
usage: benchmarks_visualizer.py [-h] --kernel-name KERNEL_NAME --metric-name METRIC_NAME --kernel-operation-mode KERNEL_OPERATION_MODE [--display] [--overwrite]

options:
  -h, --help            show this help message and exit
  --kernel-name KERNEL_NAME
                        Kernel name to benchmark
  --metric-name METRIC_NAME
                        Metric name to visualize (speed/memory)
  --kernel-operation-mode KERNEL_OPERATION_MODE
                        Kernel operation mode to visualize (forward/backward/full)
  --display             Display the visualization
  --overwrite           Overwrite existing visualization, if none exist this flag has no effect as one are always created
  ```

## Testing Done

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
Adds newly implemented kl divergence loss to readme. Closes linkedin#188
finally.

## Testing Done
No code changes

---------

Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
## Summary
Monkeypatch for the recently-published
[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
HF `transformers` modeling code:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py

Feature Request: linkedin#165

## Details
Qwen2-VL is available on `transformers` main but has not yet been published in a release.
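
Assuming the patch follows the repo's existing `apply_liger_kernel_to_*` convention, usage would look like this sketch (the function name is an assumption):

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl  # name assumed
from transformers import Qwen2VLForConditionalGeneration

# Patch the HF modeling code in place before instantiating the model.
apply_liger_kernel_to_qwen2_vl()
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```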

## Testing Done
- Hardware Type: 4090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
linkedin#237)

## Summary

Add some easy checks for `weight.requires_grad` to skip allocating and calculating weight gradients if they're not needed. The weight gradient matrix can be pretty large, so this can also be a significant memory saving.

Also, a small micro-optimization: skip the `.item()` call on `total_n_non_ignore` (the subsequent calculations work fine with the tensor form) to defer CUDA synchronization (otherwise it would wait for all the `torch.zeros` initializations on the preceding lines to synchronize, which may take a non-trivial amount of time).
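
A sketch of both ideas (illustrative helper, not the kernel's actual code):

```python
import torch

def sketch(_input, weight, target, ignore_index=-100):
    # Only allocate the (potentially large) weight-gradient buffer when
    # it will actually be used.
    grad_weight = torch.zeros_like(weight) if weight.requires_grad else None

    # Keep the count as a tensor: calling .item() here would force a CUDA
    # sync right after the allocations above; downstream arithmetic works
    # fine with the tensor form.
    total_n_non_ignore = (target != ignore_index).sum()
    return grad_weight, total_n_non_ignore
```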

## Testing Done

The existing unit test already has a case where the weight does not have
gradients enabled, and it still passes forwards/backwards:
https://github.com/linkedin/Liger-Kernel/blob/main/test/transformers/test_fused_linear_cross_entropy.py#L165

And the preceding test verifies the 'normal' case where the weight
gradients are needed.

- Hardware Type: A100 80G
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence