forked from linkedin/Liger-Kernel
add jamba convergence test #1
Open
yubofredwang
wants to merge 89 commits into winglian:main from yubofredwang:jamba-test
Conversation
## Summary Make GPU CI optional until it is more stable <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done Testing CI <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> Add gemma lightning example for single L40 GPU <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence --------- Co-authored-by: Shao Tang <[email protected]> Co-authored-by: Byron Hsu <[email protected]>
## Summary

Aims to fix linkedin#89.

## Details

Performs the casts to float32 in the correct places to match the Gemma and Llama references, in both the forward and backward passes. Also modified the tests for RMSNorm with tighter tolerances + fp16 tests.

## Testing Done

Ran tests for convergence and RMSNorm.

```
test/convergence/test_mini_models.py ........                    [100%]
========================= 8 passed in 78.70s (0:01:18) =========================
test/transformers/test_rms_norm.py ................................................ [100%]
========================= 48 passed in 4.62s =========================
```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Yun Dai <[email protected]>
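The reference behavior this fix matches can be sketched in plain NumPy (illustrative only; `rms_norm_ref`, the `offset` parameter, and the default `eps` are assumptions for the sketch, not the kernel code):

```python
import numpy as np

def rms_norm_ref(x, weight, eps=1e-6, offset=1.0):
    # Upcast to float32 before computing the variance, as the HF Gemma
    # reference does; this is where precision was previously lost.
    dtype = x.dtype
    x32 = x.astype(np.float32)
    variance = np.mean(x32 * x32, axis=-1, keepdims=True)
    hidden = x32 / np.sqrt(variance + eps)
    # Gemma scales by (offset + weight); Llama uses the weight directly
    # (offset = 0). The multiply also happens in float32, then downcasts.
    out = (offset + weight.astype(np.float32)) * hidden
    return out.astype(dtype)
```

Keeping the whole normalization in float32 and downcasting only at the end is what allows the tighter test tolerances mentioned above.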
## Summary

Adds an optional bias param for fused linear cross entropy! Added bias = {True, False} to the test parameter space. Also changed weight/bias generation in tests to uniform rand instead of normal (seems more stable for low-precision bfloat16).

## Testing Done

Tests for convergence + tests for fused linear cross entropy.

**Results**

```
test/transformers/test_fused_linear_cross_entropy.py ............ [100%]
========================= 12 passed in 31.61s =========================
```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
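An unfused NumPy reference for the quantity the kernel computes with the new optional bias (function name and signature here are illustrative, not the Liger API; the fused kernel produces the same value chunk-by-chunk without materializing the full logits):

```python
import numpy as np

def linear_cross_entropy(x, weight, target, bias=None):
    # Projection: (N, H) @ (H, V) -> (N, V) logits.
    logits = x @ weight.T
    if bias is not None:  # the new optional bias term
        logits = logits + bias
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the target tokens.
    return -log_probs[np.arange(len(target)), target].mean()
```

With `bias=None` this reduces to the previous behavior, so the flag can be toggled in the test parameter space without changing existing expectations.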
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> as title ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> updated seqlen for rope to be non-constexpr <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> sl from constexpr to non-constexpr ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: RTX 3090 - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> 1. bf16 loss rtol is a bit too loose, tighten it by 1 digit 2. slightly loosen gemma1 atol, it's been failing 3. old `transformers` version doesn't carry phi3 source code (testing on 4.40.1), since we claim support for >= 4.40.1, change the import a bit so things still work on older HF ver 4. rerun all benchmark to reflect latest performance in preparation for new release <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence Co-authored-by: Yun Dai <[email protected]>
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> With Gemma2 support, import will fail with `transformers<4.42.0` <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary

Add missing tf_keras to requirements.txt.

## Details

Add missing tf_keras to requirements.txt; otherwise, running `pip install -r requirements.txt` and then the script results in:

```
ValueError: Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
```

## Testing Done

Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: jaszhu <[email protected]>
## Summary Turn on GPU CI enforce <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> for release <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> New release with features should bump minor version tehe <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
…flag are mutually exclusive (linkedin#168)

## Summary

Bug fix for gemma: the fused_linear_cross_entropy flag and the cross_entropy flag are mutually exclusive.

## Details

Bug fix for gemma: the fused_linear_cross_entropy flag and the cross_entropy flag are mutually exclusive.

## Testing Done

Done, tested model training:

```
{'loss': 2.5808, 'grad_norm': 77.5, 'learning_rate': 3e-06, 'epoch': 0.0, 'num_input_tokens_seen': 18256}
  5%|▌ | 1/20 [00:05<01:45,  5.55s/it]
 10%|█ | 2/20 [00:07<01:04,  3.59s/it]
{'loss': 2.652, 'grad_norm': 80.0, 'learning_rate': 6e-06, 'epoch': 0.0, 'num_input_tokens_seen': 33376, 'step': 2, 'step_time_sec': 2.18, 'avg_step_time_sec': 2.18, 'time_to_completion_sec': 39.29, 'estimated_total_time_sec': 43.66, 'step_peak_memory_allocated_MB': 21965.16, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21965.16, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6926.16, 'avg_tokens_per_second': 6926.16}
 10%|█ | 2/20 [00:07<01:04,  3.59s/it]
 15%|█▌ | 3/20 [00:09<00:49,  2.93s/it]
{'loss': 2.1275, 'grad_norm': 46.0, 'learning_rate': 5.954423259036625e-06, 'epoch': 0.0, 'num_input_tokens_seen': 47504, 'step': 3, 'step_time_sec': 2.08, 'avg_step_time_sec': 2.13, 'time_to_completion_sec': 36.2, 'estimated_total_time_sec': 42.59, 'step_peak_memory_allocated_MB': 21998.28, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6804.83, 'avg_tokens_per_second': 6867.02}
 15%|█▌ | 3/20 [00:09<00:49,  2.93s/it]
 20%|██ | 4/20 [00:12<00:47,  2.94s/it]
{'loss': 1.7238, 'grad_norm': 15.75, 'learning_rate': 5.819077862357725e-06, 'epoch': 0.01, 'num_input_tokens_seen': 
64176, 'step': 4, 'step_time_sec': 2.89, 'avg_step_time_sec': 2.38, 'time_to_completion_sec': 38.14, 'estimated_total_time_sec': 47.68, 'step_peak_memory_allocated_MB': 21734.73, 'step_peak_memory_reserved_MB': 35628.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 35628.0, 'step_tokens_per_second': 5763.26, 'avg_tokens_per_second': 6420.58} ``` <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence Co-authored-by: jaszhu <[email protected]>
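The mutual-exclusion guard described above can be sketched as follows (the function name and the flag handling are assumptions for illustration, not the actual patch code):

```python
def patch_gemma_loss(cross_entropy=False, fused_linear_cross_entropy=True):
    """Sketch of the guard: the two flags select conflicting loss paths
    (materialized-logits cross entropy vs. the fused kernel), so enabling
    both must be rejected up front."""
    if cross_entropy and fused_linear_cross_entropy:
        raise ValueError(
            "cross_entropy and fused_linear_cross_entropy cannot both be True."
        )
    # ... monkey-patching of the chosen loss path would follow here ...
    if fused_linear_cross_entropy:
        return "fused"
    return "cross_entropy" if cross_entropy else "default"
```

Raising early keeps the two patches from silently overwriting each other, which was the failure mode the fix addresses.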
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> Add gemma 7b it benchmark <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> Add gemma 7b it benchmark ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> NA <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence --------- Co-authored-by: jaszhu <[email protected]>
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> to catch linkedin#168 <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary Closes linkedin#87 Skipped tests for `bfloat16` on GPUs with compute capability below Ampere architecture (`sm_80`). <!--- This is a required section; please describe the main purpose of this proposed code change. ---> <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: NVIDIA **T4** (should skip most cases) - [X] run `make test` to ensure correctness - [X] run `make checkstyle` to ensure code style - [X] run `make test-convergence` to ensure convergence ``` ⚡ main ~/Liger-Kernel make all python -m pytest --disable-warnings test/ --ignore=test/convergence HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence flake8 .; flake8_status=$?; \ isort .; isort_status=$?; \ black .; black_status=$?; \ if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \ exit 1; \ fi =================================================================== test session starts ==================================================================== platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0 rootdir: /teamspace/studios/this_studio/Liger-Kernel plugins: anyio-4.4.0 collecting ... =================================================================== test session starts ==================================================================== platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0 rootdir: /teamspace/studios/this_studio/Liger-Kernel plugins: anyio-4.4.0 collecting ... Skipped 1 files All done! ✨ 🍰 ✨ 58 files left unchanged. collected 163 items test/transformers/test_auto_model.py . 
[ 0%] test/transformers/test_cross_entropy.py ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 36%] collected 28 items test/convergence/test_mini_models.py .....s.....s.... [ 43%] test/transformers/test_geglu.py .s....ssss [ 48%] test/transformers/test_monkey_patch.py ..... [ 51%] test/transformers/test_rms_norm.py ........ssssssss...............ssssssss........ [ 80%] test/transformers/test_rope.py ......ssssss [ 88%] test/transformers/test_swiglu.py ....ssss.s....ssss [ 98%] test/transformers/test_trainer_integration.py . [ 98%] test/triton/test_triton_monkey_patch.py .. [100%] ======================================================== 71 passed, 92 skipped in 136.69s (0:02:16) ======================================================== .s.s.s [ 50%] test/convergence/test_mini_models_no_logits.py .s.s.s.s.s.s.s [100%] ======================================================== 14 passed, 14 skipped in 353.27s (0:05:53) ======================================================== ``` - Hardware Type: NVIDIA **L4** (should skip few cases) - [X] run `make test` to ensure correctness - [X] run `make checkstyle` to ensure code style - [X] run `make test-convergence` to ensure convergence ``` ⚡ main ~/Liger-Kernel make all python -m pytest --disable-warnings test/ --ignore=test/convergence HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence flake8 .; flake8_status=$?; \ isort .; isort_status=$?; \ black .; black_status=$?; \ if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \ exit 1; \ fi =================================================================== test session starts ==================================================================== platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0 rootdir: /teamspace/studios/this_studio/Liger-Kernel plugins: anyio-4.4.0 collecting ... 
=================================================================== test session starts ==================================================================== platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0 rootdir: /teamspace/studios/this_studio/Liger-Kernel plugins: anyio-4.4.0 collecting ... Skipped 1 files All done! ✨ 🍰 ✨ 58 files left unchanged. collected 163 items test/transformers/test_auto_model.py . [ 0%] collected 28 items test/convergence/test_mini_models.py ........................................................ss [ 36%] test/transformers/test_fused_linear_cross_entropy.py ............... [ 43%] test/transformers/test_geglu.py ......... [ 48%] test/transformers/test_monkey_patch.py ..... [ 51%] test/transformers/test_rms_norm.py ................................................. [ 80%] test/transformers/test_rope.py ............ [ 88%] test/transformers/test_swiglu.py .................. [ 98%] test/transformers/test_trainer_integration.py . [ 98%] test/triton/test_triton_monkey_patch.py .. [100%] ======================================================== 161 passed, 2 skipped in 90.45s (0:01:30) ========================================================= ....... [ 50%] test/convergence/test_mini_models_no_logits.py .............. [100%] ============================================================== 28 passed in 290.65s (0:04:50) ============================================================== ``` ## Additional Context FYR, here’s a list of NVIDIA architecture names, and which compute capabilities they have: <img width="1268" alt="Screenshot 2024-08-29 at 6 04 56 PM" src="https://github.com/user-attachments/assets/6675ae9e-9137-4adb-8af7-ee1226733353"> --------- Signed-off-by: Austin Liu <[email protected]> Co-authored-by: Shao Tang <[email protected]>
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> integrated layernorm custom kernels + LigerLayerNorm module <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> tested layernorm kernels for correctness <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: RTX 3090 - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence --------- Co-authored-by: Shao Tang <[email protected]>
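The correctness reference such a kernel is checked against can be sketched in NumPy (illustrative; this is the standard LayerNorm definition, not the Triton kernel itself):

```python
import numpy as np

def layer_norm_ref(x, weight, bias, eps=1e-5):
    # Normalize over the last dimension with per-row mean and variance,
    # then apply the learned affine transform.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return weight * x_hat + bias
```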
…inkedin#170) ## Summary Fixes example in README to make it functional ## Testing Done - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary - Updated the banner animated code snippet to use the AutoLigerKernelForCausalLM - Added wave snippet to acknowledgements ## Testing Done - Preview markdown - Hardware Type: N/A - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary

Added LayerNorm description to README

## Testing Done

N/A

- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary - Removing torch compile from benchmark scripts for consistency and re-run benchmarks on A100 ## Testing Done Ran benchmark - Hardware Type: A100 - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> conclude some learnings from this release :D <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence --------- Co-authored-by: Shao Tang <[email protected]>
…rad context for easy reuse (linkedin#178)

## Summary

Extract the forward/backward core computation bits outside of the torch autograd context for easy reuse. This is beneficial for Lightning Thunder integration and for reusing the kernels in other contexts. Double-checked the speed and memory usage: within variance range, no degradation.

## Testing Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary - Added Embedding forward/backwards kernels + LigerEmbedding class which maps to nn.Embedding - nn.Embedding is useful for encoder-only models such as BERT - ref: linkedin#131 <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> - tested against nn.Embedding for correctness on various inputs - tested with and without padding_idx <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: RTX 3090 + RTX 4090 - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence --------- Co-authored-by: Shao Tang <[email protected]>
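The forward/backward pair these kernels implement can be sketched in NumPy (illustrative reference, not the Triton code):

```python
import numpy as np

def embedding_forward(table, idx):
    # Forward is a gather: one row of the table per token id.
    return table[idx]

def embedding_backward(grad_out, idx, num_embeddings, padding_idx=None):
    # Backward scatter-adds each output gradient into the row of the
    # token that produced it; np.add.at accumulates repeated indices.
    grad_table = np.zeros((num_embeddings, grad_out.shape[-1]), dtype=grad_out.dtype)
    np.add.at(grad_table, idx, grad_out)
    if padding_idx is not None:
        # Matching nn.Embedding: the padding row receives no gradient.
        grad_table[padding_idx] = 0.0
    return grad_table
```

The scatter-add in the backward pass is the part that needs atomics in a parallel kernel, since multiple tokens can map to the same row.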
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [ ] run `make test` to ensure correctness - [ ] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
## Summary Reference Unsloth in header section <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: <BLANK> - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [x] run `make test-convergence` to ensure convergence
## Summary

Aim to solve linkedin#81.

## Details

### For loss:

Label smoothing regularization (LSR) replaces the label distribution $q(k) = \delta_{k,y}$ with

```math
q'(k) = (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}
```

Cross entropy with LSR is then

```math
\begin{align}
L' = H(q', p) &= -\sum^K_{k=1}\log{p(k)}\,q'(k) = -\sum^K_{k=1}\log{p(k)}\left((1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}\right)\\
&= -(1 - \epsilon)\sum^K_{k=1}\log{p(k)}\,q(k) - \frac{\epsilon}{K}\sum^K_{k=1}\log{p(k)} \\
&= (1 - \epsilon)H(q,p) - \frac{\epsilon}{K} \sum^K_{k=1} \log\,\mathrm{softmax}(x_k)\\
&= (1 - \epsilon)L + \frac{\epsilon}{K}\,\mathrm{SmoothLoss},
\end{align}
```

where $L = H(q,p)$ is the original loss and $\mathrm{SmoothLoss} = -\sum^K_{k=1} \log\,\mathrm{softmax}(x_k)$ is the smooth loss.

### For gradients:

The original:

```math
\begin{align}
\frac{\partial L}{\partial x_i} &= p(i) - q(i)\\
&= \begin{cases} \mathrm{softmax}(x_i), & i \neq y \\ \mathrm{softmax}(x_i) - 1, & i = y \end{cases}
\end{align}
```

With LSR:

```math
\begin{align}
\frac{\partial L'}{\partial x_i} &= p(i) - q'(i)\\
&= \mathrm{softmax}(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K}\\
&= \begin{cases} \mathrm{softmax}(x_i) - \frac{\epsilon}{K}, & i \neq y \\ \mathrm{softmax}(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y \end{cases}
\end{align}
```

We can handle the $i = y$ case by simply adding $-(1-\epsilon)$ after computing all $i$.

Reference: [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/abs/1512.00567)

## Testing Done

Add a unit test for label smoothing.
- Hardware Type: RTX-3080
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

```bash
❯ python3 -m pytest test/transformers/test_cross_entropy.py
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 94 items

test/transformers/test_cross_entropy.py ..............................................................  [ 65%]
...............................F                                                                        [100%]

================================================== FAILURES ==================================================
__________________________________ test_large_no_exception[8-16384-128256] ___________________________________

B = 8, T = 16384, V = 128256

    @pytest.mark.parametrize(
        "B, T, V",
        [
            (
                8,
                8192,
                128256,
            ),  # _input = 16GB, total = ~32GB, 8405385216 > 2,147,483,647, so we need int64
            (8, 16384, 128256),  # _input = 32GB, total = ~64GB
        ],
    )
    # @pytest.mark.skipif(
    #     torch.cuda.get_device_properties(0).total_memory < 64 * 1000 * 1000 * 1000,
    #     reason="Needs 64GB+ GPU memory.",
    # )
    def test_large_no_exception(B, T, V):
        # The large inputs were hitting cuda illegal memory access because of
        # triton-lang/triton#1058
>       _full_pass_once(B, T, V)

test/transformers/test_cross_entropy.py:401:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

B = 8, T = 16384, V = 128256

    def _full_pass_once(B, T, V):
        torch.manual_seed(0)

        liger_ce = LigerCrossEntropyLoss()

>       _input = torch.randn(
            B * T, V, requires_grad=True, device="cuda", dtype=torch.bfloat16
        )
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10.00 GiB of which 8.84 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

test/transformers/test_cross_entropy.py:374: OutOfMemoryError
========================================== short test summary info ===========================================
FAILED test/transformers/test_cross_entropy.py::test_large_no_exception[8-16384-128256] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10...
================================== 1 failed, 93 passed in 130.88s (0:02:10) ==================================
```

```bash
❯ make test
python -m pytest --disable-warnings test/ --ignore=test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 256 items

test/transformers/test_auto_model.py .                                                                  [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssss............ssssssssssssssssssssssssss  [ 24%]
ssssssssssssssssssssssssssssssss                                                                        [ 37%]
test/transformers/test_embedding.py ...........                                                         [ 41%]
test/transformers/test_fused_linear_cross_entropy.py ................                                   [ 47%]
test/transformers/test_geglu.py ............                                                            [ 52%]
test/transformers/test_layer_norm.py ................                                                   [ 58%]
test/transformers/test_monkey_patch.py .....                                                            [ 60%]
test/transformers/test_rms_norm.py ............................................................         [ 83%]
test/transformers/test_rope.py ..................                                                       [ 91%]
test/transformers/test_swiglu.py ....................                                                   [ 98%]
test/transformers/test_trainer_integration.py .                                                         [ 99%]
test/triton/test_triton_monkey_patch.py ..                                                              [100%]

================================ 174 passed, 82 skipped in 123.06s (0:02:03) =================================
```

```bash
❯ make checkstyle
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
  exit 1; \
fi
Skipped 2 files
All done! ✨ 🍰 ✨
68 files left unchanged.
```

```bash
❯ make test-convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 30 items

test/convergence/test_mini_models.py ..............                                                     [ 46%]
test/convergence/test_mini_models_no_logits.py ................                                         [100%]

======================================= 30 passed in 223.18s (0:03:43) =======================================
```
## Summary
- Added Hugging Face training benchmarking script used for tech report
- Writes files to `/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log`

## Testing Done
- Ran benchmarking script
- Hardware Type: A100
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Fix the `tool.setuptools.packages.find` field in pyproject.toml. Without this fix, a local build with `pip install .` fails to locate liger_kernel.

Co-authored-by: Byron Hsu <[email protected]>
## Summary
This PR improves the performance of the swiglu and geglu forward passes by replacing `zeros_like` with `empty_like`. Unlike `zeros_like`, `empty_like` does not zero-initialize the buffer, so it avoids a separate kernel launch; this is safe because every element of the output is subsequently overwritten by the kernel.

## Testing Done
Testing is covered by the existing `test_geglu.py` and `test_swiglu.py`.

- Hardware Type: A100-80G-PCIe
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
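The pattern can be sketched with a NumPy analogy (the `swiglu_forward` function below is illustrative, not the actual Triton kernel):

```python
import numpy as np

def silu(x):
    # silu(x) = x * sigmoid(x), written as x / (1 + e^-x)
    return x / (1.0 + np.exp(-x))

def swiglu_forward(a, b):
    # empty_like allocates without zero-filling the buffer;
    # in PyTorch this skips the fill kernel that zeros_like launches.
    # Safe only because every element of `out` is written below.
    out = np.empty_like(a)
    out[:] = silu(a) * b
    return out
```

The key invariant is that the kernel writes the full output, so the initial contents of the uninitialized buffer never leak through.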
## Summary
Add repr information for the layernorm and rmsnorm classes so that useful layer information is displayed when the model is printed. Other classes are not modified because they either inherit from the related torch.nn classes or contain torch.nn sub-modules.

## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
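For illustration, the `extra_repr` hook that `torch.nn.Module` uses to build a module's printed form can be sketched in plain Python (class name and fields here are illustrative, not the actual Liger-Kernel code):

```python
class LigerRMSNorm:
    """Plain-Python sketch; the real class subclasses torch.nn.Module,
    which assembles __repr__ from extra_repr() automatically."""

    def __init__(self, hidden_size, eps=1e-6):
        self.hidden_size = hidden_size
        self.eps = eps

    def extra_repr(self):
        # The string shown inside the parentheses when the module prints
        return f"{self.hidden_size}, eps={self.eps}"

    def __repr__(self):
        return f"{type(self).__name__}({self.extra_repr()})"
```

With this hook, `print(model)` shows the layer's dimensions and epsilon instead of an empty `LigerRMSNorm()`.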
## Summary
In linkedin#218, I fixed the `tool.setuptools.packages.find` field and tested it only in editable mode with `pip install -e .`. However, in production mode with `pip install .`, only the env_report.py file is copied to the Python site-packages directory. To fix this, add "liger_kernel.*" to the include list so that setuptools correctly includes all subpackages within liger_kernel.

## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
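A sketch of what the resulting pyproject.toml section looks like (the `where` path and the surrounding fields are assumptions; only the `"liger_kernel.*"` entry is what this change describes):

```toml
[tool.setuptools.packages.find]
where = ["src"]
include = ["liger_kernel", "liger_kernel.*"]
```

Without the `"liger_kernel.*"` wildcard, `packages.find` matches only the top-level package, so nested subpackages are silently omitted from a non-editable install.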
## Summary
Implements a new script, `benchmark/benchmarks_visualizer.py`, that substitutes the functionality provided by the current `benchmark/benchmarks_visualizer.ipynb`. Resolves linkedin#211.

## Details
```console
$ python3 benchmarks_visualizer.py --help
usage: benchmarks_visualizer.py [-h] --kernel-name KERNEL_NAME --metric-name METRIC_NAME --kernel-operation-mode KERNEL_OPERATION_MODE [--display] [--overwrite]

options:
  -h, --help            show this help message and exit
  --kernel-name KERNEL_NAME
                        Kernel name to benchmark
  --metric-name METRIC_NAME
                        Metric name to visualize (speed/memory)
  --kernel-operation-mode KERNEL_OPERATION_MODE
                        Kernel operation mode to visualize (forward/backward/full)
  --display             Display the visualization
  --overwrite           Overwrite existing visualization, if none exist this flag has no effect as one are always created
```

## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
Adds the newly implemented KL divergence loss to the README. Closes linkedin#188.

## Testing Done
No code changes.

---------

Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
## Summary
Monkeypatch for the recently published [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

HF `transformers` modeling code: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py

Feature Request: linkedin#165

## Details
Qwen2-VL is available on `transformers` main but has yet to be published in a release.

## Testing Done
- Hardware Type: 4090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
(linkedin#237)

## Summary
Add checks for `weight.requires_grad` to skip allocating and calculating weight gradients when they are not needed. The weight gradient matrix can be quite large, so this can also be a significant memory saving.

Also a small micro-optimization: skip the `.item()` call on `total_n_non_ignore` (the subsequent calculations work fine with the tensor form) to defer CUDA synchronization; otherwise it would wait for all the `torch.zeros` initializations on the preceding lines to complete, which may take a non-trivial amount of time.

## Testing Done
The existing unit test already has a case where the weight does not have gradients enabled, and it still passes forwards/backwards: https://github.com/linkedin/Liger-Kernel/blob/main/test/transformers/test_fused_linear_cross_entropy.py#L165

The preceding test verifies the 'normal' case where the weight gradients are needed.

- Hardware Type: A100 80G
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
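The control flow can be sketched with NumPy standing in for the GPU tensors (function and argument names are hypothetical, not the actual fused-kernel code):

```python
import numpy as np

def linear_backward(grad_logits, x, weight, weight_requires_grad):
    """Schematic backward for out = x @ weight.T.
    grad_logits: (N, V), x: (N, H), weight: (V, H)."""
    grad_x = grad_logits @ weight  # always needed for the input gradient
    grad_w = None
    if weight_requires_grad:
        # Only allocate and compute the potentially huge (V, H) buffer
        # when the weight actually needs a gradient.
        grad_w = grad_logits.T @ x
    return grad_x, grad_w
```

The `.item()` deferral is a separate point: keeping a count as a device tensor (instead of pulling it to the host early) avoids forcing a CUDA synchronization before the preceding asynchronous allocations have finished.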
## Summary

## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
Add convergence test for Jamba model monkey patching

## Testing Done
- run `make test` to ensure correctness
- run `make checkstyle` to ensure code style
- run `make test-convergence` to ensure convergence