
add jamba convergence test #1

Open
wants to merge 89 commits into main

Conversation


@yubofredwang yubofredwang commented Sep 5, 2024

Summary

Add convergence test for jamba model monkey patching

Testing Done

  • Hardware Type: A100-80G-PCIe
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

helloworld1 and others added 30 commits August 28, 2024 10:26
## Summary
Make GPU CI optional until it is more stable


## Testing Done
Testing CI


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary

Add gemma lightning example for single L40 GPU


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
## Summary
Aims to fix linkedin#89.

## Details
Casts to float32 at the correct places to match the Gemma and Llama references, in both the forward and backward passes. Also tightened the tolerances in the RMSNorm tests and added fp16 tests.
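
For intuition, a minimal sketch of the upcast pattern in a Llama-style reference RMSNorm (illustrative only, not the Triton kernel):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Upcast to float32 before computing the variance, matching the
    # Llama/Gemma references, then cast back to the input dtype at the end.
    input_dtype = x.dtype
    x = x.to(torch.float32)
    variance = x.pow(2).mean(-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return weight * x.to(input_dtype)
```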

## Testing Done
Ran tests for convergence and RMSNorm.
```
test/convergence/test_mini_models.py ........                            [100%]

========================= 8 passed in 78.70s (0:01:18) =========================

test/transformers/test_rms_norm.py ................................................                                                                                                                                                                            [100%]

========================================================================================================================= 48 passed in 4.62s =========================================================================================================================

```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Yun Dai <[email protected]>
## Summary
Adds an optional bias param for fused linear cross entropy!
Added bias = {True, False} to the testing space.
Also changed weight/bias generation in tests to uniform rand instead of
normal (seems more stable for low-precision bfloat16).
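
For reference, the semantics being fused, with the new optional bias (a hypothetical reference helper, not the kernel itself):

```python
import torch.nn.functional as F

def flce_reference(x, weight, target, bias=None):
    # Linear projection (bias, when provided, is simply added to the
    # logits) followed by cross entropy over the vocabulary.
    logits = F.linear(x, weight, bias)
    return F.cross_entropy(logits, target)
```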

## Testing Done
Tests for convergence + tests for fused linear cross entropy.

**Results**
```
test/transformers/test_fused_linear_cross_entropy.py ............                                                                                                                                                                                              [100%]

======================================================================================================================== 12 passed in 31.61s =========================================================================================================================
```

- Hardware Type: NVIDIA L4
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
## Summary
as title 

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
updated seqlen for rope to be non-constexpr

## Details
Changed the sequence-length argument `sl` from a `tl.constexpr` kernel argument to a regular runtime argument, so Triton does not recompile the kernel for every distinct sequence length.
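
A minimal sketch of the difference in a Triton kernel signature (hypothetical kernel, not the actual rope kernel):

```python
import triton
import triton.language as tl

@triton.jit
def sketch_kernel(
    x_ptr,
    sl,  # now a runtime argument: no recompilation per sequence length
    BLOCK_SIZE: tl.constexpr,  # still constexpr: needed for tl.arange shapes
):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < sl  # a runtime value works fine in a bounds check
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x, mask=mask)
```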

## Testing Done


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
1. bf16 loss rtol was a bit too loose; tighten it by one digit.
2. Slightly loosen the gemma1 atol; it has been failing.
3. Older `transformers` versions don't carry the phi3 source code (tested on 4.40.1); since we claim support for >= 4.40.1, adjust the import a bit so things still work on older HF versions.
4. Rerun all benchmarks to reflect the latest performance, in preparation for the new release.

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: Yun Dai <[email protected]>
## Summary
With Gemma2 support, the import fails on `transformers<4.42.0`.

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Add missing tf_keras to requirements.txt
## Details
Add the missing tf_keras to requirements.txt; otherwise `pip install -r requirements.txt` followed by running the script results in:

```
ValueError: Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
```

## Testing Done
Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: jaszhu <[email protected]>
## Summary
Turn on GPU CI enforcement


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
for release

## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
A new release with features should bump the minor version.

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
…flags are mutually exclusive (linkedin#168)

## Summary
Bug fix for gemma: the fused_linear_cross_entropy flag and the cross_entropy flag are mutually exclusive.

## Testing Done
Done; tested model training:

```
{'loss': 2.5808, 'grad_norm': 77.5, 'learning_rate': 3e-06, 'epoch': 0.0, 'num_input_tokens_seen': 18256}

  5%|▌         | 1/20 [00:05<01:45,  5.55s/it]
 10%|█         | 2/20 [00:07<01:04,  3.59s/it]
                                              
{'loss': 2.652, 'grad_norm': 80.0, 'learning_rate': 6e-06, 'epoch': 0.0, 'num_input_tokens_seen': 33376, 'step': 2, 'step_time_sec': 2.18, 'avg_step_time_sec': 2.18, 'time_to_completion_sec': 39.29, 'estimated_total_time_sec': 43.66, 'step_peak_memory_allocated_MB': 21965.16, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21965.16, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6926.16, 'avg_tokens_per_second': 6926.16}

 10%|█         | 2/20 [00:07<01:04,  3.59s/it]
 15%|█▌        | 3/20 [00:09<00:49,  2.93s/it]
                                              
{'loss': 2.1275, 'grad_norm': 46.0, 'learning_rate': 5.954423259036625e-06, 'epoch': 0.0, 'num_input_tokens_seen': 47504, 'step': 3, 'step_time_sec': 2.08, 'avg_step_time_sec': 2.13, 'time_to_completion_sec': 36.2, 'estimated_total_time_sec': 42.59, 'step_peak_memory_allocated_MB': 21998.28, 'step_peak_memory_reserved_MB': 34126.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 34126.0, 'step_tokens_per_second': 6804.83, 'avg_tokens_per_second': 6867.02}

 15%|█▌        | 3/20 [00:09<00:49,  2.93s/it]
 20%|██        | 4/20 [00:12<00:47,  2.94s/it]
                                              
{'loss': 1.7238, 'grad_norm': 15.75, 'learning_rate': 5.819077862357725e-06, 'epoch': 0.01, 'num_input_tokens_seen': 64176, 'step': 4, 'step_time_sec': 2.89, 'avg_step_time_sec': 2.38, 'time_to_completion_sec': 38.14, 'estimated_total_time_sec': 47.68, 'step_peak_memory_allocated_MB': 21734.73, 'step_peak_memory_reserved_MB': 35628.0, 'total_peak_memory_allocated_MB': 21998.28, 'total_peak_memory_reserved_MB': 35628.0, 'step_tokens_per_second': 5763.26, 'avg_tokens_per_second': 6420.58}

```

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

Co-authored-by: jaszhu <[email protected]>
## Summary
Add gemma 7b it benchmark
## Testing Done
N/A

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: jaszhu <[email protected]>
## Summary
to catch linkedin#168

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary

Closes linkedin#87

Skipped tests for `bfloat16` on GPUs with compute capability below
Ampere architecture (`sm_80`).
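
A minimal sketch of the gating pattern (illustrative; the helper name is an assumption, not necessarily the repo's):

```python
import pytest
import torch

def supports_bfloat16() -> bool:
    # bfloat16 compute requires Ampere (sm_80, compute capability 8.0) or newer.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8

@pytest.mark.skipif(
    not supports_bfloat16(),
    reason="bfloat16 requires compute capability >= 8.0 (Ampere)",
)
def test_some_bf16_case():
    ...
```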


## Testing Done


- Hardware Type: NVIDIA **T4** (should skip most cases)
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

```
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss                                                   [ 36%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py .....s.....s....                                                                                    [ 43%]
test/transformers/test_geglu.py .s....ssss                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py ........ssssssss...............ssssssss........                                                                  [ 80%]
test/transformers/test_rope.py ......ssssss                                                                                                          [ 88%]
test/transformers/test_swiglu.py ....ssss.s....ssss                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 71 passed, 92 skipped in 136.69s (0:02:16) ========================================================
.s.s.s                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py .s.s.s.s.s.s.s                                                                                        [100%]

======================================================== 14 passed, 14 skipped in 353.27s (0:05:53) ========================================================
```

- Hardware Type: NVIDIA **L4** (should skip a few cases)
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

```
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py ........................................................ss                                                   [ 36%]
test/transformers/test_fused_linear_cross_entropy.py ...............                                                                                    [ 43%]
test/transformers/test_geglu.py .........                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py .................................................                                                                  [ 80%]
test/transformers/test_rope.py ............                                                                                                          [ 88%]
test/transformers/test_swiglu.py ..................                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 161 passed, 2 skipped in 90.45s (0:01:30) =========================================================
.......                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py ..............                                                                                        [100%]

============================================================== 28 passed in 290.65s (0:04:50) ==============================================================
```

## Additional Context
For your reference, here's a list of NVIDIA architecture names and their compute capabilities:

<img width="1268" alt="Screenshot 2024-08-29 at 6 04 56 PM"
src="https://github.com/user-attachments/assets/6675ae9e-9137-4adb-8af7-ee1226733353">

---------

Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
Integrated custom LayerNorm kernels + a LigerLayerNorm module.
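
A sketch of the intended drop-in usage (the import path and constructor signature here are assumptions):

```python
import torch
from liger_kernel.transformers import LigerLayerNorm  # path assumed

norm = LigerLayerNorm(4096).cuda()  # same semantics as torch.nn.LayerNorm(4096)
x = torch.randn(8, 128, 4096, device="cuda")
y = norm(x)
```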


## Testing Done
tested layernorm kernels for correctness


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
…inkedin#170)

## Summary

Fixes example in README to make it functional

## Testing Done

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
- Updated the banner animated code snippet to use the
AutoLigerKernelForCausalLM
- Added wave snippet to acknowledgements

## Testing Done
- Preview markdown

- Hardware Type: N/A
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
Added LayerNorm description to README


## Testing Done
N/A


- Hardware Type: RTX 3090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
- Removed torch.compile from the benchmark scripts for consistency and re-ran the benchmarks on A100

## Testing Done
Ran benchmark

- Hardware Type: A100
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Write down some learnings from this release :D

## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
…rad context for easy reuse (linkedin#178)

## Summary
Extract the forward/backward core computation bits outside of the torch autograd context for easy reuse. This is beneficial for Lightning Thunder integration and for reusing the kernels in other contexts.

Double-checked the speed and memory usage: within variance range, no degradation.
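
A sketch of the refactor pattern, using plain RMSNorm math as a stand-in for the Triton kernels (illustrative names, not the repo's exact functions):

```python
import torch

# Core computation as plain functions, callable outside autograd
# (e.g. from a Lightning Thunder executor).
def rms_norm_forward(x, weight, eps):
    rstd = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * rstd * weight, rstd

def rms_norm_backward(dy, x, weight, rstd):
    dw = (dy * x * rstd).sum(dim=tuple(range(x.dim() - 1)))
    dx = dy * weight * rstd - x * rstd.pow(3) * (dy * weight * x).mean(-1, keepdim=True)
    return dx, dw

# Thin autograd wrapper that just delegates to the functions above.
class RMSNormFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, eps):
        y, rstd = rms_norm_forward(x, weight, eps)
        ctx.save_for_backward(x, weight, rstd)
        return y

    @staticmethod
    def backward(ctx, dy):
        x, weight, rstd = ctx.saved_tensors
        dx, dw = rms_norm_backward(dy, x, weight, rstd)
        return dx, dw, None
```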


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
- Added Embedding forward/backward kernels + a LigerEmbedding class which maps to nn.Embedding (usage sketch below)
- nn.Embedding is useful for encoder-only models such as BERT
- ref: linkedin#131
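
A sketch of the intended drop-in usage (the import path here is an assumption):

```python
import torch
from liger_kernel.transformers.experimental.embedding import LigerEmbedding  # path assumed

emb = LigerEmbedding(num_embeddings=30522, embedding_dim=768).cuda()
ids = torch.randint(0, 30522, (8, 128), device="cuda")
out = emb(ids)  # same shape and semantics as torch.nn.Embedding
```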


## Testing Done
- tested against nn.Embedding for correctness on various inputs
- tested with and without padding_idx


- Hardware Type: RTX 3090 + RTX 4090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
yubofredwang and others added 30 commits September 6, 2024 18:39
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
## Summary
Reference Unsloth in header section


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Aim to solve linkedin#81.

## Details

### For loss:
Label smoothing regularization (LSR) replaces the label distribution $q(k) = \delta_{k,y}$ with
```math
q'(k) = (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}
```
Cross entropy with LSR is

```math
\begin{align}
L' = H(q', p) &= -\sum^K_{k=1} \log p(k)\, q'(k) = -\sum^K_{k=1} \log p(k) \left( (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K} \right)\\
              &= -(1 - \epsilon)\sum^K_{k=1} \log p(k)\, q(k) - \frac{\epsilon}{K}\sum^K_{k=1} \log p(k)\\
              &= (1 - \epsilon)H(q, p) - \frac{\epsilon}{K} \sum^K_{k=1} \log\, \mathrm{softmax}(x_k)\\
              &= (1 - \epsilon)L + \frac{\epsilon}{K}\, \mathrm{SmoothLoss},
\end{align}
```
where $L = H(q,p)$ is the original loss and $\mathrm{SmoothLoss} = -\sum^K_{k=1} \log\, \mathrm{softmax}(x_k)$ is the smooth loss.

### For gradients:
The original:
```math
\begin{align}
\frac{\partial L}{\partial x_i} &= p(i) - q(i)\\
                                &= \begin{cases}
                                       \mathrm{softmax}(x_i), & i \neq y \\
                                       \mathrm{softmax}(x_i) - 1, & i = y
                                   \end{cases}
\end{align}
```
With LSR:
```math
\begin{align}
\frac{\partial L'}{\partial x_i} &= p(i) - q'(i)\\
                                 &= \mathrm{softmax}(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K}\\
                                 &= \begin{cases}
                                        \mathrm{softmax}(x_i) - \frac{\epsilon}{K}, & i \neq y \\
                                        \mathrm{softmax}(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y
                                    \end{cases}
\end{align}
```

We can handle the $i = y$ case by simply adding $-(1-\epsilon)$ after
computing all $i$.
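
A minimal PyTorch sketch of these reference semantics, for intuition (the actual implementation is a Triton kernel):

```python
import torch

def label_smoothed_ce(logits, target, epsilon):
    # L' = (1 - eps) * L + (eps / K) * SmoothLoss, where
    # SmoothLoss = -sum_k log softmax(x)_k
    K = logits.size(-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    ce = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.sum(dim=-1)
    return ((1 - epsilon) * ce + (epsilon / K) * smooth).mean()
```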


Reference:
[Rethinking the Inception Architecture for Computer
Vision](https://arxiv.org/abs/1512.00567)

## Testing Done
Add a unit test for label smoothing.

- Hardware Type: RTX-3080
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
```bash
❯ python3 -m pytest test/transformers/test_cross_entropy.py
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 94 items

test/transformers/test_cross_entropy.py .............................................................. [ 65%]
...............................F                                                                       [100%]

================================================== FAILURES ==================================================
__________________________________ test_large_no_exception[8-16384-128256] ___________________________________

B = 8, T = 16384, V = 128256

    @pytest.mark.parametrize(
        "B, T, V",
        [
            (
                8,
                8192,
                128256,
            ),  # _input = 16GB, total = ~32GB, 8405385216 > 2,147,483,647, so we need int64
            (8, 16384, 128256),  # _input = 32GB, total = ~64GB
        ],
    )
    # @pytest.mark.skipif(
    #     torch.cuda.get_device_properties(0).total_memory < 64 * 1000 * 1000 * 1000,
    #     reason="Needs 64GB+ GPU memory.",
    # )
    def test_large_no_exception(B, T, V):
        # The large inputs were hitting cuda illegal memory access because of
        # triton-lang/triton#1058
>       _full_pass_once(B, T, V)

test/transformers/test_cross_entropy.py:401:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

B = 8, T = 16384, V = 128256

    def _full_pass_once(B, T, V):
        torch.manual_seed(0)
        liger_ce = LigerCrossEntropyLoss()

>       _input = torch.randn(
            B * T, V, requires_grad=True, device="cuda", dtype=torch.bfloat16
        )
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10.00 GiB of which 8.84 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

test/transformers/test_cross_entropy.py:374: OutOfMemoryError
========================================== short test summary info ===========================================
FAILED test/transformers/test_cross_entropy.py::test_large_no_exception[8-16384-128256] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10...
================================== 1 failed, 93 passed in 130.88s (0:02:10) ==================================
```
```bash
❯ make test
python -m pytest --disable-warnings test/ --ignore=test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 256 items

test/transformers/test_auto_model.py .                                                                 [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssss............ssssssssssssssssssssssssss [ 24%]
ssssssssssssssssssssssssssssssss                                                                       [ 37%]
test/transformers/test_embedding.py ...........                                                        [ 41%]
test/transformers/test_fused_linear_cross_entropy.py ................                                  [ 47%]
test/transformers/test_geglu.py ............                                                           [ 52%]
test/transformers/test_layer_norm.py ................                                                  [ 58%]
test/transformers/test_monkey_patch.py .....                                                           [ 60%]
test/transformers/test_rms_norm.py ............................................................        [ 83%]
test/transformers/test_rope.py ..................                                                      [ 91%]
test/transformers/test_swiglu.py ....................                                                  [ 98%]
test/transformers/test_trainer_integration.py .                                                        [ 99%]
test/triton/test_triton_monkey_patch.py ..                                                             [100%]

================================ 174 passed, 82 skipped in 123.06s (0:02:03) =================================
```
```bash
❯ make checkstyle
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
Skipped 2 files
All done! ✨ 🍰 ✨
68 files left unchanged.
```
```bash
❯ make test-convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
============================================ test session starts =============================================
platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/tcc/Liger-Kernel
collected 30 items

test/convergence/test_mini_models.py ..............                                                    [ 46%]
test/convergence/test_mini_models_no_logits.py ................                                        [100%]

======================================= 30 passed in 223.18s (0:03:43) =======================================
```
## Summary
- Added Hugging Face training benchmarking script used for tech report
- Writes files to
`/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log`

## Testing Done
- Ran benchmarking script

- Hardware Type: A100
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary
Fix the `tool.setuptools.packages.find` field in pyproject.toml. Otherwise, in local build mode with `pip install .`, Python fails to locate liger_kernel.

Co-authored-by: Byron Hsu <[email protected]>
## Summary
This PR improves the performance of swiglu and geglu forward by
replacing `zeros_like` with `empty_like`. The difference is that
`empty_like` doesn't require a separate kernel launch.
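
The pattern, sketched (using a stand-in op instead of the actual Triton launch):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, device=device)

# zeros_like launches an extra fill kernel; empty_like only allocates.
# This is safe whenever the subsequent kernel writes every output element.
out = torch.empty_like(x)
torch.sigmoid(x, out=out)  # stand-in for the kernel that fills all of `out`
```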


## Testing Done
Testing is covered by existing `test_geglu.py` and `test_swiglu.py`.


- Hardware Type: A100-80G-PCIe
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
Add repr information to the LayerNorm and RMSNorm classes so that useful layer information is displayed when the model is printed. Other classes are not modified because they inherit from the related torch.nn classes, or contain torch.nn sub-modules.
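
A sketch of the mechanism (illustrative class, not the repo's exact code): `nn.Module` calls `extra_repr()` when the module is printed.

```python
import torch.nn as nn

class RMSNormSketch(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.hidden_size = hidden_size
        self.eps = eps

    def extra_repr(self) -> str:
        # Rendered inside the parentheses of print(model) output.
        return f"{self.hidden_size}, eps={self.eps}"
```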


## Testing Done


- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Shao Tang <[email protected]>
## Summary
In linkedin#218, I fixed the `tool.setuptools.packages.find` field but tested it only in editable mode with `pip install -e .`. However, in production mode with `pip install .`, only the env_report.py file is copied to the Python site-packages directory. To fix this, adding "liger_kernel.*" to the include list ensures that setuptools correctly includes all subpackages within liger_kernel.


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Byron Hsu <[email protected]>
## Summary
Implements a new script, `benchmark/benchmarks_visualizer.py`, that substitutes for the functionality provided by the current `benchmark/benchmarks_visualizer.ipynb`. Resolves linkedin#211.

## Details
```console
$ python3 benchmarks_visualizer.py --help
usage: benchmarks_visualizer.py [-h] --kernel-name KERNEL_NAME --metric-name METRIC_NAME --kernel-operation-mode KERNEL_OPERATION_MODE [--display] [--overwrite]

options:
  -h, --help            show this help message and exit
  --kernel-name KERNEL_NAME
                        Kernel name to benchmark
  --metric-name METRIC_NAME
                        Metric name to visualize (speed/memory)
  --kernel-operation-mode KERNEL_OPERATION_MODE
                        Kernel operation mode to visualize (forward/backward/full)
  --display             Display the visualization
  --overwrite           Overwrite existing visualization, if none exist this flag has no effect as one are always created
  ```

## Testing Done

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
## Summary
Adds newly implemented kl divergence loss to readme. Closes linkedin#188
finally.

## Testing Done
No code changes

---------

Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
## Summary
Monkeypatch for the recently-published
[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
HF `transformers` modeling code:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py

Feature Request: linkedin#165

## Details
Qwen2-VL is available on `transformers` main but has not yet been published in a release.
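
Assuming the patch follows the repo's existing `apply_liger_kernel_to_*` convention, usage would look like this sketch (the function name is an assumption):

```python
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl  # name assumed
from transformers import Qwen2VLForConditionalGeneration

# Patch the HF modeling code in place before instantiating the model.
apply_liger_kernel_to_qwen2_vl()
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```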

## Testing Done
- Hardware Type: 4090
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>
linkedin#237)

## Summary

Add some easy checks for `weight.requires_grad` to skip allocating and calculating weight gradients if they're not needed. The weight gradient matrix can be pretty large, so this can also be a significant memory saving.

Also, a small micro-optimization: skip the `.item()` call on `total_n_non_ignore` (the subsequent calculations work fine with the tensor form) to defer CUDA synchronization (otherwise it would wait for all the `torch.zeros` initializations on the preceding lines to synchronize, which may take a non-trivial amount of time).
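
A sketch of both ideas (illustrative helper, not the kernel's actual code):

```python
import torch

def sketch(_input, weight, target, ignore_index=-100):
    # Only allocate the (potentially large) weight-gradient buffer when
    # it will actually be used.
    grad_weight = torch.zeros_like(weight) if weight.requires_grad else None

    # Keep the count as a tensor: calling .item() here would force a CUDA
    # sync right after the allocations above; downstream arithmetic works
    # fine with the tensor form.
    total_n_non_ignore = (target != ignore_index).sum()
    return grad_weight, total_n_non_ignore
```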

## Testing Done

The existing unit test already has a case where the weight does not have
gradients enabled, and it still passes forwards/backwards:
https://github.com/linkedin/Liger-Kernel/blob/main/test/transformers/test_fused_linear_cross_entropy.py#L165

And the preceding test verifies the 'normal' case where the weight
gradients are needed.

- Hardware Type: A100 80G
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
## Summary


## Testing Done


- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence