Kyle primeintellect aisi #7122

Open · wants to merge 23 commits into base: master

Commits (23)
64d6c5a  Update engine.py (Quentin-Anthony, Mar 9, 2023)
0a23729  Update README.md (Quentin-Anthony, Mar 9, 2023)
5317ca6  Rebase DS Autotuning to DeeperSpeed 2.0 (#49) (Quentin-Anthony, May 2, 2023)
3362dad  Prevent zero optimizer loading when finetuning a model and using zero… (bentherien, May 18, 2023)
317355c  Make checkpoints compatible with NeoX (#47) (dashstander, May 18, 2023)
f587de1  Merge with latest deepspeed (Quentin-Anthony, May 18, 2023)
f2acc07  Add consecutive hysteresis support (#52) (Quentin-Anthony, May 18, 2023)
22fda1e  Add necessary comma (#53) (dashstander, May 19, 2023)
a48c649  bf16 load and finetune fix (#57) (kshitijkg, Nov 6, 2023)
b926043  BF16 optimizer sync parameters with module when load_module_only=True… (ZHAOTING, Nov 26, 2023)
6f42f87  Merge branch 'master' of https://github.com/microsoft/DeepSpeed into … (Quentin-Anthony, Nov 26, 2023)
9920f1b  Merge branch 'microsoft-master' into main (Quentin-Anthony, Nov 26, 2023)
a1af9e7  Update Slurm Launcher (#59) (haileyschoelkopf, Jan 19, 2024)
02e2ebf  feat: fix PipelineEngine init logging error (#60) (DayOfThePenguin, Mar 2, 2024)
6d097be  add _deepspeed_no_cast attribute (#61) (haileyschoelkopf, Mar 8, 2024)
5621cec  Merged upstream torch-compatibility changes in elasticity/elastic_age… (jacobthebanana, Mar 6, 2025)
91d1c55  merge in local branch for v3.0 release (#64) (Quentin-Anthony, Mar 6, 2025)
d0f75b3  Update launcher for prime intellect infra (Kyle1668, Mar 10, 2025)
5ab1fe4  Remove ssh check when starting (Kyle1668, Mar 10, 2025)
2d991ba  Set default deepspeed ssh to 2222 (Kyle1668, Mar 10, 2025)
06ae7e0  Comment out _setup_mpi_environment (Kyle1668, Mar 10, 2025)
ba81813  Fix concat with 2222 (Kyle1668, Mar 10, 2025)
f5368de  Remove openib (Kyle1668, Mar 10, 2025)
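
The six Kyle1668 commits (Mar 10, 2025) adapt the DeepSpeed launcher to Prime Intellect infrastructure, where worker nodes expose SSH on port 2222 rather than 22 and the OpenIB transport is unavailable. As a rough sketch of launching on such a cluster (this is not the code changed in this PR; the --ssh_port flag, the hostfile layout, and the train.py / ds_config.json names are assumptions based on the stock DeepSpeed runner):

  # hostfile lists one worker per line with its GPU slot count, e.g.
  #   node-1 slots=8
  #   node-2 slots=8
  deepspeed --hostfile=hostfile --ssh_port 2222 train.py --deepspeed --deepspeed_config ds_config.json

  # the same idea for a raw Open MPI launch, excluding the openib BTL
  mpirun -np 16 --hostfile hostfile --mca btl ^openib python train.py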
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/deepspeed_chat_bug_report.md
@@ -32,7 +32,7 @@ If applicable, add screenshots to help explain your problem.
 **System info (please complete the following information):**
 - OS: [e.g. Ubuntu 18.04]
 - GPU count and types [e.g. two machines with x8 A100s each]
-- (if applicable) what [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) version are you using
+- (if applicable) what [DeepSpeed-MII](https://github.com/deepspeedai/deepspeed-mii) version are you using
 - (if applicable) Hugging Face Transformers/Accelerate/etc. versions
 - Python version
 - Any other relevant info about your setup
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/inference_bug_report.md
@@ -29,7 +29,7 @@ If applicable, add screenshots to help explain your problem.
 **System info (please complete the following information):**
 - OS: [e.g. Ubuntu 18.04]
 - GPU count and types [e.g. two machines with x8 A100s each]
-- (if applicable) what [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) version are you using
+- (if applicable) what [DeepSpeed-MII](https://github.com/deepspeedai/deepspeed-mii) version are you using
 - (if applicable) Hugging Face Transformers/Accelerate/etc. versions
 - Python version
 - Any other relevant info about your setup
56 changes: 0 additions & 56 deletions .github/workflows/amd-mi100.yml

This file was deleted.

12 changes: 7 additions & 5 deletions .github/workflows/amd-mi200.yml
@@ -1,9 +1,13 @@
name: amd-mi200

on:
workflow_dispatch:
pull_request:
paths:
- '.github/workflows/amd-mi200.yml'
- 'requirements/**'
schedule:
- cron: "0 0 * * *"
workflow_dispatch:

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@@ -21,14 +25,14 @@ jobs:
# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- id: setup-venv
uses: ./.github/workflows/setup-venv

- name: Install pytorch
run: |
pip install -U --cache-dir $TORCH_CACHE torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6
pip install -U --cache-dir $TORCH_CACHE torch torchvision --index-url https://download.pytorch.org/whl/rocm6.0
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

@@ -44,8 +48,6 @@ jobs:
- name: Install (ROCm) apex
run: |
git clone https://github.com/ROCmSoftwarePlatform/apex.git
cd apex
git checkout torch_2.1_higher
CURRENT_VER=$(git rev-parse HEAD)
INSTALLED_VER=$(cat /blob/amd-apex/.venv_installed_version)
if [[ "$CURRENT_VER" != "$INSTALLED_VER" ]]; then
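
The amd-mi200 job above switches to actions/checkout@v4, adds pull_request path triggers, and moves PyTorch to the ROCm 6.0 wheel index. To check that a local ROCm environment matches what the workflow now installs, the same commands can be run outside CI (the $TORCH_CACHE cache directory is CI-specific and can be omitted):

  pip install -U torch torchvision --index-url https://download.pytorch.org/whl/rocm6.0
  python -c "import torch; print('torch:', torch.__version__, torch)"
  python -c "import torch; print('CUDA available:', torch.cuda.is_available())"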
59 changes: 0 additions & 59 deletions .github/workflows/auto-sync.yml

This file was deleted.

68 changes: 48 additions & 20 deletions .github/workflows/cpu-inference.yml
@@ -2,52 +2,73 @@ name: cpu-inference

on:
workflow_dispatch:
pull_request:
paths:
- '.github/workflows/cpu-inference.yml'
- 'requirements/**'
- 'deepspeed/__init__.py'
- 'deepspeed/inference/**'
- '!deepspeed/inference/v2/**' # exclude v2 dir
- 'tests/unit/inference/**'
- '!tests/unit/inference/v2/**' # exclude v2 tests dir
merge_group:
branches: [ master ]
schedule:
- cron: "0 0 * * 0"

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
unit-tests:
runs-on: ubuntu-20.04
runs-on: [self-hosted, cpu]

env: {ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true} # Allow using Node16 actions

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- id: setup-venv
uses: ./.github/workflows/setup-venv

- name: Install gcc-9
run: |
sudo add-apt-repository -u ppa:ubuntu-toolchain-r/test
sudo apt install -y gcc-9 g++-9
# set gcc-9 and g++9 to default
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 99
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 99

- name: Check gcc version
run: |
# Get gcc version
gcc --version
g++ --version

- name: Detect instruction sets on instance
run: |
lscpu
pip install cmake
git clone https://github.com/intel/intel-extension-for-pytorch
cd intel-extension-for-pytorch/tests/cpu/isa
cmake .
make
./cpu_features

- name: Install numactl
run: |
sudo apt-get install -y numactl

- name: Install oneCCL Bindings for PyTorch
- name: Install dependencies
run: |
python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
pip install torch
# check installed version
pip list |grep \\\<torch\\\>

- name: Install oneCCL
run: |
pip install cmake
git clone https://github.com/oneapi-src/oneCCL
cd oneCCL
mkdir build
cd build
cmake ..
make
make install
#source ./_install/env/setvars.sh
# test whether oneCCL is correctly installed
#mpirun -n 2 ./examples/benchmark/benchmark
make -j install

- name: Install transformers
run: |
@@ -62,14 +83,21 @@ jobs:
pip install .[dev,1bit,autotuning,inf]
ds_report

- name: Python environment
- name: Python environment check
run: |
pip list
source oneCCL/build/_install/env/setvars.sh
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
# check whether the environment is properly setup
python -c "import deepspeed;from deepspeed.accelerator import get_accelerator;print(get_accelerator().device_name());print(get_accelerator().is_available())"

- name: Unit tests
run: |
# prep oneCCL for CCLBackend comm ops building
source oneCCL/build/_install/env/setvars.sh
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
TRANSFORMERS_CACHE=~/tmp/transformers_cache/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest -m 'seq_inference' unit/
TRANSFORMERS_CACHE=~/tmp/transformers_cache/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest -m 'inference_ops' -m 'inference' unit/
cd tests
# LOCAL_SIZE=2 enforce CPU to report 2 devices, this helps run the test on github default runner
LOCAL_SIZE=2 COLUMNS=240 HF_HOME=~/tmp/hf_home/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest -m 'seq_inference' unit/
LOCAL_SIZE=2 COLUMNS=240 HF_HOME=~/tmp/hf_home/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest -m 'inference_ops' -m 'inference' unit/
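
The cpu-inference job now runs on a self-hosted CPU runner, builds oneCCL from source, and sources its environment before building DeepSpeed's CCL communication backend and running the inference suites. Condensed from the workflow steps above, a local reproduction looks roughly like this (assuming a checkout of this repository as the working directory):

  git clone https://github.com/oneapi-src/oneCCL
  cd oneCCL && mkdir build && cd build
  cmake .. && make -j install
  cd ../..
  pip install .[dev,1bit,autotuning,inf]
  source oneCCL/build/_install/env/setvars.sh
  export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
  cd tests
  LOCAL_SIZE=2 HF_HOME=~/tmp/hf_home/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest -m 'seq_inference' unit/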
@@ -1,31 +1,39 @@
name: nv-torch110-p40
name: cpu-torch-latest

on:
workflow_dispatch:
pull_request:
paths-ignore:
- 'docs/**'
- 'blogs/**'
- 'deepspeed/inference/v2/**'
- 'tests/unit/inference/v2/**'
merge_group:
branches: [ master ]
schedule:
- cron: "0 0 * * *"
workflow_dispatch:

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

permissions:
contents: read
issues: write

jobs:
unit-tests:
runs-on: [self-hosted, nvidia, cu111, p40]
runs-on: ubuntu-24.04

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- id: setup-venv
uses: ./.github/workflows/setup-venv

- name: Install system packages
run: |
sudo apt-get install -y numactl pdsh

- name: Install pytorch
run: |
pip install -U --cache-dir $TORCH_CACHE torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

@@ -34,13 +42,13 @@ jobs:
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
# git checkout 1cc453d33
git checkout 981c276
git rev-parse --short HEAD
pip install .

- name: Install deepspeed
run: |
pip install .[dev,1bit,autotuning] --no-build-isolation
pip install .[dev,autotuning]
ds_report

- name: Python environment
@@ -51,13 +59,5 @@ jobs:
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="1.10" --cuda_ver="11.1"

- name: Open GitHub issue if nightly CI fails
if: ${{ failure() && (github.event_name == 'schedule') }}
uses: JasonEtco/create-an-issue@v2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
update_existing: true
HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.6"
HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.6"
7 changes: 4 additions & 3 deletions .github/workflows/formatting.yml
@@ -1,6 +1,7 @@
 name: Formatting

 on:
+  workflow_dispatch:
   pull_request:
     branches:
       '**'
@@ -16,11 +17,11 @@ concurrency:
 jobs:

   # formatting and basic install on cpu-only machine
-  formatting:
-    runs-on: ubuntu-20.04
+  unit-tests:
+    runs-on: ubuntu-22.04

     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4

       - name: environment
         run: |
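
The Formatting workflow (its job renamed from formatting to unit-tests here) runs the repository's style checks on a CPU-only Ubuntu runner. Assuming the repository keeps its standard pre-commit configuration, the same checks can be reproduced locally before pushing:

  pip install pre-commit
  pre-commit run --all-files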