Releases · openucx/ucc
1.3.0 (April 18, 2024)
New Features and Enhancements
CL/HIER
- Disable onesided alltoallv {PR #875}
TL/CUDA
- Initialize remote CUDA scratch to NULL {PR #911}
TL/UCP
- Enable hybrid alltoallv {PR #781}
- Avoid copy in knomial scatter {PR #771}
- Enable rank reordering for reduce_scatter, knomial allreduce, and ring allgather/allgatherv {PR #819}
- Remove memcpy in last SRA step {PR #743}
- Fix sparse pack in hybrid a2av {PR #825}
- Fix recycle in hybrid a2av {PR #827}
- Reorder ranks for SRA {PR #834}
- Use ring allgather when reordering needed {PR #879}
- Use pipelining in SRA allreduce for CUDA {PR #873}
- Poll for onesided alltoall completion {PR #876}
- Add support for non-host buffers in bruck alltoall {PR #852}
- Add neighbor exchange allgather {PR #822}
TL/SHARP
- Enable bcast for any predefined dt {PR #774}
- Don't print team create error {PR #777}
- Check datasize supported {PR #776}
- Fix sharp context cleanup {PR #843}
API
- Remove duplicate get_version_string {PR #933}
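For context on the entry above: ucc_get_version_string is part of the public UCC API, and a minimal version query looks like the sketch below.

```c
/* Minimal sketch: print the UCC library version via the public API. */
#include <ucc/api/ucc.h>
#include <stdio.h>

int main(void)
{
    printf("UCC version: %s\n", ucc_get_version_string());
    return 0;
}
```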
TL/NCCL
- Make team init non-blocking {PR #772}
- Add CUDA managed to score {PR #793}
- Make ncclGroupEnd non-blocking {PR #798}
- Lazy init nccl comm {PR #851}
TL/MLX5
- Share ib_ctx and pd {PR #749}
- Add registration cache (rcache) {PR #753}
- Device memory and topo init {PR #780}
- Add mcast interface {PR #784}
- A2A part 1 -- coll init {PR #790}
- A2A part 2 -- full collective {PR #802}
- Revisit team and ctx init {PR #815}
- Fix context create hang {PR #887}
- Add librdmacm linkage {PR #910}
CORE
- Fix score update when only score given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fix error handling for ctx create epilog {PR #818}
- Skip zero size collectives {PR #787}
DOCS
- Updating NEWS for v1.2 {PR #791}
BUILD and TEST
v1.3.0-rc1
New Features and Enhancements
CL/HIER
- Disable onesided alltoallv {PR #875}
TL/CUDA
- Initialize remote CUDA scratch to NULL {PR #911}
TL/UCP
- Enable hybrid alltoallv {PR #781}
- Avoid copy in knomial scatter {PR #771}
- Enable rank reordering for reduce_scatter, knomial allreduce, and ring allgather/allgatherv {PR #819}
- Remove memcpy in last SRA step {PR #743}
- Fix sparse pack in hybrid a2av {PR #825}
- Fix recycle in hybrid a2av {PR #827}
- Reorder ranks for SRA {PR #834}
- Use ring allgather when reordering needed {PR #879}
- Use pipelining in SRA allreduce for CUDA {PR #873}
- Poll for onesided alltoall completion {PR #876}
- Add support for non-host buffers in bruck alltoall {PR #852}
- Add neighbor exchange allgather {PR #822}
TL/SHARP
- Enable bcast for any predefined dt {PR #774}
- Don't print team create error {PR #777}
- Check datasize supported {PR #776}
- Fix sharp context cleanup {PR #843}
API
- Remove duplicate get_version_string {PR #933}
TL/NCCL
- Make team init non-blocking {PR #772}
- Add CUDA managed to score {PR #793}
- Make ncclGroupEnd non-blocking {PR #798}
- Lazy init nccl comm {PR #851}
TL/MLX5
- Share ib_ctx and pd {PR #749}
- Add registration cache (rcache) {PR #753}
- Device memory and topo init {PR #780}
- Add mcast interface {PR #784}
- A2A part 1 -- coll init {PR #790}
- A2A part 2 -- full collective {PR #802}
- Revisit team and ctx init {PR #815}
- Fix context create hang {PR #887}
- Add librdmacm linkage {PR #910}
CORE
- Fix score update when only score given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fix error handling for ctx create epilog {PR #818}
- Skip zero size collectives {PR #787}
DOCS
- Updating NEWS for v1.2 {PR #791}
TEST
UCC v1.2.0
This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:
New Features and Enhancements
CL/HIER
- Fixed single proc on node issue in alltoall (#658)
- Implemented allreduce rab pipelined (#608)
- Added bcast 2step algorithm (#620)
- Fixed allreduce rab pipeline (#759)
TL/CUDA
- Added support for CUDA 12
- Fixed cache unmap issue (#642)
- Implemented reduce scatter linear (#669)
- Added algorithm selection based on topology (#688)
- Fixed linear algorithms (#751)
- Fixed pipelining in linear rs (#770)
TL/UCP
- Added special service worker (#560)
- Added scatterv (#663)
- Added gatherv (#664)
- Fixed running with npolls 0 (#695)
- Added knomial allgather (#729)
- Fixed bug for triggered colls (#757)
- Added bruck alltoall (#756)
- Added SLOAV alltoallv (#687)
- Large message broadcast optimizations (#738)
- Rank reordering in ring allgather for better locality (#69)
TL/SHARP
- Fixed memory type check in allreduce (#662)
- Added support for sharpv3 dt (#661)
- Fixed assert check (#686)
- Implemented SHARP OOB fixes (#746)
- Fixed local rank calculation when the NODE SBGP is not enabled (#760)
- Prevented SHARP team creation with team max ppn > 1 (#761)
CORE
- Fixed memory type score update (#650)
- Fixed ucc parser build (#666)
- Implemented ucc_pipeline_params (#675)
- Changed log level of config_modify (#667)
- Fixed timeout handle for triggered post (#679)
DOCS
- Added User Guide (#720)
v1.2.0-rc1
This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:
New Features and Enhancements
CL/HIER
- Fixed single proc on node issue in alltoall (#658)
- Implemented allreduce rab pipelined (#608)
- Added bcast 2step algorithm (#620)
- Fixed allreduce rab pipeline (#759)
TL/CUDA
- Fixed cache unmap issue (#642)
- Implemented reduce scatter linear (#669)
- Added algorithm selection based on topology (#688)
- Fixed linear algorithms (#751)
- Fixed pipelining in linear rs (#770)
TL/UCP
- Added special service worker (#560)
- Added scatterv (#663)
- Added gatherv (#664)
- Fixed running with npolls 0 (#695)
- Added knomial allgather (#729)
- Fixed bug for triggered colls (#757)
- Added bruck alltoall (#756)
TL/SHARP
- Fixed memory type check in allreduce (#662)
- Added support for sharpv3 dt (#661)
- Fixed assert check (#686)
- Implemented SHARP OOB fixes (#746)
- Fixed local rank calculation when the NODE SBGP is not enabled (#760)
- Prevented SHARP team creation with team max ppn > 1 (#761)
CORE
- Fixed memory type score update (#650)
- Fixed ucc parser build (#666)
- Implemented ucc_pipeline_params (#675)
- Changed log level of config_modify (#667)
- Fixed timeout handle for triggered post (#679)
DOCS
- Added User Guide (#720)
UCC Version 1.1.0
Features
API
- Added float128 and complex float32/64/128 data types
- Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging (see the sketch after this list)
- Added ucc_team_get_attr interface
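A hedged illustration of the active-set entry above: the sketch posts a broadcast over a strided subset of a team's ranks. The collective init/post/test/finalize calls are from the public ucc.h API; the active_set field names and the UCC_COLL_ARGS_FIELD_ACTIVE_SET mask bit follow my reading of the headers and should be verified against your installed version, and bcast_even_ranks is a hypothetical helper.

```c
/* Hedged sketch: broadcast over an "active set" (every second rank,
 * root = rank 0) instead of the whole team. */
#include <ucc/api/ucc.h>

ucc_status_t bcast_even_ranks(ucc_team_h team, ucc_context_h ctx,
                              void *buf, size_t count, size_t team_size)
{
    ucc_coll_args_t args = {0};
    ucc_coll_req_h  req;
    ucc_status_t    st;

    args.mask              = UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args.coll_type         = UCC_COLL_TYPE_BCAST;
    args.root              = 0;
    args.src.info.buffer   = buf;
    args.src.info.count    = count;
    args.src.info.datatype = UCC_DT_FLOAT32;
    args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
    args.active_set.start  = 0;                   /* first rank in the set */
    args.active_set.stride = 2;                   /* every second rank     */
    args.active_set.size   = (team_size + 1) / 2; /* ranks in the set      */

    if ((st = ucc_collective_init(&args, &req, team)) != UCC_OK) {
        return st;
    }
    if ((st = ucc_collective_post(req)) == UCC_OK) {
        /* Drive progress until the collective completes. */
        while ((st = ucc_collective_test(req)) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);
        }
    }
    ucc_collective_finalize(req);
    return st;
}
```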
Core
- Config file support
- Fixed component search
CL
- Added split rail allreduce collective implementation
- Enabled hierarchical alltoallv and barrier
- Fixed cleanup bugs
TL
- Added SELF TL supporting team size one
UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
GPU Collectives (CUDA, NCCL, and RCCL TLs)
- Added RCCL TL to support RCCL collectives
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv in CUDA TL
- Added topo-based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather, scatter and its vector variant
- Enabled use of multiple streams for collectives
- Added support for RCCL gather(v), scatter(v), broadcast, allgather(v), barrier, alltoall(v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
UCC Version 1.1.0 - RC1
Features
API
- Added float128 and complex float32/64/128 data types
- Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging
Core
- Config file support
- Fixed component search
CL
- Added split rail allreduce collective implementation
- Enabled hierarchical alltoallv
- Fixed cleanup bugs
TL
- Added SELF TL supporting team size one
UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
GPU Collectives (CUDA, NCCL, and RCCL TLs)
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall in CUDA TL
- Added NCCL gather, scatter and its vector variant
- Enabled use of multiple streams for collectives
- Added support for RCCL gather(v), scatter(v), broadcast, allgather(v), barrier, alltoall(v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
Unified Collective Communication, Version 1.0.0
Features
API
- Added Avg reduce operation
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarified semantics of core abstractions, including teams and contexts
- Added timeout option
Core
- Added coll scoring and selection support (see the sketch after this list)
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
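Score-based selection is typically steered through per-TL environment variables rather than code changes; below is a hedged sketch of doing so from application setup code. UCC_TLS is a real selection knob; the UCC_TL_NCCL_TUNE score string shown is an assumption based on the UCC FAQ, and its grammar may differ between versions.

```c
/* Hedged sketch: bias UCC's score-based TL selection via environment
 * variables before ucc_init() is called. */
#include <stdlib.h>

static void tune_ucc_selection(void)
{
    /* Restrict UCC to the UCP and NCCL transport layers. */
    setenv("UCC_TLS", "ucp,nccl", 1);

    /* Assumed score string: give TL/NCCL the highest possible score
     * ("inf") for allreduce so it wins selection for that collective.
     * Check the exact syntax against the UCC documentation. */
    setenv("UCC_TL_NCCL_TUNE", "allreduce:inf", 1);
}
```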
CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
TL
- Added SHARP TL
UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Made allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
SHARP
- Added support for switch-based hardware collectives (SHARP)
NCCL
- Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce, reduce-scatter, bcast, allgather, and allgatherv
Tests
- Updated tests to test the newly added algorithms and operations
Unified Collective Communication, Version 1.0.0 - RC2
Release notes identical to the 1.0.0 release above.
Unified Collective Communication, Version 0.1.0 - RC1
This is an early release of the UCC API and its implementation. Major features in this release are detailed below.
Features
API
- UCC API to support library, contexts, teams, collective operations, execution engine, memory types, and triggered operations
Core
- Added implementation for the UCC abstractions: library, context, team, collective operations, execution engine, memory types, and triggered operations
- Added support for memory types: CUDA and CPU
- Added support for configuring UCC library and contexts (see the sketch after this list)
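A minimal sketch of the library and context configuration flow named in the last bullet, assembled from the public ucc.h entry points (ucc_lib_config_read, ucc_init, ucc_context_config_read, ucc_context_create). Error handling is trimmed and the thread-mode choice is illustrative.

```c
/* Minimal sketch: bring up a UCC library and context using the public
 * configuration API. UCC_* environment variables are picked up by the
 * *_config_read calls. */
#include <ucc/api/ucc.h>

int main(void)
{
    ucc_lib_config_h     lib_config;
    ucc_context_config_h ctx_config;
    ucc_lib_params_t     lib_params = {
        .mask        = UCC_LIB_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCC_THREAD_SINGLE, /* illustrative choice */
    };
    ucc_context_params_t ctx_params = {0};
    ucc_lib_h            lib;
    ucc_context_h        ctx;

    /* Read UCC_* configuration and initialize the library. */
    if (ucc_lib_config_read(NULL, NULL, &lib_config) != UCC_OK) return 1;
    if (ucc_init(&lib_params, lib_config, &lib) != UCC_OK)      return 1;
    ucc_lib_config_release(lib_config);

    /* Same pattern for the communication context. */
    if (ucc_context_config_read(lib, NULL, &ctx_config) != UCC_OK)        return 1;
    if (ucc_context_create(lib, &ctx_params, ctx_config, &ctx) != UCC_OK) return 1;
    ucc_context_config_release(ctx_config);

    /* ... create teams and post collectives here ... */

    ucc_context_destroy(ctx);
    ucc_finalize(lib);
    return 0;
}
```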
CL
- Added support for collectives where the source and destination buffers reside in either CPU or device (GPU) memory
- Added support for UCC_THREAD_MULTIPLE
- Added support for CUDA stream-based collectives
TL
- Added support for send/receive-based collectives using UCX/UCP as a transport layer
- Added support in the UCP TL for basic collective types including barrier, alltoall, alltoallv, broadcast, allgather, allgatherv, and allreduce
- Added support for using NCCL as a transport layer
- Support for collective types including alltoall, alltoallv, allgather, allgatherv, allreduce, and broadcast
Tests
- Added support for unit testing (gtest) infrastructure
- Added support for MPI tests
Unified Collective Communication, Version 0.1.0
Release notes identical to the 0.1.0 - RC1 entry above.