Releases · openucx/ucx

21 Feb 22:58

tvegas1

v1.18.1-rc1

3ed7241

v1.18.1 RC1 Pre-release

1.18.1-rc1 (February 20, 2025)

Features:

AZP

Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

Fixed potential active message user header use after free with protocol reconfiguration

21 Jan 09:39

tvegas1

v1.18.0

693d028

v1.18.0 Latest

1.18.0 (January 17, 2025)

Features:

UCP

Enabled using CUDA staging buffers for pipeline protocols by default
Added endpoint reconfiguration support for non-reused p2p scenarios
Enabled non-cacheable memory domains, activated for gdr_copy
Added user_data parameter to ucp_ep_query
Added support for host memory pipeline through CUDA buffers for rendezvous protocol
Added global VA infrastructure and memory region in absence of error handling
Made protocol performance node names more informative
Enforced always running on the same thread in single thread mode
Multiple improvements in protocols selection infrastructure
Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
Allowed up to 64 endpoint lanes for systems with many transports or devices
Added usage tracker to worker
Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

Added environment variable to manage DC initiator capacity
Added DC dcs_hybrid policy
Reduced MLX5/DV stack size consumption
Added ODP support for verbs and mlx5dv
Added support of CUDA managed memory on IB when ODP is available
Added support of Adaptive Routing on RoCE
Enabled use of implicit ODP with relaxed ordering
Improved GPU-Direct detection in IB transport
Increased DC initiator default count to 32 for performance optimization
Added ConnectX-8 device support with DDP
Added support for subnet filter list for RoCE interfaces
Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
Added IB MLX5 as a separate UCX module with separate RPM sub-package
Added initial support for GGA transport, for fast DPU memory access
Set IB DevX atomic mode based on device capabilities
Removed DC keepalive mechanism, since the keepalive is done on UCP layer
Optimized cross-gVMI memory registration using indirect memory keys cache
Improved various logging messages
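
The subnet filter list for RoCE interfaces listed above can be pictured as a membership test of an interface address against a list of CIDR ranges. The following is a minimal Python sketch of that idea using the standard `ipaddress` module. It is not UCX code: the helper name `matches_subnet_filter` and the sample subnets are assumptions made purely for illustration, and the real filter syntax may differ.

```python
import ipaddress


def matches_subnet_filter(address, subnets):
    """Return True if `address` falls inside any CIDR range in `subnets`.

    Hypothetical helper for illustration only; UCX implements its own
    subnet matching in C and may differ in detail.
    """
    ip = ipaddress.ip_address(address)
    for subnet in subnets:
        network = ipaddress.ip_network(subnet, strict=False)
        # An IPv4 address never matches an IPv6 network, and vice versa.
        if ip.version == network.version and ip in network:
            return True
    return False


# Made-up filter list mixing IPv4 and IPv6 ranges.
allowed_subnets = ["192.168.10.0/24", "fd00::/8"]
for candidate in ("192.168.10.17", "10.0.0.5", "fd12::1"):
    print(candidate, matches_subnet_filter(candidate, allowed_subnets))
```

Keeping the version check explicit makes the mixed IPv4/IPv6 behaviour obvious to the reader instead of relying on library defaults.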

CUDA

Added multi-node NVlink support
Added CUDA Fabric memory support with detection and allocation
Improved gdr_copy latency estimations on AMD Milan systems
Added check for gdr_copy runtime/build version mismatch
Added handling missing IPC capability when unpacking keys
Added caching for CUDA IPC memory pool import operation
Added gdr_copy variables to optimize performance on Grace Hopper systems
Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

Added support for wildcards in configuration parameter names
Added ASAN protection to several internal data structures
Reduced stack usage in topology detection code
Improved bitmaps configuration parsing with wider bitfield
Added options to set topology distance between devices
Optimized VFS unix socket watch by using user private folder
Added general IP subnet matching infrastructure
Extended array data structure to support user-provided array copy routine
Improved time units description
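
To make the idea of wildcards in configuration parameter names concrete, here is a small Python sketch that expands a wildcard key against a list of known parameter names. It assumes shell-style `*` patterns and uses the standard `fnmatch` module; the actual UCX pattern syntax and matching rules are not described in these notes, so treat the function `expand_wildcard_settings` and the sample values as hypothetical. The parameter names are borrowed from other entries in these release notes.

```python
import fnmatch


def expand_wildcard_settings(settings, known_names):
    """Expand wildcard keys in `settings` against `known_names`.

    Returns a dict mapping each matching concrete name to its value.
    Later entries in `settings` override earlier ones.
    Illustrative only; this is not how UCX parses its configuration.
    """
    resolved = {}
    for pattern, value in settings.items():
        for name in known_names:
            # Case-sensitive shell-style match ('*', '?', '[...]').
            if fnmatch.fnmatchcase(name, pattern):
                resolved[name] = value
    return resolved


# Names appear elsewhere in these notes; the value is made up.
known_parameters = ["IB_SEND_OVERHEAD", "MM_SEND_OVERHEAD", "MM_RECV_OVERHEAD"]
print(expand_wildcard_settings({"MM_*_OVERHEAD": "10ns"}, known_parameters))
```

A single wildcard entry sets both `MM_SEND_OVERHEAD` and `MM_RECV_OVERHEAD` here, which is the kind of shorthand such a feature is useful for.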

UCM

Extended CUDA memory hooks to include memory mapping APIs

Tools

Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
Improved ucx_perftest uni-directional test with added fence
Detailed ucx_perftest batch section of command-line documentation

Documentation

Added a section regarding adaptive routing on RoCE

Architecture

Added CPU Model for MI300A
Added Fujitsu ARM specific values to ucx.conf
Added AMD Turin support
Added an optimized non-temporal memory copy implementation for AMD CPU

Build

Improved compiler error reporting with added flag
Improved coverity script to allow faster turnaround time
Improved Intel Compiler detection and support

GO

Added multi-send flag and user memh support in request params

Packaging

Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

Fixed stack overflow in exported rkey unpack
Removed extra remote-cpu overhead from protocol estimation for zcopy
Fixed performance estimation for rndv pipeline protocols
Fixed ATP sending by picking the correct lane
Fixed missing reg_id on memh creation
Fixed repeated invalidations by retaining existing access flags
Fixed abort reason propagation for rendezvous RTR mtype
Do not check transport availability if it is disabled by UCX_TLS environment variable
Fixed wrong flag being used for checking BCOPY capability
Fixed sending too many ATPs for small messages
Enforced 16 bits size for Active Messages identifiers
Fixed unnecessary status check for emulated AMO
Fixed more than one fragment sending in rendezvous pipeline
Fixed crash by using biggest max frag across all lanes
Fixed missing memory handle flags by copying from parent to child
Fixed worker interface activate count
Fixed flush requests by replacing ATP/flush lane map with lane indexes
Fixed lost uct_flags when merging memory regions

UCT

Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

Fixed FETCH_ADD remote access error for ODP/KSM case
Fixed missing conditional compilation checks for DM
Fixed IB MD allocation naming typo
Fixed invalid GIDs filter in IB
Fixed flags usage in MLX5 zcopy_post
Do not limit ODP registration retries
Fixed JUCX failures by considering the number of supported completion vectors

CUDA

Fixed async memory handling using CUDA memory type on Grace
Added rcache overhead in performance estimation
Fixed gdr_copy performance regression by providing maximum estimation between get and put
Fixed CUDA IPC reachability check
Fixed crash in MPI_Finalize when CUDA context is destroyed
Always require rcache by default for gdr_copy
Fixed crash in gdr_copy cleanup when registration cache is disabled
Fixed CUDA copy memory domain allocations
Fixed multiple tests for gdr_copy transport
Fixed race condition in CUDA IPC peer accessible cache

UCS

Fixed a crash by using heap allocation to process expired timers in batch
Fixed allocation issue on memtrack dump
Fixed deletion of the monitored folder in VFS
Fixed unsafe resize for DC initiator array
Fixed function macro invocation to match C standard
Fixed calling async handler on already released resource
Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
Fixed undeclared value error in timer conversion routine
Fixed uninitialized value access in registration cache

UCM

Fixed race condition in parsing proc maps
Fixed mremap failure while parsing /proc/self/maps

ROCM

Fixed ROCM interface reachability test
Fixed memory domain fork test

TCP

Always bind endpoint to interface

Tools

Fixed buffer size potential overflow in ucx_perftest
Fixed missing address when packing memory keys on ucx_perftest
Fixed memory leak for endpoint report in ucx_info
Fixed build without openmp in ucx_perftest
Fixed UCT device override on server side on ucx_perftest

Build

Fixed using correct ASAN version for running tests

Configuration

Used POSIX Bourne shell syntax to check equality
Fixed build failure by using proper flags in compiler.m4
Fixed perftest MAD support default guessing

GO

Added serialized thread mode to avoid subtle races between threads
Fixed make distcheck

Assets 21

ucx-1.18.0-1.el7.src.rpm (3.16 MB, 2025-01-21T08:53:09Z)
ucx-1.18.0-1.el8.src.rpm (3.2 MB, 2025-01-21T08:49:05Z)
ucx-1.18.0-centos7-mofed5-cuda11-x86_64.tar.bz2 (7.32 MB, 2025-01-21T09:05:51Z)
ucx-1.18.0-centos7-mofed5-cuda12-x86_64.tar.bz2 (7.32 MB, 2025-01-21T09:08:52Z)
ucx-1.18.0-centos8-mofed5-cuda11-aarch64.tar.bz2 (8.7 MB, 2025-01-21T08:58:30Z)
ucx-1.18.0-centos8-mofed5-cuda11-x86_64.tar.bz2 (9.14 MB, 2025-01-21T09:16:07Z)
ucx-1.18.0-ubuntu16.04-mofed5-cuda11-x86_64.tar.bz2 (1.58 MB, 2025-01-21T09:14:14Z)
ucx-1.18.0-ubuntu18.04-mofed5-cuda11-aarch64.tar.bz2 (1.43 MB, 2025-01-21T08:58:05Z)
ucx-1.18.0-ubuntu18.04-mofed5-cuda11-x86_64.tar.bz2 (1.52 MB, 2025-01-21T09:19:25Z)
ucx-1.18.0-ubuntu18.04-mofed5-cuda12-x86_64.tar.bz2 (1.52 MB, 2025-01-21T09:18:41Z)
Source code (zip) (2025-01-17T14:34:18Z)
Source code (tar.gz) (2025-01-17T14:34:18Z)

23 Dec 17:06

tvegas1

v1.18.0-rc3

9ce35d0

v1.18.0 RC3 Pre-release

1.18.0-rc3 (December 23, 2024)

Features:

UCP

Enabled using CUDA staging buffers for pipeline protocols by default
Added endpoint reconfiguration support for non-reused p2p scenarios
Enabled non-cacheable memory domains, activated for gdr_copy
Added user_data parameter to ucp_ep_query
Added support for host memory pipeline through CUDA buffers for rendezvous protocol
Added global VA infrastructure and memory region in absence of error handling
Made protocol performance node names more informative
Enforced always running on the same thread in single thread mode
Multiple improvements in protocols selection infrastructure
Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
Allowed up to 64 endpoint lanes for systems with many transports or devices
Added usage tracker to worker
Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

Added environment variable to manage DC initiator capacity
Added DC dcs_hybrid policy
Reduced MLX5/DV stack size consumption
Added ODP support for verbs and mlx5dv
Added support of CUDA managed memory on IB when ODP is available
Added support of Adaptive Routing on RoCE
Enabled use of implicit ODP with relaxed ordering
Improved GPU-Direct detection in IB transport
Increased DC initiator default count to 32 for performance optimization
Added ConnectX-8 device support with DDP
Added support for subnet filter list for RoCE interfaces
Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
Added IB MLX5 as a separate UCX module with separate RPM sub-package
Added initial support for GGA transport, for fast DPU memory access
Set IB DevX atomic mode based on device capabilities
Removed DC keepalive mechanism, since the keepalive is done on UCP layer
Optimized cross-gVMI memory registration using indirect memory keys cache
Improved various logging messages

CUDA

Added multi-node NVlink support
Added CUDA Fabric memory support with detection and allocation
Improved gdr_copy latency estimations on AMD Milan systems
Added check for gdr_copy runtime/build version mismatch
Added handling missing IPC capability when unpacking keys
Added caching for CUDA IPC memory pool import operation
Added gdr_copy variables to optimize performance on Grace Hopper systems
Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

Added support for wildcards in configuration parameter names
Added ASAN protection to several internal data structures
Reduced stack usage in topology detection code
Improved bitmaps configuration parsing with wider bitfield
Added options to set topology distance between devices
Optimized VFS unix socket watch by using user private folder
Added general IP subnet matching infrastructure
Extended array data structure to support user-provided array copy routine
Improved time units description

UCM

Extended CUDA memory hooks to include memory mapping APIs

Tools

Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
Improved ucx_perftest uni-directional test with added fence
Detailed ucx_perftest batch section of command-line documentation

Documentation

Added a section regarding adaptive routing on RoCE

Architecture

Added CPU Model for MI300A
Added Fujitsu ARM specific values to ucx.conf
Added AMD Turin support
Added an optimized non-temporal memory copy implementation for AMD CPU

Build

Improved compiler error reporting with added flag
Improved coverity script to allow faster turnaround time
Improved Intel Compiler detection and support

GO

Added multi-send flag and user memh support in request params

Packaging

Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

Fixed stack overflow in exported rkey unpack
Removed extra remote-cpu overhead from protocol estimation for zcopy
Fixed performance estimation for rndv pipeline protocols
Fixed ATP sending by picking the correct lane
Fixed missing reg_id on memh creation
Fixed repeated invalidations by retaining existing access flags
Fixed abort reason propagation for rendezvous RTR mtype
Do not check transport availability if it is disabled by UCX_TLS environment variable
Fixed wrong flag being used for checking BCOPY capability
Fixed sending too many ATPs for small messages
Enforced 16 bits size for Active Messages identifiers
Fixed unnecessary status check for emulated AMO
Fixed more than one fragment sending in rendezvous pipeline
Fixed crash by using biggest max frag across all lanes
Fixed missing memory handle flags by copying from parent to child
Fixed worker interface activate count
Fixed flush requests by replacing ATP/flush lane map with lane indexes
Fixed lost uct_flags when merging memory regions

UCT

Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

Fixed FETCH_ADD remote access error for ODP/KSM case
Fixed missing conditional compilation checks for DM
Fixed IB MD allocation naming typo
Fixed invalid GIDs filter in IB
Fixed flags usage in MLX5 zcopy_post
Do not limit ODP registration retries
Fixed JUCX failures by considering the number of supported completion vectors

CUDA

Fixed async memory handling using CUDA memory type on Grace
Added rcache overhead in performance estimation
Fixed gdr_copy performance regression by providing maximum estimation between get and put
Fixed CUDA IPC reachability check
Fixed crash in MPI_Finalize when CUDA context is destroyed
Always require rcache by default for gdr_copy
Fixed crash in gdr_copy cleanup when registration cache is disabled
Fixed CUDA copy memory domain allocations
Fixed multiple tests for gdr_copy transport
Fixed race condition in CUDA IPC peer accessible cache

UCS

Fixed a crash by using heap allocation to process expired timers in batch
Fixed allocation issue on memtrack dump
Fixed deletion of the monitored folder in VFS
Fixed unsafe resize for DC initiator array
Fixed function macro invocation to match C standard
Fixed calling async handler on already released resource
Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
Fixed undeclared value error in timer conversion routine
Fixed uninitialized value access in registration cache

UCM

Fixed race condition in parsing proc maps
Fixed mremap failure while parsing /proc/self/maps

ROCM

Fixed ROCM interface reachability test
Fixed memory domain fork test

TCP

Always bind endpoint to interface

Tools

Fixed buffer size potential overflow in ucx_perftest
Fixed missing address when packing memory keys on ucx_perftest
Fixed memory leak for endpoint report in ucx_info
Fixed build without openmp in ucx_perftest
Fixed UCT device override on server side on ucx_perftest

Build

Fixed using correct ASAN version for running tests

Configuration

Used POSIX Bourne shell syntax to check equality
Fixed build failure by using proper flags in compiler.m4
Fixed perftest MAD support default guessing

GO

Added serialized thread mode to avoid subtle races between threads
Fixed make distcheck

10 Dec 16:40

tvegas1

v1.18.0-rc2

e992f1b

v1.18.0 RC2 Pre-release

1.18.0-rc2 (December 10, 2024)

Features: TBD

Bugfixes: TBD

26 Nov 13:25

tvegas1

v1.18.0-rc1

a0fb15f

v1.18.0 RC1 Pre-release

1.18.0-rc1 (November 26, 2024)

Features: TBD

Bugfixes: TBD

13 Jun 15:35

shasson5

v1.17.0

4ef9a09

v1.17.0

1.17.0 (June 13, 2024)

Features:

UCP

Improved the accuracy of rendezvous protocol performance estimation
Enabled short protocol for non-host memory types on empty messages
Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads
Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols
Added support for separate intra/inter-node rendezvous thresholds
Added support for minimal fragment size in rendezvous protocol
Added support for resetting request during send operation
Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads
Improved performance for combined Active Message/RMA scenarios by separating them to different lanes
Added support for device staging buffers in pipeline protocols
Enabled on-demand paging for Nvidia's Grace platforms by default

RDMA CORE (IB, ROCE, etc.)

Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. By default, it uses UCX_IB_SL.
Added support for GID auto-detection in Floating LID based routing
Added support for multithreading KSM registration of unaligned buffers
Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables

GPU (CUDA, ROCM)

Added support for oneAPI Level-Zero library for Intel GPUs

UCS

Added support for rcache dynamic region alignment
Added dynamic bitmap data structure
Added support for advanced key-value parsing for UCX configuration
Added piecewise linear function data structure
Added support for allocating dynamic arrays on stack
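
As a rough picture of what a piecewise linear function data structure does, the Python sketch below stores sorted breakpoints and interpolates linearly between them. It is not the UCS implementation: the class name `PiecewiseLinear`, the choice to extend the end segments linearly outside the breakpoint range, and the sample points (loosely imagined as an estimated cost versus message size) are all assumptions for illustration.

```python
import bisect


class PiecewiseLinear:
    """Piecewise linear function defined by (x, y) breakpoints.

    Illustrative sketch only. Outside the breakpoint range the nearest
    segment is extended linearly.
    """

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]
        if len(self.points) < 2:
            raise ValueError("need at least two breakpoints")
        if len(set(self.xs)) != len(self.xs):
            raise ValueError("breakpoint x values must be distinct")

    def __call__(self, x):
        # Locate the segment containing x, clamped to the valid segments.
        index = bisect.bisect_right(self.xs, x) - 1
        index = max(0, min(index, len(self.points) - 2))
        (x0, y0), (x1, y1) = self.points[index], self.points[index + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up breakpoints: (message size, estimated cost).
estimate = PiecewiseLinear([(0, 1.0), (1024, 2.0), (4096, 8.0)])
print(estimate(512), estimate(2048))
```

Binary search over the breakpoints keeps evaluation logarithmic in the number of segments, which matters when the function is evaluated often.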

Tools

Added support for device memory allocation in UCX perftest
Added a script to use for squashing commits after PR approval
Added support for DPU cross-gvmi daemon in UCX perftest

Java

Added support for EP local socket address API in JUCX

Build

Added address sanitizer support
Added a helper shell script to run static checks

AZP

Replaced Valgrind tests with address sanitizer tool
Added Ubuntu 22.04 docker image testing

Configuration

Added support for filtering configuration sections by platform type
Added configuration file with section for Grace Hopper

Bugfixes:

UCP

Fixed crash due to incorrect lane selection when active message is disabled
Fixed RMA lane selection issue due to wrong bandwidth calculation
Fixed rendezvous protocol information in protocol details table
Fixed endpoint reconfiguration issue due to wrong bandwidth calculation
Fixed Active Message handlers issue due to out of order registration
Fixed registration of memh evens for imported memory key
Fixed sockaddr unreachable destination error handling
Fixed uninitialized memory issue in new protocols infrastructure
Fixed race condition when using strong fence by flushing all endpoints
Fixed incorrect RMA message size on immediate completion with no datatype
Fixed incorrect performance estimation due to fp8 pack/unpack issue
Fixed remote access error when rcache memory is not registered with atomic access
Fixed assertion failure when rcache fails during memh allocation
Fixed atomic device selection issue
Fixed worker interface deactivation while still in use by endpoints
Fixed wire compatibility issue due to mismatched lane selection

RDMA CORE (IB, ROCE, etc.)

Disabled device memory if atomics are not available
Fixed indirect keys creation for MT registered memory
Fixed KSM start address value when creating export key
Fixed DCI pool index to support maximum of 16 pools
Fixed atomic rkey issue when using imported memory
Fixed crash due to unsupported SRQ capability

GPU (CUDA, ROCM)

Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport
Fixed usage of CUDA device 0 when no context is active
Removed error handling support from CUDA IPC transport
Fixed allocation of unaligned CUDA memory

Shared Memory

Fixed occasional crash when shm_unlink fails during interface initialization

UCS

Fixed system device distance calculation for devices on different PCIe root
Fixed support for large size arrays in ucs_array
Fixed synchronization issue in rcache
Fixed uninitialized variable access in rcache

Tests

Fixed test failures when GPU is present but disabled
Fixed Active Message hanging issue in ucp_client_server
Fixed potential crash due to redundant munmap call in ucp mmap tests
Fixed a crash when running CUDA gtest under valgrind
Fixed UD endpoint timeout issue under Valgrind

Java

Fixed failures in Java tests by waiting for send requests completion
Fixed JVM segfault in Java tests when gdrcopy driver is not loaded
Fixed go build and go tests failures

Packaging

Disabled Go bindings in Debian package

06 Jun 16:47

shasson5

v1.17.0-rc3

770b5a6

v1.17.0 RC3 Pre-release

1.17.0 RC3 (June 6, 2024)

Bugfixes:

UCP

Fixed wire compatibility issue due to mismatched lane selection

UCS

Fixed uninitialized variable access in rcache

03 Jun 08:10

shasson5

v1.17.0-rc2

9cec0d4

v1.17.0 RC2 Pre-release

1.17.0 RC2 (May 29, 2024)

Features:

UCP

Improved the accuracy of rendezvous protocol performance estimation
Enabled short protocol for non-host memory types on empty messages
Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads
Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols
Added support for separate intra/inter-node rendezvous thresholds
Added support for minimal fragment size in rendezvous protocol
Added support for resetting request during send operation
Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads
Improved performance for combined Active Message/RMA scenarios by separating them to different lanes
Added support for device staging buffers in pipeline protocols
Enabled on-demand paging for Nvidia's Grace platforms by default

RDMA CORE (IB, ROCE, etc.)

Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. By default, it uses UCX_IB_SL.
Added support for GID auto-detection in Floating LID based routing
Added support for multithreading KSM registration of unaligned buffers
Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables

GPU (CUDA, ROCM)

Added support for oneAPI Level-Zero library for Intel GPUs

UCS

Added support for rcache dynamic region alignment
Added dynamic bitmap data structure
Added support for advanced key-value parsing for UCX configuration
Added piecewise linear function data structure
Added support for allocating dynamic arrays on stack

Tools

Added support for device memory allocation in UCX perftest
Added a script to use for squashing commits after PR approval
Added support for DPU cross-gvmi daemon in UCX perftest

Java

Added support for EP local socket address API in JUCX

Build

Added address sanitizer support
Added a helper shell script to run static checks

AZP

Replaced Valgrind tests with address sanitizer tool
Added Ubuntu 22.04 docker image testing

Configuration

Added support for filtering configuration sections by platform type
Added configuration file with section for Grace Hopper

Bugfixes:

UCP

Fixed crash due to incorrect lane selection when active message is disabled
Fixed RMA lane selection issue due to wrong bandwidth calculation
Fixed rendezvous protocol information in protocol details table
Fixed endpoint reconfiguration issue due to wrong bandwidth calculation
Fixed Active Message handlers issue due to out of order registration
Fixed registration of memh evens for imported memory key
Fixed sockaddr unreachable destination error handling
Fixed uninitialized memory issue in new protocols infrastructure
Fixed race condition when using strong fence by flushing all endpoints
Fixed incorrect RMA message size on immediate completion with no datatype
Fixed incorrect performance estimation due to fp8 pack/unpack issue
Fixed remote access error when rcache memory is not registered with atomic access
Fixed assertion failure when rcache fails during memh allocation
Fixed atomic device selection issue
Fixed worker interface deactivation while still in use by endpoints

RDMA CORE (IB, ROCE, etc.)

Disabled device memory if atomics are not available
Fixed indirect keys creation for MT registered memory
Fixed KSM start address value when creating export key
Fixed DCI pool index to support maximum of 16 pools
Fixed atomic rkey issue when using imported memory
Fixed crash due to unsupported SRQ capability

GPU (CUDA, ROCM)

Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport
Fixed usage of CUDA device 0 when no context is active
Removed error handling support from CUDA IPC transport
Fixed allocation of unaligned CUDA memory

Shared Memory

Fixed occasional crash when shm_unlink fails during interface initialization

UCS

Fixed system device distance calculation for devices on different PCIe root
Fixed support for large size arrays in ucs_array
Fixed synchronization issue in rcache

Tests

Fixed test failures when GPU is present but disabled
Fixed Active Message hanging issue in ucp_client_server
Fixed potential crash due to redundant munmap call in ucp mmap tests
Fixed a crash when running CUDA gtest under valgrind
Fixed UD endpoint timeout issue under Valgrind

Java

Fixed failures in Java tests by waiting for send requests completion
Fixed JVM segfault in Java tests when gdrcopy driver is not loaded
Fixed go build and go tests failures

Packaging

Disabled Go bindings in Debian package

16 May 14:25

shasson5

v1.17.0-rc1

0233ba6

v1.17.0 RC1 Pre-release

1.17.0 RC1 (May 16, 2024)

TBD

16 Apr 13:39

shasson5

v1.16.0

e4bb802

v1.16.0

1.16.0 (April 15, 2024)

Features:

UCP

Added tag offload rendezvous protocol in new infrastructure
Added rcache to old protocols infrastructure
Added multi-fragment protocols for stream API in new infrastructure
Enabled new protocols infrastructure by default
Removed context param from ucp_memh_put
Added assertion if trying to register unsupported memory type
Adjusted rendezvous latency to improve scalability
Improved endpoint configuration logging information
Added check for max length of user defined Active Message header
Added rcache support for mem type memory registration
Enabled error handling for rndv/put_zcopy protocol
Enabled v2 as default client/server connection establishment packet version
Enabled rendezvous protocol selection for reachable MDs only
Added ucp_rkey_compare API to enable rkey comparison
Added release version to worker address to enable wire compatibility
Added support for memory invalidation for rendezvous through DC transport
Enabled the use of strong fence with new protocols infrastructure

UCT

Added UCS_MEMORY_TYPE_RDMA memory type for better latency on supported devices
Implemented is_reachable_v2 API for IB transport
Added ep_is_connected API

RDMA CORE (IB, ROCE, etc.)

Added Floating LID (FLID) based routing support
Added latency and min_zcopy configuration variables to ROCm-IPC
Added support for indirect MR for cross-gvmi mkey instead of direct MR with DEVX UMEM

TCP

Added filter to eliminate bridge devices from lane selection

GPU (CUDA, ROCM)

Added support for handling memh with multiple registrations
Added performance estimation BW based on GPU type
Adjusted rocm/ipc latency and zcopy threshold parameters
Improved error message when libnvidia-ml is not installed
Added profiling to CUDA runtime API calls
Adjusted gdr_copy estimated BW to improve protocol selection

Shared Memory

Adjusted FIFO_SIZE to improve scalability
Removed redundant rcache implementation in knem transport
Added support for symmetric rkey to improve memory usage

UCS

Improved scalability of connection establishment flow
Improved memtype cache performance by replacing the pthread lock with a spinlock
Added support for VLAN over channel bonding interface
Added LRU cache and Usage Tracker data structures
Improved cross-NUMA device detection
Added support for PCIe gen5 bandwidth detection
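
For readers unfamiliar with the LRU cache mentioned in the list above, the following Python sketch shows the general technique: a bounded map that evicts the least recently used entry when full. It is a generic illustration built on `collections.OrderedDict`, not the UCS data structure, and the class and method names are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded key/value cache that evicts the least recently used key.

    Generic illustration of the technique; not the UCS implementation.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A successful lookup makes the key the most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the entry at the front, i.e. the least recently used.
            self._items.popitem(last=False)

    def keys(self):
        """Return keys ordered from least to most recently used."""
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes most recently used
cache.put("c", 3)   # evicts "b"
print(cache.keys())
```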

Build

Added LCOV coverage report as a build option
Added binutils 2.40 library dependencies
Added development modulefile

Tools

Added information about sizes of ucp_request_t fields in ucx_info
Added ucx env to profiling output
Added MAD RTE in ucx_perftest to support setups without IPoIB

Tests

Added GTEST_LOG_LEVEL env var to set log level just before test run
Disabled protov1 and ud_verbs tests for valgrind mode
Reduced gtest execution time

Documentation

Added a few details to coding style

Bugfixes:

UCP

Reverted wireup latency calculation which caused lanes selection issue
Fixed strong fence to always ensure ordering
Fixed registration of memh for RNDV protocol
Fixed rndv_put and rkey_ptr assertion failure
Fixed performance estimation for multi-fragment protocols
Fixed memory registration error handling
Fixed buffer overflow of large log messages
Fixed progress enabling for selected lanes
Fixed atomic lanes progress enabling
Added missing rendezvous schemes to environment variable documentation
Fixed bcopy BW estimation for AMD
Fixed lanes information printing for new protocols infrastructure
Fixed rndv_am protocol thresholds
Fixed fp8 packing issue
Fixed Intel OneAPI compilation error
Fixed CM address packing on server side
Fixed endpoint reconfiguration issue due to asymmetrical selection
Fixed asymmetrical selection due to wire compatibility issue
Fixed potential deadlock with cuda_copy and RTR protocol
Fixed tag_recv return value on immediate completion
Fixed memory corruption by proper memh handling in tag offload rendezvous
Changed default allocator to not use reserved huge pages
Fixed rndv put protocol to avoid early completion
Fixed rndv_put transport selection for device to device scenario
Disabled rendezvous pipeline protocol selection when using non-contiguous buffer
Fixed crash in rendezvous protocol rkey pack after failed memory registration

RDMA CORE (IB, ROCE, etc.)

Fixed compilation failure when DevX is explicitly disabled
Fixed crash when using PCIe relaxed ordering
Fixed remote access error with rc_verbs transport
Fixed endpoint address management in unified mode
Fixed assertion failure when configured with UCX_IB_ADDR_TYPE=ib_global
Fixed overwritten MD attribute capabilities when querying a device
Fixed ibv_reg_mr error by registering memory in rcache callback
Disabled MR multithreading registration
Fixed mlx5 WQE posting error due to compiler memory copy optimizations

TCP

Fixed asymmetric lanes selection issue due to inconsistent device listing

GPU (CUDA, ROCM)

Fixed compilation flags to support ROCm 6.0
Fixed values of D2H_THRESH and latency params
Fixed CUDA memory support for iov datatype
Increased max number of agents in ROCm
Fixed cuda_ipc transport being disabled if a CUDA device is not set during initialization

Shared Memory

Fixed posix and cma transport selection by enhancing reachability checks
Fixed UGNI build failure
Fixed latency overhead for knem and cma transports
Fixed possible out-of-order issue in mm_iface

UCS

Fixed a deadlock when forked debugger is attached during an error in rcache operation
Fixed crash due to passing null pointer to log function
Fixed crash due to incorrect hashing method
Fixed crash in configuration parser cleanup by moving it after profiler cleanup
Fixed floating point division by zero during protocols initialization

UCM

Fixed occasional crash in bistro hooks by adding a lock before hooking
Fixed compilation error when building on PPC64

Java

Fixed go tests by setting CUDA device before allocating CUDA memory
Fixed perftest error detection and hanging issue

Tools

Fixed cpu model type for AMD Genoa in ucx_info
Enhanced multi-thread test output

Build

Fixed JUCX package publishing, so it will include support for ARM
Fixed ROCm building and testing
Removed libnvidia-compute version dependency
Removed libibmad/libumad from default build configuration to avoid runtime dependency

Packaging

Fixed already existing target error when using cmake find_package(ucx) twice
