- #553: Deprecate the
CUB_USE_COOPERATIVE_GROUPS
macro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.
- #359: Add new
DeviceBatchMemcpy
algorithm. - #565: Add
DeviceMergeSort::StableSortKeysCopy
API. Thanks to David Wendt (@davidwendt) for this contribution. - #585: Add SM90 tuning policy for
DeviceRadixSort
. Thanks to Andy Adinets (@canonizer) for this contribution. - #586: Introduce a new mechanism to opt-out of compiling CDP support in CUB algorithms by
defining
CUB_DISABLE_CDP
. - #589: Support 64-bit indexing in
DeviceReduce
. - #607: Support 128-bit integers in radix sort.
- #547: Resolve several long-running issues resulting from using multiple versions of CUB within the same process. Adds an inline namespace that encodes CUB version and targeted PTX architectures.
- #562: Fix bug in
BlockShuffle
resulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution. - #564: Fix bug in
BlockRadixRank
when used with blocks that are not a multiple of 32 threads. - #579: Ensure that all threads in the logical warp participate in the index-shuffle
for
BlockRadixRank
. Thanks to Andy Adinets (@canonizer) for this contribution. - #582: Fix reordering in CUB member initializer lists.
- #589: Fix
DeviceSegmentedSort
when used withbool
keys. - #590: Fix CUB's CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this contribution.
- #592: Fix overflow in
DeviceReduce
. - #598: Fix
DeviceRunLengthEncode
when the first item is aNaN
. - #611: Fix
WarpScanExclusive
for vector types.
- #537: Add detailed and expanded version of a CUB developer overview.
- #549: Fix
BlockReduceRaking
docs for non-commutative operations. Thanks to Tobias Ribizel (@upsj) for this contribution. - #606: Optimize CUB's decoupled-lookback implementation.
- Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and not required for correctness, and a warning will be emitted if this happens.
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several
breaking changes. These include new diagnostics when inspecting device-only
lambdas from the host, an updated method of determining accumulator types for
algorithms like Reduce and Scan, and a compile-time replacement for the
runtime debug_synchronous
debugging flags.
This release also includes several new features. DeviceHistogram
now
supports __half
and better handles various edge cases. WarpReduce
now
performs correctly when restricted to a single-thread “warp”, and will use
the __reduce_add_sync
accelerated intrinsic (introduced with Ampere) when
appropriate. DeviceRadixSort
learned to handle the case
where begin_bit == end_bit
.
Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
- #448 Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They
can be re-enabled by defining
CUB_PROVIDE_LEGACY_ARCH_MACROS
. These will be completely removed in a future release.CUB_IS_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_IS_DEVICE_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_DEVICE_CODE
: Replace withNV_IF_TARGET
.
- #486: CUB's CUDA Runtime support macros have been updated to
support
NV_IF_TARGET
. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.CUB_RUNTIME_FUNCTION
: Execution space annotations for functions that invoke CUDA Runtime APIs.- Old behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled:
- NVCC host pass: Defined to
__host__ __device__
- NVCC device pass: Defined to
__host__
- NVCC host pass: Defined to
- RDC enabled: Defined to
- New behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled: Defined to
__host__
- RDC enabled: Defined to
- Old behavior:
CUB_RUNTIME_ENABLED
: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:- RDC enabled: Macro is defined.
- RDC not enabled:
- NVCC host pass: Macro is defined.
- NVCC device pass: Macro is not defined.
CUB_RDC_ENABLED
: New macro, may be combined withNV_IF_TARGET
to replace most usages ofCUB_RUNTIME_ENABLED
. Behavior:- RDC enabled: Macro is defined.
- RDC not enabled: Macro is not defined.
- #509: A compile-time error is now emitted when a
__device__
-only lambda's return type is queried from host code (requires libcu++ ≥ 1.9.0).- Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
- When using libcu++ 1.9.0, an error will be emitted with information about
work-arounds:
- Use a named function object with a
__device__
-only implementation ofoperator()
. - Use a
__host__ __device__
lambda. - Use
cuda::proclaim_return_type
(Added in libcu++ 1.9.0)
- Use a named function object with a
- #509: Use the result type of the binary reduction operator for
accumulating intermediate results in the
DeviceReduce
algorithm, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator's signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for
the
Dispatch*Reduce
layer have changed:DispatchReduce
:- Now accepts accumulator type as last parameter.
- Now accepts initializer type instead of output iterator value type.
- Constructor now accepts
init
as initial type instead of output iterator value type.
DispatchSegmentedReduce
:- Accepts accumulator type as last parameter.
- Accepts initializer type instead of output iterator value type.
- Thread operators now accept parameters using different types:
Equality
,Inequality
,InequalityWrapper
,Sum
,Difference
,Division
,Max
,ArgMax
,Min
,ArgMin
. ThreadReduce
now accepts accumulator type and uses a different type forprefix
.
- #511: Use the result type of the binary operator for accumulating
intermediate results in the
DeviceScan
,DeviceScanByKey
, andDeviceReduceByKey
algorithms, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator's signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch
layer have changed:DispatchScan
now accepts accumulator type as a template parameter.DispatchScanByKey
now accepts accumulator type as a template parameter.DispatchReduceByKey
now accepts accumulator type as the last template parameter.
- #527: Deprecate the
debug_synchronous
flags on device algorithms.- This flag no longer has any effect. Define
CUB_DEBUG_SYNC
during compilation to enable these checks. - Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
- This flag no longer has any effect. Define
- #514: Support
__half
inDeviceHistogram
. - #516: Add support for single-threaded invocations of
WarpReduce
. - #516: Use
__reduce_add_sync
hardware acceleration forWarpReduce
on supported architectures.
- #481: Fix the device-wide radix sort implementations to simply copy
the input to the output when
begin_bit == end_bit
. - #487: Fix
DeviceHistogram::Even
for a variety of edge cases:- Bin ids are now correctly computed when mixing different types for
SampleT
andLevelT
. - Bin ids are now correctly computed when
LevelT
is an integral type and the number of levels does not evenly divide the level range.
- Bin ids are now correctly computed when mixing different types for
- #508: Ensure that
temp_storage_bytes
is properly set in theAdjacentDifferenceCopy
device algorithms. - #508: Remove excessive calls to the binary operator given to
the
AdjacentDifferenceCopy
device algorithms. - #533: Fix debugging utilities when RDC is disabled.
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of
__CUDA_ARCH__
with<nv/target>
to handle host/device code divergence. - #448: Mark unused PTX arch parameters as legacy.
- #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
- #480: Add
CUB_DISABLE_BF16_SUPPORT
to avoid including thecuda_bf16.h
header or using the__nv_bfloat16
type. - #486: Add debug log messages for post-kernel debug synchronizations.
- #490: Clarify documentation for in-place usage of
DeviceScan
algorithms. - #494: Clarify documentation for in-place usage of
DeviceHistogram
algorithms. - #495: Clarify documentation for in-place usage of
DevicePartition
algorithms. - #499: Clarify documentation for in-place usage of
Device*Sort
algorithms. - #500: Clarify documentation for in-place usage of
DeviceReduce
algorithms. - #501: Clarify documentation for in-place usage
of
DeviceRunLengthEncode
algorithms. - #503: Clarify documentation for in-place usage of
DeviceSelect
algorithms. - #518: Fix typo in
WarpMergeSort
documentation. - #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.
CUB 1.17.2 is a minor bugfix release.
- #547: Introduce an annotated inline namespace to prevent issues with collisions and mismatched kernel configurations across libraries. The new namespace encodes the CUB version and target SM architectures.
CUB 1.17.1 is a minor bugfix release.
- #508: Ensure that
temp_storage_bytes
is properly set in theAdjacentDifferenceCopy
device algorithms. - #508: Remove excessive calls to the binary operator given to
the
AdjacentDifferenceCopy
device algorithms. - Fix device-side debug synchronous behavior in
DeviceSegmentedSort
.
CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.
Several CUB device algorithms are documented to provide deterministic results
(per device) for non-associative reduction operators (e.g. floating-point
addition). Unfortunately, the implementations of these algorithms contain
performance optimizations that violate this guarantee.
The DeviceReduce::ReduceByKey
and DeviceScan
algorithms are known to be
affected. We're currently evaluating the scope and impact of correcting this in
a future CUB release. See NVIDIA/cub#471 for details.
- #444: Fixed
DeviceSelect
to work with discard iterators and mixed input/output types. - #452: Fixed install issue when
CMAKE_INSTALL_LIBDIR
contained nested directories. Thanks to @robertmaynard for this contribution. - #462: Fixed bug that produced incorrect results
from
DeviceSegmentedSort
on sm_61 and sm_70. - #464: Fixed
DeviceSelect::Flagged
so that flags are normalized to 0 or 1. - #468: Fixed overflow issues in
DeviceRadixSort
givennum_items
close to 2^32. Thanks to @canonizer for this contribution. - #498: Fixed compiler regression in
BlockAdjacentDifference
. Thanks to @MKKnorr for this contribution.
- #445: Remove device-sync in
DeviceSegmentedSort
when launched via CDP. - #449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
- #450:
BlockDiscontinuity
: Replaced recursive-template loop unrolling with#pragma unroll
. Thanks to @kshitij12345 for this contribution. - #451: Replaced the deprecated
TexRefInputIterator
implementation with an alias toTexObjInputIterator
. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB. - #456:
BlockAdjacentDifference
: Replaced recursive-template loop unrolling with#pragma unroll
. Thanks to @kshitij12345 for this contribution. - #466:
cub::DeviceAdjacentDifference
API has been updated to use the newOffsetT
deduction approach described in #212. - #470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.
CUB 1.16.0 is a major release providing several improvements to the device scope
algorithms. DeviceRadixSort
now supports large (64-bit indexed) input data. A
new UniqueByKey
algorithm has been added to DeviceSelect
.
DeviceAdjacentDifference
provides new SubtractLeft
and SubtractRight
functionality.
This release also deprecates several obsolete APIs, including type traits
and BlockAdjacentDifference
algorithms. Many bugfixes and documentation
updates are also included.
Users frequently want to process large datasets using CUB's device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in #212.
The device-scope algorithms will be updated with 64-bit offset support
incrementally, starting with the cub::DeviceRadixSort
family of algorithms.
Thanks to @canonizer for contributing this functionality.
cub::DeviceSelect
now provides a UniqueByKey
algorithm, which has been
ported from Thrust. Thanks to @zasdfgbnm for this contribution.
The new cub::DeviceAdjacentDifference
interface, also ported from Thrust,
provides SubtractLeft
and SubtractRight
algorithms as CUB kernels.
A future version of CUB will change the debug_synchronous
behavior of
device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from
device-side code with debug_synchronous = true
. Such invocations will continue
to print extra debugging information, but they will no longer synchronize after
kernel launches.
CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:
Deprecated CUB Trait | Replacement STL Trait |
---|---|
cub::If | std::conditional |
cub::Equals | std::is_same |
cub::IsPointer | std::is_pointer |
cub::IsVolatile | std::is_volatile |
cub::RemoveQualifiers | std::remove_cv |
cub::EnableIf | std::enable_if |
CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.
The algorithms in cub::BlockAdjacentDifference
have been deprecated, as their
names did not clearly describe their intent. The FlagHeads
method is
now SubtractLeft
, and FlagTails
has been replaced by SubtractRight
.
- #331: Deprecate the misnamed
BlockAdjacentDifference::FlagHeads
andFlagTails
methods. Use the newSubtractLeft
andSubtractRight
methods instead. - #364: Deprecate some obsolete type traits. These should be replaced
by the equivalent traits in
<type_traits>
as described above.
- #331: Port the
thrust::adjacent_difference
kernel and expose it ascub::DeviceAdjacentDifference
. - #405: Port the
thrust::unique_by_key
kernel and expose it ascub::DeviceSelect::UniqueByKey
. Thanks to @zasdfgbnm for this contribution.
- #340: Allow 64-bit offsets in
DeviceRadixSort
public APIs. Thanks to @canonizer for this contribution. - #400: Implement a significant reduction in
DeviceMergeSort
compilation time. - #415: Support user-defined
CMAKE_INSTALL_INCLUDEDIR
values in Thrust's CMake install rules. Thanks for @robertmaynard for this contribution.
- #381: Fix shared memory alignment in
dyn_smem
example. - #393: Fix some collisions with the
min
/max
macros defined inwindows.h
. - #404: Fix bad cast in
util_device
. - #410: Fix CDP issues in
DeviceSegmentedSort
. - #411: Ensure that the
nv_exec_check_disable
pragma is only used on nvcc. - #418: Fix
-Wsizeof-array-div
warning on gcc 11. Thanks to @robertmaynard for this contribution. - #420: Fix new uninitialized variable warning in
DiscardIterator
on gcc 10. - #423: Fix some collisions with the
small
macro defined inwindows.h
. - #426: Fix some issues with version handling in CUB's CMake packages.
- #430: Remove documentation for
DeviceSpmv
parameters that are absent from public APIs. - #432: Remove incorrect documentation for
DeviceScan
algorithms that guaranteed run-to-run deterministic results for floating-point addition.
CUB 1.15.0 includes a new cub::DeviceSegmentedSort
algorithm, which
demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort
when sorting a large number of small segments. A new cub::FutureValue<T>
helper allows the cub::DeviceScan
algorithms to lazily load the
initial_value
from a pointer. cub::DeviceScan
also added ScanByKey
functionality.
The new DeviceSegmentedSort
algorithm partitions segments into size groups.
Each group is processed with specialized kernels using a variety of sorting
algorithms. This approach varies the number of threads allocated for sorting
each segment and utilizes the GPU more efficiently.
cub::FutureValue<T>
provides the ability to use the result of a previous
kernel as a scalar input to a CUB device-scope algorithm without unnecessary
synchronization:
int *d_intermediate_result = ...;
intermediate_kernel<<<blocks, threads>>>(d_intermediate_result, // output
arg1, // input
arg2); // input
// Wrap the intermediate pointer in a FutureValue -- no need to explicitly
// sync when both kernels are stream-ordered. The pointer is read after
// the ExclusiveScan kernel starts executing.
cub::FutureValue<int> init_value(d_intermediate_result);
cub::DeviceScan::ExclusiveScan(d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
cub::Sum(),
init_value,
num_items);
Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.
A future version of CUB will change the debug_synchronous
behavior of
device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from
device-side code with debug_synchronous = true
. These algorithms will continue
to print extra debugging information, but they will no longer synchronize after
kernel launches.
- #305: The template parameters of
cub::DispatchScan
have changed to support the newcub::FutureValue
helper. More details under "New Features". - #377: Remove broken
operator->()
fromcub::TransformInputIterator
, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- #305: Add overloads to
cub::DeviceScan
algorithms that allow the output of a previous kernel to be used asinitial_value
without explicit synchronization. See the newcub::FutureValue
helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution. - #354: Add
cub::BlockRunLengthDecode
algorithm. Thanks to Elias Stehle (@elstehle) for this contribution. - #357: Add
cub::DeviceSegmentedSort
, an optimized version ofcub::DeviceSegmentedSort
with improved load balancing and small array performance. - #376: Add "by key" overloads to
cub::DeviceScan
. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- #349: Doxygen and unused variable fixes.
- #363: Maintenance updates for the new
cub::DeviceMergeSort
algorithms. - #382: Fix several
-Wconversion
warnings. Thanks to Matt Stack (@matt-stack) for this contribution. - #388: Fix debug assertion on MSVC when using
cub::CachingDeviceAllocator
. - #395: Support building with
__CUDA_NO_HALF_CONVERSIONS__
. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.
This release provides the often-requested merge sort algorithm, ported from the
thrust::sort
implementation. Merge sort provides more flexibility than the
existing radix sort by supporting arbitrary data types and comparators, though
radix sorting is still faster for supported inputs. This functionality is
provided through the new cub::DeviceMergeSort
and cub::BlockMergeSort
algorithms.
The namespace wrapping mechanism has been overhauled for 1.14. The existing
macros (CUB_NS_PREFIX
/CUB_NS_POSTFIX
) can now be replaced by a single macro,
CUB_WRAPPED_NAMESPACE
, which is set to the name of the desired wrapped
namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE
macro will embed
both thrust::
and cub::
symbols in the same external namespace. The
prefix/postfix macros are still supported, but now require a new
CUB_NS_QUALIFIER
macro to be defined, which provides the fully qualified CUB
namespace (e.g. ::foo::cub
). See cub/util_namespace.cuh
for details.
- #350: When the
CUB_NS_[PRE|POST]FIX
macros are set,CUB_NS_QUALIFIER
must also be defined to the fully qualified CUB namespace (e.g.#define CUB_NS_QUALIFIER ::foo::cub
). Note that this is handled automatically when using the new[THRUST_]CUB_WRAPPED_NAMESPACE
mechanism.
- #322: Ported the merge sort algorithm from Thrust:
cub::BlockMergeSort
andcub::DeviceMergeSort
are now available. - #326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.
- #160, #163, #352: Fixed several bugs in
cub::DeviceSpmv
and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions. - #328: Fixed error handling bug and incorrect debugging output in
cub::CachingDeviceAllocator
. Thanks to Felix Kallenborn for this contribution. - #335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
- #351: Fixed some errors in the
cub::DeviceHistogram
documentation.
- #348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.
CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.
This release provides a new hook for embedding the cub::
namespace inside
a custom namespace. This is intended to work around various issues related to
linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX
and
CUB_NS_POSTFIX
macros already provided this capability; this update provides a
simpler mechanism that is extended to and integrated with Thrust. Simply define
THRUST_CUB_WRAPPED_NAMESPACE
to a namespace name, and both thrust::
and
cub::
will be placed inside the new namespace. Using different wrapped
namespaces for each shared library will prevent issues like those reported in
NVIDIA/thrust#1401.
- #326: Add
THRUST_CUB_WRAPPED_NAMESPACE
hooks.
CUB 1.13.0 is the major release accompanying the NVIDIA HPC SDK 21.7 release.
Notable new features include support for striped data arrangements in block
load/store utilities, bfloat16
radix sort support, and fewer restrictions on
offset iterators in segmented device algorithms. Several bugs
in cub::BlockShuffle
, cub::BlockDiscontinuity
, and cub::DeviceHistogram
have been addressed. The amount of code generated in cub::DeviceScan
has been
greatly reduced, leading to significant compile-time improvements when targeting
multiple PTX architectures.
This release also includes several user-contributed documentation fixes that will be reflected in CUB's online documentation in the coming weeks.
- #320: Deprecated
cub::TexRefInputIterator<T, UNIQUE_ID>
. Usecub::TexObjInputIterator<T>
as a replacement.
- #274: Add
BLOCK_LOAD_STRIPED
andBLOCK_STORE_STRIPED
functionality tocub::BlockLoadAlgorithm
andcub::BlockStoreAlgorithm
. Thanks to Matthew Nicely (@mnicely) for this contribution. - #291:
cub::DeviceSegmentedRadixSort
andcub::DeviceSegmentedReduce
now support different types for begin/end offset iterators. Thanks to Sergey Pavlov (@psvvsp) for this contribution. - #306: Add
bfloat16
support tocub::DeviceRadixSort
. Thanks to Xiang Gao (@zasdfgbnm) for this contribution. - #320: Introduce a new
CUB_IGNORE_DEPRECATED_API
macro that disables deprecation warnings on Thrust and CUB APIs.
- #277: Fixed sanitizer warnings in
RadixSortScanBinsKernels
. Thanks to Andy Adinets (@canonizer) for this contribution. - #287:
cub::DeviceHistogram
now correctly handles cases whereOffsetT
is not anint
. Thanks to Dominique LaSalle (@nv-dlasalle) for this contribution. - #311: Fixed several bugs and added tests for the
cub::BlockShuffle
collective operations. - #312: Eliminate unnecessary kernel instantiations when
compiling
cub::DeviceScan
. Thanks to Elias Stehle (@elstehle) for this contribution. - #319: Fixed out-of-bounds memory access on debugging builds
of
cub::BlockDiscontinuity::FlagHeadsAndTails
. - #320: Fixed harmless missing return statement warning in
unreachable
cub::TexObjInputIterator
code path.
- Several documentation fixes are included in this release.
- #275: Fixed comments describing the
cub::If
andcub::Equals
utilities. Thanks to Rukshan Jayasekara (@rukshan99) for this contribution. - #290: Documented that
cub::DeviceSegmentedReduce
will produce consistent results run-to-run on the same device for pseudo-associated reduction operators. Thanks to Himanshu (@himanshu007-creator) for this contribution. - #298:
CONTRIBUTING.md
now refers to Thrust's build instructions for developer builds, which is the preferred way to build the CUB test harness. Thanks to Xiang Gao (@zasdfgbnm) for contributing. - #301: Expand
cub::DeviceScan
documentation to include in-place support and add tests. Thanks to Xiang Gao (@zasdfgbnm) for this contribution. - #307: Expand
cub::DeviceRadixSort
andcub::BlockRadixSort
documentation to clarify stability, in-place support, and type-specific bitwise transformations. Thanks to Himanshu (@himanshu007-creator) for contributing. - #316: Move
WARP_TIME_SLICING
documentation to the correct location. Thanks to Peter Han (@peter9606) for this contribution. - #321: Update URLs from deprecated github.com to preferred github.io. Thanks to Lilo Huang (@lilohuang) for this contribution.
- #275: Fixed comments describing the
CUB 1.12.1 is a trivial patch release that slightly changes the phrasing of a deprecation message.
CUB 1.12.0 is a bugfix release accompanying the NVIDIA HPC SDK 21.3 and the CUDA Toolkit 11.4.
Radix sort is now stable when both +0.0 and -0.0 are present in the input (they
are treated as equivalent).
Many compilation warnings and subtle overflow bugs were fixed in the device
algorithms, including a long-standing bug that returned invalid temporary
storage requirements when num_items
was close to (but not
exceeding) INT32_MAX
.
Support for Clang < 7.0 and MSVC < 2019 (aka 19.20/16.0/14.20) is now
deprecated.
- #256: Deprecate Clang < 7 and MSVC < 2019.
- #218: Radix sort now treats -0.0 and +0.0 as equivalent for floating point types, which is required for the sort to be stable. Thanks to Andy Adinets for this contribution.
- #247: Suppress newly triggered warnings in Clang. Thanks to Andrew Corrigan for this contribution.
- #249: Enable stricter warning flags. This fixes a number of outstanding issues:
- #258: Use correct
OffsetT
inDispatchRadixSort::InitPassConfig
. Thanks to Felix Kallenborn for this contribution. - #259: Remove some problematic
__forceinline__
annotations.
- #123: Fix incorrect issue number in changelog. Thanks to Peet Whittaker for this contribution.
CUB 1.11.0 is a major release accompanying the CUDA Toolkit 11.3 release, providing bugfixes and performance enhancements.
It includes a new DeviceRadixSort
backend that improves performance by up to
2x on supported keys and hardware.
Our CMake package and build system continue to see improvements
with add_subdirectory
support, installation rules, status messages, and other
features that make CUB easier to use from CMake projects.
The release includes several other bugfixes and modernizations, and received updates from 11 contributors.
- #201: The intermediate accumulator type used when
DeviceScan
is invoked with different input/output types is now consistent with P0571. This may produce different results for some edge cases when compared with earlier releases of CUB.
- #204: Faster
DeviceRadixSort
, up to 2x performance increase for 32/64-bit keys on Pascal and up (SM60+). Thanks to Andy Adinets for this contribution. - Unroll loops in
BlockRadixRank
to improve performance for 32-bit keys by 1.5-2x on Clang CUDA. Thanks to Justin Lebar for this contribution. - #200: Allow CUB to be added to CMake projects via
add_subdirectory
. - #214: Optionally add install rules when included with
CMake's
add_subdirectory
. Thanks to Kai Germaschewski for this contribution.
- #215: Fix integer truncation in
AgentReduceByKey
,AgentScan
, andAgentSegmentFixup
. Thanks to Rory Mitchell for this contribution. - #225: Fix compile-time regression when defining
CUB_NS_PREFIX
/CUB_NS_POSTFIX
macro. Thanks to Elias Stehle for this contribution. - #210: Fix some edge cases in
DeviceScan
:- Use values from the input when padding temporary buffers. This prevents custom functors from getting unexpected values.
- Prevent integer truncation when using large indices via the
DispatchScan
layer. - Use timesliced reads/writes for types > 128 bytes.
- #217: Fix and add test for cmake package install rules. Thanks to Keith Kraus and Kai Germaschewski for testing and discussion.
- #170, #233: Update CUDA version checks to behave on Clang
CUDA and
nvc++
. Thanks to Artem Belevich, Andrew Corrigan, and David Olsen for these contributions. - #220, #216: Various fixes for Clang CUDA. Thanks to Andrew Corrigan for these contributions.
- #231: Fix signedness mismatch warnings in unit tests.
- #231: Suppress GPU deprecation warnings.
- #214: Use semantic versioning rules for our CMake package's compatibility checks. Thanks to Kai Germaschewski for this contribution.
- #214: Use
FindPackageHandleStandardArgs
to print standard status messages when our CMake package is found. Thanks to Kai Germaschewski for this contribution. - #207: Fix
CubDebug
usage inCachingDeviceAllocator::DeviceAllocate
. Thanks to Andreas Hehn for this contribution. - Fix documentation for
DevicePartition
. Thanks to ByteHamster for this contribution. - Clean up unused code in
DispatchScan
. Thanks to ByteHamster for this contribution.
- #213: Remove tuning policies for unsupported hardware (<SM35).
- References to the old Github repository and branch names were updated.
- Github's
thrust/cub
repository is nowNVIDIA/cub
- Development has moved from the
master
branch to themain
branch.
- Github's
CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release and the CUDA Toolkit 11.2 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/NVIDIA/cub/blob/main/CODE_OF_CONDUCT.md
- C++03 is no longer supported.
- GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
- C++11 is deprecated.
Using this dialect will generate a compile-time warning.
These warnings can be suppressed by defining
CUB_IGNORE_DEPRECATED_CPP_DIALECT
orCUB_IGNORE_DEPRECATED_CPP_11
. Suppression is only a short term solution. We will be dropping support for C++11 in the near future. - CMake < 3.15 is no longer supported.
- The default branch on GitHub is now called
main
.
- Added install targets to CMake builds.
- C++17 support.
- NVIDIA/thrust#1244: Check for macro collisions with system headers during header testing.
- NVIDIA/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
- #38: Fix
cub::DeviceHistogram
forsize_t
OffsetT
s. Thanks to Leo Fang for this contribution. - #35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
- #36: Qualify namespace for
va_printf
in_CubLog
. Thanks to Andrei Tchouprakov for this contribution.
CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.
- NVIDIA/thrust#1217: Move static local in cub::DeviceCount to a separate host-only function because NVC++ doesn't support static locals in host-device functions.
Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release.
It adds CMake find_package
support.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
entirely.
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
Using these dialects will generate a compile-time warning.
These warnings can be suppressed by defining
CUB_IGNORE_DEPRECATED_CPP_DIALECT
(to suppress C++03 and C++11 deprecation warnings) orCUB_IGNORE_DEPRECATED_CPP_11
(to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future. - GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
Using these compilers will generate a compile-time warning.
These warnings can be suppressed by defining
CUB_IGNORE_DEPRECATED_COMPILER
. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.
- CMake
find_package
support. Just point CMake at thecmake
folder in your CUB include directory (ex:cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .
) and then you can add CUB to your CMake project withfind_package(CUB REQUIRED CONFIG)
.
CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
Using these dialects will generate a compile-time warning.
These warnings can be suppressed by defining
CUB_IGNORE_DEPRECATED_CPP_DIALECT
(to suppress C++03 and C++11 deprecation warnings) orCUB_IGNORE_DEPRECATED_CPP11
(to suppress C++11 deprecation warnings). Suppression is only a short term solution. We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future. - GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
Using these compilers will generate a compile-time warning.
These warnings can be suppressed by defining
CUB_IGNORE_DEPRECATED_COMPILER
. Suppression is only a short term solution. We will be dropping support for these compilers in the near future.
- CMake support. Thanks to Francis Lemaire for this contribution.
- Refactorized and modernized scan dispatch layer. Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
<cub/version.cuh>
:CUB_VERSION
,CUB_VERSION_MAJOR
,CUB_VERSION_MINOR
,CUB_VERSION_SUBMINOR
, andCUB_PATCH_NUMBER
.- Platform detection machinery:
<cub/util_cpp_dialect.cuh>
: Detects the C++ standard dialect.<cub/util_compiler.cuh>
: host and device compiler detection.<cub/util_deprecated.cuh>
:CUB_DEPRECATED
.- <cub/config.cuh>
: Includes
<cub/util_arch.cuh>,
<cub/util_compiler.cuh>,
<cub/util_cpp_dialect.cuh>,
<cub/util_deprecated.cuh>,
<cub/util_macro.cuh>,
<cub/util_namespace.cuh>`
cub::DeviceCount
andcub::DeviceCountUncached
, caching abstractions forcudaGetDeviceCount
.
- Lazily initialize the per-device CUDAattribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make
cub::SwitchDevice
avoid setting/resetting the device if the current device is the same as the target device.
- Add explicit failure parameter to CAS in the CUB attribute cache to workaround a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the
lane_id
variable to suppress a warning, as this introduces a bug in optimized device code. - Fix initialization in
cub::ExclusiveSum
. Thanks to Conor Hoekstra for this contribution. - Fix initialization of the
std::array
in the CUB attribute cache. - Fix
-Wsign-compare
warnings. Thanks to Elias Stehle for this contribution. - Fix
test_block_reduce.cu
to build without parameters. Thanks to Francis Lemaire for this contribution. - Add missing includes to
grid_even_share.cuh
. Thanks to Francis Lemaire for this contribution. - Add missing includes to
thread_search.cuh
. Thanks to Francis Lemaire for this contribution. - Add missing includes to
cub.cuh
. Thanks to Felix Kallenborn for this contribution.
CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.
CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit. When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.
- (C++11 or later) Cache calls to
cudaFuncGetAttributes
andcudaDeviceGetAttribute
withincub::PtxVersion
andcub::SmVersion
. These CUDA APIs acquire locks to CUDA driver/runtime mutex and perform poorly under contention; with the caching, they are 20 to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention. DispatchReduce
now takes anOutputT
template parameter so that users can specify the intermediate type explicitly.- Radix sort tuning policies updates to fix performance issues for element types smaller than 4 bytes.
- Change initialization style from copy initialization to direct initialization
(which is more permissive) in
AgentReduce
to allow a wider range of types to be used with it. - Fix bad signed/unsigned comparisons in
WarpReduce
. - Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
CUB 1.8.0 introduces changes to the cub::Shuffle*
interfaces.
- The interfaces of
cub::ShuffleIndex
,cub::ShuffleUp
, andcub::ShuffleDown
have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.
- #112: Fix
cub::WarpScan
's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
CUB 1.7.5 adds support for radix sorting __half
keys and improved sorting
performance for 1 byte keys.
It was incorporated into Thrust 1.9.2.
- Radix sort support for
__half
keys. - Radix sort tuning policy updates to improve 1 byte key performance.
- Syntax tweaks to mollify Clang.
- #127:
cub::DeviceRunLengthEncode::Encode
returns incorrect results. - #128: 7-bit sorting passes fail for SM61 with large values.
CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.
- #114: Can't pair non-trivially-constructible values in radix sort.
- #115:
cub::WarpReduce
segmented reduction is broken in CUDA 9 for logical warp sizes smaller than 32.
CUB 1.7.3 is a minor release.
- #110:
cub::DeviceHistogram
null-pointer exception bug for iterator inputs.
CUB 1.7.2 is a minor release.
- #108: Device-wide reduction is now "run-to-run" deterministic for pseudo-associative reduction operators (like floating point addition).
CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a number of bug fixes.
- Radix sort tuning policies updated for SM7x (Volta).
- #104:
uint64_t
cub::WarpReduce
broken for CUB 1.7.0 on CUDA 8 and older. - #103: Can't mix Thrust from CUDA 9.0 and CUB.
- #102: CUB pulls in
windows.h
which definesmin
/max
macros that conflict withstd::min
/std::max
. - #99: Radix sorting crashes NVCC on Windows 10 for SM52.
- #98: cuda-memcheck: --tool initcheck failed with lineOfSight.
- #94: Git clone size.
- #93: Accept iterators for segment offsets.
- #87: CUB uses anonymous unions which is not valid C++.
- #44: Check for C++11 is incorrect for Visual Studio 2013.
CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.0-5.
- Remove
cub::WarpAll
andcub::WarpAny
. These functions served to emulate__all
and__any
functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.
- Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.
- #86: Incorrect results with reduce-by-key.
CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.
- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32 byte keys/s on TitanX and GTX 1080, respectively.
- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65:
cub::DeviceSegmentedRadixSort
should allow inputs to have pointer-to-const type. - Mollify Clang device-side warnings.
- Remove out-dated MSVC project files.
CUB 1.6.3 improves support for Windows, changes
cub::BlockLoad
/cub::BlockStore
interface to take the local data type,
and enhances radix sort performance for SM6x (Pascal) GPUs.
cub::BlockLoad
andcub::BlockStore
are now templated by the local data type, instead of theIterator
type. This allows for output iterators havingvoid
as theirvalue_type
(e.g. discard iterators).
- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc).
- #74:
cub::WarpReduce
executes reduction operator for out-of-bounds items. - #72:
cub:InequalityWrapper::operator
should be non-const. - #71:
cub::KeyValuePair
won't work ifKey
has non-trivial constructor. - #69: cub::BlockStore::Store
doesn't compile if
OutputIteratorT::value_typeisn't
T`. - #68:
cub::TilePrefixCallbackOp::WarpReduce
doesn't permit PTX arch specialization.
CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.
- Radix sort tuning policies updated for SM6x (Pascal) GPUs.
- Fix AArch64 compilation of
cub::CachingDeviceAllocator
.
CUB 1.6.1 (previously 1.5.4) is a minor release.
- Fix radix sorting bug introduced by scan refactorization.
CUB 1.6.0 changes the scan and reduce interfaces. Exclusive scans now accept an "initial value" instead of an "identity value". Scans and reductions now support differing input and output sequence types. Additionally, many bugs have been fixed.
- Device/block/warp-wide exclusive scans have been revised to now accept an "initial value" (instead of an "identity value") for seeding the computation with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are different from output sequence types (as long as they are convertible).
- Reduce repository size by moving the doxygen binary to doc repository.
- Minor reduction in
cub::BlockScan
instruction counts.
- Issue #55: Warning in
cub/device/dispatch/dispatch_reduce_by_key.cuh
. - Issue #59:
cub::DeviceScan::ExclusiveSum
can't prefix sum of float into double. - Issue #58: Infinite loop in
cub::CachingDeviceAllocator::NearestPowerOf
. - Issue #47:
cub::CachingDeviceAllocator
needs to clean up CUDA global error state upon successful retry. - Issue #46: Very high amount of needed memory from the
cub::DeviceHistogram::HistogramEven
. - Issue #45:
cub::CachingDeviceAllocator
fails with debug output enabled
CUB 1.5.2 enhances cub::CachingDeviceAllocator
and improves scan performance
for SM5x (Maxwell).
- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored
cub::CachingDeviceAllocator
:- Now spends less time locked.
- Uses C++11's
std::mutex
when available. - Failure to allocate a block from the runtime will retry once after freeing cached allocations.
- Now respects max-bin, fixing an issue where blocks in excess of max-bin were still being retained in the free cache.
- Fix for generic-type reduce-by-key
cub::WarpScan
for SM3x and newer GPUs.
CUB 1.5.1 is a minor release.
- Fix for incorrect
cub::DeviceRadixSort
output for some small problems on SM52 (Mawell) GPUs. - Fix for macro redefinition warnings when compiling
thrust::sort
.
CUB 1.5.0 introduces segmented sort and reduction primitives.
- Segmented device-wide operations for device-wide sort and reduction primitives.
- #36:
cub::ThreadLoad
generates compiler errors when loading from pointer-to-const. - #29:
cub::DeviceRadixSort::SortKeys<bool>
yields compiler errors. - #26: Misaligned address after
cub::DeviceRadixSort::SortKeys
. - #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and
users code was invoking
malloc(0)
on some systems where that returnsNULL
. CUB assumed the user was asking for the size again and not running the sort.
CUB 1.4.1 is a minor release.
- Allow
cub::DeviceRadixSort
andcub::BlockRadixSort
on bool types.
- Fix minor CUDA 7.0 performance regressions in
cub::DeviceScan
andcub::DeviceReduceByKey
. - Remove requirement for callers to define the
CUB_CDP
macro when invoking CUB device-wide rountines using CUDA dynamic parallelism. - Fix headers not being included in the proper order (or missing includes) for some block-wide functions.
CUB 1.4.0 adds cub::DeviceSpmv
, cub::DeviceRunLength::NonTrivialRuns
,
improves cub::DeviceHistogram
, and introduces support for SM5x (Maxwell)
GPUs.
cub::DeviceSpmv
methods for multiplying sparse matrices by dense vectors, load-balanced using a merge-based parallel decomposition.cub::DeviceRadixSort
sorting entry-points that always return the sorted output into the specified buffer, as opposed to thecub::DoubleBuffer
in which it could end up in either buffer.cub::DeviceRunLengthEncode::NonTrivialRuns
for finding the starting offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in a given sequence. Useful for top-down partitioning algorithms like MSD sorting of very-large keys.
- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated cub::DeviceHistogram implementation that provides the same "histogram-even" and "histogram-range" functionality as IPP/NPP. Provides extremely fast and, perhaps more importantly, very uniform performance response across diverse real-world datasets, including pathological (homogeneous) sample distributions.
CUB 1.3.2 is a minor release.
- Fix
cub::DeviceReduce
where reductions of small problems (small enough to only dispatch a single thread block) would run in the default stream (stream zero) regardless of whether an alternate stream was specified.
CUB 1.3.1 is a minor release.
- Workaround for a benign WAW race warning reported by cuda-memcheck
in
cub::BlockScan
specialized forBLOCK_SCAN_WARP_SCANS
algorithm. - Fix bug in
cub::DeviceRadixSort
where the algorithm may sort more key bits than the caller specified (up to the nearest radix digit). - Fix for ~3%
cub::DeviceRadixSort
performance regression on SM2x (Fermi) and SM3x (Kepler) GPUs.
CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide
primitives and adds an enhanced version of cub::WarpScan
.
- CUB's collective (block-wide, warp-wide) primitives underwent a minor
interface refactoring:
- To provide the appropriate support for multidimensional thread blocks,
The interfaces for collective classes are now template-parameterized by
X, Y, and Z block dimensions (with
BLOCK_DIM_Y
andBLOCK_DIM_Z
being optional, andBLOCK_DIM_X
replacingBLOCK_THREADS
). Furthermore, the constructors that accept remapped linear thread-identifiers have been removed: all primitives now assume a row-major thread-ranking for multidimensional thread blocks. - To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
- Most CUB programs having typical 1D usage should not require any changes to accomodate these updates.
- To provide the appropriate support for multidimensional thread blocks,
The interfaces for collective classes are now template-parameterized by
X, Y, and Z block dimensions (with
- Added "combination"
cub::WarpScan
methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
- Fix for bug in
cub::WarpScan
(which affectedcub::BlockScan
andcub::DeviceScan
) where incorrect results (e.g., NAN) would often be returned when parameterized for floating-point types (fp32, fp64). - Workaround for ptxas error when compiling with with -G flag on Linux (for debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute-capability: the compiler could not tell which memory space was being used collective operations and was mistakenly using global ops instead of shared ops.
CUB 1.2.3 is a minor release.
- Fixed access violation bug in
cub::DeviceReduce::ReduceByKey
for non-primitive value types. - Fixed code-snippet bug in
ArgIndexInputIteratorT
documentation.
CUB 1.2.2 adds a new variant of cub::BlockReduce
and MSVC project solections
for examples.
- MSVC project solutions for device-wide and block-wide examples
- New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).
- Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.
CUB 1.2.0 adds cub::DeviceReduce::ReduceByKey
and
cub::DeviceReduce::RunLengthEncode
and support for CUDA 6.0.
cub::DeviceReduce::ReduceByKey
.cub::DeviceReduce::RunLengthEncode
.
- Improved
cub::DeviceScan
,cub::DeviceSelect
,cub::DevicePartition
performance. - Documentation and testing:
- Added performance-portability plots for many device-wide primitives.
- Explain that iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for
cub::DeviceScan
(and similar) to be safe for current code run on future platforms (now uses proper fences).
- Fix
cub::DeviceScan
bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status. - Fix
cub::BlockScan
bug where certain exclusive scans on custom data types for theBLOCK_SCAN_WARP_SCANS
variant would return incorrect results for the first thread in the block. - Added workaround to make
cub::TexRefInputIteratorT
work with CUDA 6.0.
CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting,
cub::DeviceSelect
, cub::DevicePartition
, cub::Shuffle*
, and
cub::MaxSMOccupancy
.
Additionally, scan and sort performance for older GPUs has been improved and
many bugs have been fixed.
- Refactored block-wide I/O (
cub::BlockLoad
andcub::BlockStore
), removing cache-modifiers from their interfaces.cub::CacheModifiedInputIterator
andcub::CacheModifiedOutputIterator
should now be used withcub::BlockLoad
andcub::BlockStore
to effect that behavior.
cub::TexObjInputIterator
,cub::TexRefInputIterator
,cub::CacheModifiedInputIterator
, andcub::CacheModifiedOutputIterator
types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.- Descending sorting for
cub::DeviceRadixSort
andcub::BlockRadixSort
. - Min, max, arg-min, and arg-max operators for
cub::DeviceReduce
. cub::DeviceSelect
(select-unique, select-if, and select-flagged).cub::DevicePartition
(partition-if, partition-flagged).- Generic
cub::ShuffleUp
,cub::ShuffleDown
, andcub::ShuffleIndex
for warp-wide communication of arbitrary data types (SM3x and up). cub::MaxSmOccupancy
for accurately determining SM occupancy for any given kernel function pointer.
- Improved
cub::DeviceScan
andcub::DeviceRadixSort
performance for older GPUs (SM1x to SM3x). - Renamed device-wide
stream_synchronous
param todebug_synchronous
to avoid confusion about usage. - Documentation improvements:
- Added simple examples of device-wide methods.
- Improved doxygen documentation and example snippets.
- Improved test coverege to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).
- Fix misc `cub::DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when operating on non-primitive types for older architectures SM1x.
- SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
- For
cub::WarpScan
-based scans, not all threads in the first warp were entering the prefix callback functor. cub::DeviceRadixSort
had a race condition with key-value pairs for pre-SM35 architectures.cub::DeviceRadixSor
bitfield-extract behavior with long keys on 64-bit Linux was incorrect.cub::BlockDiscontinuity
failed to compile for types other thanint32_t
/uint32_t
.- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.
CUB 1.0.2 is a minor release.
- Corrections to code snippet examples for
cub::BlockLoad
,cub::BlockStore
, andcub::BlockDiscontinuity
. - Cleaned up unnecessary/missing header includes.
You can now safely include a specific .cuh (instead of
cub.cuh
). - Bug/compilation fixes for
cub::BlockHistogram
.
CUB 1.0.1 adds cub::DeviceRadixSort
and cub::DeviceScan
.
Numerous other performance and correctness fixes and included.
- New collective interface idiom (specialize/construct/invoke).
cub::DeviceRadixSort
. Implements short-circuiting for homogenous digit passes.cub::DeviceScan
. Implements single-pass "adaptive-lookback" strategy.
- Significantly improved documentation (with example code snippets).
- More extensive regression test suit for aggressively testing collective variants.
- Allow non-trially-constructed types (previously unions had prevented aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger than 32 bits).
- Better code generation for 64-bit addressing within
cub::BlockLoad
/cub::BlockStore
. cub::DeviceHistogram
now supports histograms of arbitrary bins.- Updates to accommodate CUDA 5.5 dynamic parallelism.
- Workarounds for SM10 codegen issues in uncommonly-used
cub::WarpScan
/cub::WarpReduce
specializations.
CUB 0.9.3 is a minor release.
- Various documentation updates and corrections.
- Fixed compilation errors for SM1x.
- Fixed compilation errors for some WarpScan entrypoints on SM3x and up.
CUB 0.9.3 adds histogram algorithms and work management utility descriptors.
cub::DevicHistogram256
.cub::BlockHistogram256
.cub::BlockScan
algorithm variantBLOCK_SCAN_RAKING_MEMOIZE
, which trades more register consumption for less shared memory I/O.cub::GridQueue
,cub::GridEvenShare
, work management utility descriptors.
- Updates to
cub::BlockRadixRank
to usecub::BlockScan
, which improves performance on SM3x by using SHFL. - Allow types other than builtin types to be used in
cub::WarpScan::*Sum
methods if they only haveoperator+
overloaded. Previously they also required to support assignment fromint(0)
. - Update
cub::BlockReduce
'sBLOCK_REDUCE_WARP_REDUCTIONS
algorithm to work even when block size is not an even multiple of warp size. - Refactoring of
cub::DeviceAllocator
interface andcub::CachingDeviceAllocator
implementation.
CUB 0.9.2 adds cub::WarpReduce
.
cub::WarpReduce
, which uses the SHFL instruction when applicable.cub::BlockReduce
now uses thiscub::WarpReduce
instead of implementing its own.
- Documentation updates and corrections.
- Fixes for 64-bit Linux compilation warnings and errors.
CUB 0.9.1 is a minor release.
- Fix for ambiguity in
cub::BlockScan::Reduce
between generic reduction and summation. Summation entrypoints are now called::Sum()
, similar to the convention incub::BlockScan
. - Small edits to documentation and download tracking.
Initial preview release. CUB is the first durable, high-performance library of cooperative block-level, warp-level, and thread-level primitives for CUDA kernel programming.