v2.4.0
What’s New
We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
Thrust
As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) in Thrust algorithms.
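To make the "large problem sizes" point concrete, here is a minimal host-only sketch of why 32-bit offsets stop working once an input exceeds roughly two billion items. The helpers `fits_in_32bit_offset` and `truncate_to_32bit` are hypothetical names invented for this illustration; they are not part of Thrust or CUB and do not reflect how either library selects its offset type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, for illustration only (not a Thrust or CUB API):
// true when an item count is still representable as a signed 32-bit offset.
constexpr bool fits_in_32bit_offset(std::uint64_t num_items)
{
  return num_items <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: shows what happens when a count is forced into
// 32 bits anyway. Only the low 32 bits survive the conversion.
constexpr std::uint32_t truncate_to_32bit(std::uint64_t num_items)
{
  return static_cast<std::uint32_t>(num_items);
}

// About one million items: a 32-bit offset is sufficient.
static_assert(fits_in_32bit_offset(1ull << 20), "small problems fit in 32 bits");

// 2^32 items: the count no longer fits, so 64-bit offsets are required.
static_assert(!fits_in_32bit_offset(1ull << 32), "large problems need 64-bit offsets");

// Forcing the count into 32 bits silently wraps it around.
static_assert(truncate_to_32bit((1ull << 32) + 5) == 5, "high bits are lost");
```

The checks run at compile time, so the snippet needs no GPU. It only illustrates the arithmetic behind the 64-bit offset support described above.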
CUB
- `cub::DeviceSelect::UniqueByKey` now supports equality operator and large problem sizes.
- New cub::DeviceFor family of algorithms goes beyond conventional `cub::DeviceFor::ForEach`. `cub::DeviceFor::ForEachCopy` can provide you with additional performance benefits from vectorized memory accesses.
- Many CUB algorithms now support CUDA graph capture mode.
libcudacxx
- Added new `cuda::ptx` namespace with wrappers for inline-PTX instructions
- `cuda::std::complex` specializations for CUDA types `bfloat` and `half`.
What's Changed
- Implement remaining ranges iterator concepts and modernize array by @miscco in #627
- Fix C++11 support of recently added tests by @ahendriksen in #651
- Update CUDA newest to CTK 12.3 by @jrhemstad in #629
- Add `cuda::ptx::*` namespace by @ahendriksen in #574
- The test seems to pass just fine by @miscco in #654
- Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
- Reduce benchmarking time by @gevtushenko in #657
- Add CCCL_VERSION and script for updating version by @jrhemstad in #652
- Fixes compiler error for extended fp type data gen by @elstehle in #666
- fixup `___CUDA_VPTX` -> `_CUDA_VPTX` by @wmaxey in #664
- Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
- Rework our system header approach to be more error proof by @miscco in #661
- Project automation - fix sync action and draft setting step by @jarmak-nv in #625
- Fix fallback when checking git repo by @wmaxey in #1085
- Currently the verbose option does not work because of a typo in the argument handling by @miscco in #1088
- Adds virtual shared memory helper and tests by @elstehle in #619
- Add `cuda::ptx::st_async` by @ahendriksen in #1078
- Add `cuda::ptx::red_async` by @ahendriksen in #1080
- Remove libcudacxx symlinks by @wmaxey in #1075
- Move PTX tests that missed the symlink PR by @wmaxey in #1098
- Fix truncation of constant value by @gevtushenko in #1097
- Add `cuda::ptx:mbarrier_{try/test}_wait{_parity}` by @ahendriksen in #674
- Initial CUB/NVRTC support by @gevtushenko in #1081
- Fix `cuda::ptx::red.async` for int32_t types by @ahendriksen in #1102
- Fix local test runs with lit by @miscco in #1108
- Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
- Do not replace the sccache binary for windows by @miscco in #1115
- Test cuda graph capture by @gevtushenko in #1112
- Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
- Introduce CUB transform reduce by @gevtushenko in #1091
- Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
- Fix GCC6 / FP8 warning by @gevtushenko in #1130
- Fix thrust transform reduce bench by @gevtushenko in #1133
- Fix `ptx.st.async.compile.pass.cpp` failing in C++11. by @wmaxey in #1132
- Fix `_LIBCUDACXX_UNREACHABLE` for old MSVC by @miscco in #1114
- Allow filtering P0 benchmarks by @gevtushenko in #1135
- Update barrier_arrive_tx.md docs by @gonzalobg in #1147
- Update std iterators by @miscco in #672
- Fix argument name in windows CI by @miscco in #1145
- Fix XFAIL condition for subsumption tests by @miscco in #1144
- Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
- Use rst in block-scope docs by @gevtushenko in #1150
- Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
- Fix icc / cub by @gevtushenko in #1152
- Abort testing on unsupported dialect flags by @wmaxey in #1158
- Run with latest nvbench by @robertmaynard in #583
- Set finer-grain workflow permissions by @jrhemstad in #1163
- Port device docs to rst by @gevtushenko in #1160
- CI log improvements by @jrhemstad in #621
- Setup documentation and corresponding github action by @wmaxey in #1118
- Update Docs links in README.md by @wmaxey in #1169
- Fix GCC 13 by @gevtushenko in #1175
- Add missing exit from `run-as-coder` by @jrhemstad in #1176
- Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
- Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
- Implement math functions for `thrust::complex` by @miscco in #1178
- Use anchors in matrix.yaml by @jrhemstad in #1193
- Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
- Fix availability of `is_constant_evaluated` on old MSVC by @miscco in #1180
- Enable std::variant for libcu++ by @miscco in #1076
- Implement `enable_borrowed_range` by @miscco in #1196
- Reduce thrust benchmarks noise by @gevtushenko in #1203
- Prepare more algorithms by @miscco in #1161
- Add icc compiler to CI matrix by @jrhemstad in #1159
- Unify handling of dialects by @miscco in #1200
- Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
- Move definitions of execution space macros into `cccl` by @miscco in #1199
- Adds new virtual shared memory facility to `DeviceSelect::UniqueByKey` by @elstehle in #1197
- Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
- Fix the example on README.md by @so298 in #1220
- Add missing overloads for thrust::pow by @miscco in #1222
- Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
- Fix examples in reduce docs by @gevtushenko in #1230
- Do not benchmark small problem sizes by @gevtushenko in #1243
- Implement `enable_view` by @miscco in #1208
- Refactors `thrust::unique_by_key` to use `cub::DeviceSelect::UniqueByKey` by @elstehle in #1245
- Fix merge conflict from incoming PR by @miscco in #1250
- Disable `fast-math` for ICC by @miscco in #1252
- Fix a typo in thrust-config.cmake by @valgur in #1259
- Implement `ranges::{c}begin` and `ranges::{c}end` by @miscco in #1256
- Switch to entropy-based stopping criterion by @gevtushenko in #1280
- Fix a sync bug in `stream_ref::wait` by @PointKernel in #1238
- Silence some static asserts in ptx helpers by @miscco in #1257
- Restore docs images by @jrhemstad in #1285
- Clarify Thrust/CUB ABI guarantees by @jrhemstad in #1269
- Fix MSVC issues by @miscco in #1261
- Ensure that `cuda::std::pair` is potentially trivially copyable by @miscco in #1249
- Update packman to fix CUB docs by @gevtushenko in #1291
- Implement `ranges::{c}rbegin` by @miscco in #1295
- Make `cuda::stream_ref` universally available by @miscco in #1293
- Properly test internal headers by @miscco in #1258
- Remove remaining C++03 compatibility from unit tests by @Blonck in #1228
- Add some documentation for `memory_resource` by @miscco in #1217
- Filter axis values in perf analysis by @gevtushenko in #1304
- Get CCCL revision outside of git repo by @gevtushenko in #1305
- [DOC]: Move ptx.md out of extended API by @ahendriksen in #1308
- Implement `ranges::{c}rend` by @miscco in #1301
- thrust/mr: fix the case of reusing a block for a smaller alloc. by @griwes in #1232
- Allow offloading samples by @gevtushenko in #1316
- [DOC]: Fix documentation links by @ahendriksen in #1311
- Separate windows and Linux CI matrix by @jrhemstad in #1206
- Revert "Separate windows and Linux CI matrix " by @jrhemstad in #1324
- Introduce CUB ForEach algorithms by @gevtushenko in #1302
- Cleanup transitive includes of `<cuda/std/functional>` by @miscco in #1253
- Implement `ranges::{c}data` by @miscco in #1313
- Remove stale comments from README by @jrhemstad in #1328
- Ports `cub::DeviceMergeSort` tests to Catch2 by @elstehle in #1319
- Implement `ranges::size` and `ranges::ssize` by @miscco in #1330
- PTX: Add helper functions for dsmem by @ahendriksen in #1336
- Remove double "ignore" in discard_iterator.h docs by @gonidelis in #1342
- PTX: Add `cuda::ptx::fence` by @ahendriksen in #1341
- Replace deprecated `_VSTD` macro with `std` by @rupprecht in #1331
- PTX: Add `cuda::ptx::mapa` and `cuda::ptx::getctarank` by @ahendriksen in #1345
- Cleanup our `__cccl_config` by @miscco in #1322
- Update to devcontainers 24.04 by @jrhemstad in #1357
- ♻️📝 Update `mode` example to use `thrust::unique_count` by @codereport in #1354
- Switch to NV runners for Windows. by @wmaxey in #1356
- Implement `ranges::empty` by @miscco in #1338
- PTX: Add `cuda::ptx::get_sreg` by @ahendriksen in #1351
- Fix godbolt link. by @jrhemstad in #1369
- Implement ranges concepts by @miscco in #1364
- Print helpful error message in test scripts when no GPU is found by @jrhemstad in #1362
- Implement `ranges::dangling` by @miscco in #1371
- Ensure that thrust fancy iterators are `trivially_copy_constructible` when possible by @miscco in #1368
- Improve compiler detection defines by @Yaraslaut in #1320
- Use relative includes for our public headers by @miscco in #1325
- Implement `ranges::view_interface` by @miscco in #1377
- Use checked allocators in CUB catch2 tests by @alliepiper in #1271
- small update to docs for CTK by @ZelboK in #1378
- Fix order of system_header suppression and includes by @miscco in #1323
- Hide API accepting kernel pointers by @gevtushenko in #1395
- Refactors `ChooseOffsetT` to use `::cuda::std` and introduces alias template `choose_offset_t` by @elstehle in #1405
- Cleanup our delegated constructor workaround by @miscco in #1404
- Implement `ranges::subrange` by @miscco in #1387
- Test large arrays in device radix sort by @alliepiper in #1349
- CMake support absolute CMAKE_INSTALL_LIBDIR values by @robertmaynard in #1393
- Fixes integer overflows in index computation when indexes approach `numeric_limits<OffsetT>::max()` by @elstehle in #1419
- Fix ptx usage to account for PTX ISA availability by @miscco in #1359
- Refactors `thrust::copy_if` to use `cub::DeviceSelect` by @elstehle in #1379
- Fix include of <thrust/random.h> with NVC++ by @dkolsen-pgi in #1417
- Do not use VLAs in `cp_async_bulk_tensor_*` tests by @miscco in #1423
- Add support for sm_90a in <nv/target> API by @ahendriksen in #1411
- Add additional build job for sm90 by @jrhemstad in #1428
- Rework `<span>` to be latest revision by @miscco in #1415
- PTX: Add `cuda::ptx:cp_async_bulk_*` by @ahendriksen in #1403
- Prepare namespace `ranges::views` by @miscco in #1434
- PTX: Add `cuda::ptx:barrier_cluster_{arrive,wait}` by @ahendriksen in #1366
- Refactor `thrust::[stable_]partition[_copy]` to use `cub::DevicePartition` by @elstehle in #1435
- Fix `common_reference` of `pair` by @miscco in #1438
- Properly check whether a string is alphanumeric by @miscco in #1443
- Remove `cuda::ptx::mapa` by @ahendriksen in #1442
- Add `cuda::ptx:tensormap_{replace,cp_fenceproxy}` by @ahendriksen in #1441
- Enable more algorithms for internal use by @miscco in #1432
- Cleanup diagnostic handling by @miscco in #1420
- Create patch 2.4.0 by @wmaxey in #1455
- Address various issues from internal CI by @miscco in #1462
- Extend gcc miscompilation workaround for replace.cu by @miscco in #1461
- Fix CUB docs image fetcher by @gevtushenko in #1466
- Add `cuda::ptx::cp_reduce_async_bulk` by @ahendriksen in #1445
- Restore disabling benchmarks from ci scripts (removed in #493) by @wmaxey in #1458
- Add test coverage for SM90 without PTX ISA 8.0 by @miscco in #1468
- Ensure that we can use `std::ignore` on device by @miscco in #1470
- Move `.multicast` tests out into their own file by @miscco in #1478
- Ensure that we can test libcu++ against architectures < 70 by @miscco in #1475
- Reduce number of instantiations in `set_symmetric_difference` tests by @miscco in #1476
- Fix test issues against gcc-6 by @miscco in #1477
- Improve code block CSS in libcu++ docs by @Nyrio in #1483
- Address issues with MSVC2017 by @miscco in #1479
- Remove libcxx tests by @miscco in #1480
- Separate CUB's catch2 test binaries by default for CI. by @alliepiper in #1482
- Add Dev Containers guide for WSL by @gonidelis in #1394
- PTX: add `cuda::mbarrier_init` by @ahendriksen in #1491
- Remove legacy Thrust/CUB CI files. by @bdice in #1504
- Fix issues with ambiguous calls to `addressof` in `thrust::optional` by @miscco in #1499
- Ensure that we play nicely with std::iterators by @miscco in #1511
- Try harder to unwrap nested `thrust::tuple_of_iterator_references` by @miscco in #1469
- Match_any testing single bit by fusing into single LOP3 instruction by @IlyaGrebnov in #1372
- Revert "Refactor `thrust::complex` as a struct derived from `cuda::std::complex` (#454)" by @miscco in #1497
- Removes arch filtering of sm 90 for rdc builds by @elstehle in #1506
- Adds test for `cub::PtxVersion` by @elstehle in #1521
- Fix tuple backwards compatibility by @miscco in #1522
- [FEA] Split ptx.h by @ahendriksen in #1520
- Make libcudacxx's codegen part of CI and add it to the project. by @wmaxey in #1526
- Ensure that we can run `reduce_by_key` with const inputs by @miscco in #1528
- Disallow float offset type in cub::segmented_reducde by @gonidelis in #1430
- cuda::std::complex specializations for half and bfloat by @griwes in #1140
- Rebase 2.4.x with main. by @wmaxey in #1472
- [BACKPORT]: Provide backfills for missing `__half` functionality by @miscco in #1544
- [BACKPORT] Fix usage of naked array with 0 elements in sm90 barrier tests. (#1546) by @wmaxey in #1549
- [BACKPORT] Fix unused variable warning for __can_use_complete_tx (#1547) by @wmaxey in #1550
New Contributors
- @djns99 made their first contribution in #1074
- @so298 made their first contribution in #1220
- @valgur made their first contribution in #1259
- @PointKernel made their first contribution in #1238
- @rupprecht made their first contribution in #1331
- @codereport made their first contribution in #1354
- @Yaraslaut made their first contribution in #1320
- @Nyrio made their first contribution in #1483
- @IlyaGrebnov made their first contribution in #1372
Full Changelog: v2.3.2...v2.4.0