Release v2.4.0 · NVIDIA/cccl

What’s New

We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels behind thrust::unique_by_key, thrust::copy_if, and the thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, beyond the performance improvements, it brings support for large problem sizes (64-bit offsets) to these Thrust algorithms.
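As far as these notes indicate, the consolidation changes the implementation rather than the user-facing Thrust API, so existing calls should keep working as before. The following is a minimal sketch of a thrust::copy_if call that would now be backed by the CUB kernels; the helper name keep_even and the predicate is_even are illustrative only and not part of Thrust.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

// Hypothetical predicate for illustration: selects even values.
struct is_even {
  __host__ __device__ bool operator()(int x) const { return (x % 2) == 0; }
};

// Hypothetical helper: fills a device vector with 0..n-1 and keeps the even values.
thrust::device_vector<int> keep_even(int n) {
  thrust::device_vector<int> in(n);
  thrust::sequence(in.begin(), in.end());  // 0, 1, 2, ...

  thrust::device_vector<int> out(n);
  // copy_if returns an iterator to the end of the selected range.
  auto out_end = thrust::copy_if(in.begin(), in.end(), out.begin(), is_even{});
  out.resize(out_end - out.begin());
  return out;
}
```

Exercising the 64-bit offset path would require an input with more than 2^31 elements, which is omitted here for brevity.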

CUB

cub::DeviceSelect::UniqueByKey now supports a custom equality operator and large problem sizes.
The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach. For example, cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses.
Many CUB algorithms now support CUDA graph capture mode.
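
To illustrate the cub::DeviceFor family, here is a small sketch that counts odd values with cub::DeviceFor::ForEachCopyN. It assumes the overload that takes an iterator, an item count, and a function object and manages temporary storage internally. The functor odd_count_t and the helper count_odd are names chosen for this example. Because the functor takes its argument by value, the algorithm is free to load elements in bulk instead of handing out references into the input.

```cuda
#include <cub/device/device_for.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cuda_runtime.h>

// Illustrative functor: receives a copy of each element and counts the odd ones.
struct odd_count_t {
  int* d_count;
  __device__ void operator()(int i) const {
    if (i % 2 == 1) {
      atomicAdd(d_count, 1);
    }
  }
};

// Hypothetical helper: counts odd values in 0..n-1, or returns -1 on a CUDA error.
int count_odd(int n) {
  thrust::device_vector<int> input(n);
  thrust::sequence(input.begin(), input.end());  // 0, 1, 2, ...
  thrust::device_vector<int> count(1, 0);

  odd_count_t op{thrust::raw_pointer_cast(count.data())};
  // Raw pointers make the input visibly contiguous to the algorithm.
  cudaError_t status = cub::DeviceFor::ForEachCopyN(
      thrust::raw_pointer_cast(input.data()), n, op);
  if (status != cudaSuccess) {
    return -1;
  }
  return count[0];  // the device-to-host copy waits for the kernel to finish
}
```

If the functor needs to modify elements in place, cub::DeviceFor::ForEachN with a functor taking int& would be the closer fit, since the copy variant only passes values.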

libcudacxx

Added a new cuda::ptx namespace with wrappers for inline PTX instructions.
Added cuda::std::complex specializations for the CUDA types bfloat and half.
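
As a rough sketch of what the cuda::ptx wrappers look like in use, the kernel below reads the %laneid special register through cuda::ptx::get_sreg_laneid (the get_sreg wrappers were added in #1351). The kernel name write_lane_ids and the host helper lane_ids are assumptions made for this example, and error handling is kept minimal.

```cuda
#include <cuda/ptx>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Each thread records its lane index within the warp, read via the PTX wrapper.
__global__ void write_lane_ids(std::uint32_t* out) {
  out[threadIdx.x] = cuda::ptx::get_sreg_laneid();
}

// Hypothetical helper: launches one 1D block and returns the lane id seen by each thread.
// Entries stay at 0xFFFFFFFF if the device allocation fails.
std::vector<std::uint32_t> lane_ids(int threads) {
  std::vector<std::uint32_t> host(threads, 0xFFFFFFFFu);
  std::uint32_t* dev = nullptr;
  if (cudaMalloc(&dev, threads * sizeof(std::uint32_t)) != cudaSuccess) {
    return host;
  }
  write_lane_ids<<<1, threads>>>(dev);
  // cudaMemcpy on the default stream waits for the kernel before copying.
  cudaMemcpy(host.data(), dev, threads * sizeof(std::uint32_t),
             cudaMemcpyDeviceToHost);
  cudaFree(dev);
  return host;
}
```

For a one-dimensional block and a warp size of 32, the value recorded by thread t is expected to equal t % 32.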

What's Changed

Implement remaining ranges iterator concepts and modernize array by @miscco in #627
Fix C++11 support of recently added tests by @ahendriksen in #651
Update CUDA newest to CTK 12.3 by @jrhemstad in #629
Add cuda::ptx::* namespace by @ahendriksen in #574
The test seems to pass just fine by @miscco in #654
Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
Reduce benchmarking time by @gevtushenko in #657
Add CCCL_VERSION and script for updating version by @jrhemstad in #652
Fixes compiler error for extended fp type data gen by @elstehle in #666
fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in #664
Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
Rework our system header approach to be more error proof by @miscco in #661
Project automation - fix sync action and draft setting step by @jarmak-nv in #625
Fix fallback when checking git repo by @wmaxey in #1085
Currently the verbose option does not work because of a typo in the argument handling by @miscco in #1088
Adds virtual shared memory helper and tests by @elstehle in #619
Add cuda::ptx::st_async by @ahendriksen in #1078
Add cuda::ptx::red_async by @ahendriksen in #1080
Remove libcudacxx symlinks by @wmaxey in #1075
Move PTX tests that missed the symlink PR by @wmaxey in #1098
Fix truncation of constant value by @gevtushenko in #1097
Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @ahendriksen in #674
Initial CUB/NVRTC support by @gevtushenko in #1081
Fix cuda::ptx::red.async for int32_t types by @ahendriksen in #1102
Fix local test runs with lit by @miscco in #1108
Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
Do not replace the sccache binary for windows by @miscco in #1115
Test cuda graph capture by @gevtushenko in #1112
Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
Introduce CUB transform reduce by @gevtushenko in #1091
Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
Fix GCC6 / FP8 warning by @gevtushenko in #1130
Fix thrust transform reduce bench by @gevtushenko in #1133
Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in #1132
Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in #1114
Allow filtering P0 benchmarks by @gevtushenko in #1135
Update barrier_arrive_tx.md docs by @gonzalobg in #1147
Update std iterators by @miscco in #672
Fix argument name in windows CI by @miscco in #1145
Fix XFAIL condition for subsumption tests by @miscco in #1144
Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
Use rst in block-scope docs by @gevtushenko in #1150
Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
Fix icc / cub by @gevtushenko in #1152
Abort testing on unsupported dialect flags by @wmaxey in #1158
Run with latest nvbench by @robertmaynard in #583
Set finer-grain workflow permissions by @jrhemstad in #1163
Port device docs to rst by @gevtushenko in #1160
CI log improvements by @jrhemstad in #621
Setup documentation and corresponding github action by @wmaxey in #1118
Update Docs links in README.md by @wmaxey in #1169
Fix GCC 13 by @gevtushenko in #1175
Add missing exit from run-as-coder by @jrhemstad in #1176
Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
Implement math functions for thrust::complex by @miscco in #1178
Use anchors in matrix.yaml by @jrhemstad in #1193
Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
Fix availability of is_constant_evaluated on old MSVC by @miscco in #1180
Enable std::variant for libcu++ by @miscco in #1076
Implement enable_borrowed_range by @miscco in #1196
Reduce thrust benchmarks noise by @gevtushenko in #1203
Prepare more algorithms by @miscco in #1161
Add icc compiler to CI matrix by @jrhemstad in #1159
Unify handling of dialects by @miscco in #1200
Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
Move definitions of execution space macros into cccl by @miscco in #1199
Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in #1197
Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
Fix the example on README.md by @so298 in #1220
Add missing overloads for thrust::pow by @miscco in #1222
Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
Fix examples in reduce docs by @gevtushenko in #1230
Do not benchmark small problem sizes by @gevtushenko in #1243
Implement enable_view by @miscco in #1208
Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in #1245
Fix merge conflict from incoming PR by @miscco in #1250
Disable fast-math for ICC by @miscco in #1252
Fix a typo in thrust-config.cmake by @valgur in #1259
Implement ranges::{c}begin and ranges::{c}end by @miscco in #1256
Switch to entropy-based stopping criterion by @gevtushenko in #1280
Fix a sync bug in stream_ref::wait by @PointKernel in #1238
Silence some static asserts in ptx helpers by @miscco in #1257
Restore docs images by @jrhemstad in #1285
Clarify Thrust/CUB ABI guarantees by @jrhemstad in #1269
Fix MSVC issues by @miscco in #1261
Ensure that cuda::std::pair is potentially trivially copyable by @miscco in #1249
Update packman to fix CUB docs by @gevtushenko in #1291
Implement ranges::{c}rbegin by @miscco in #1295
Make cuda::stream_ref universally available by @miscco in #1293
Properly test internal headers by @miscco in #1258
Remove remaining C++03 compatibility from unit tests by @Blonck in #1228
Add some documentation for memory_resource by @miscco in #1217
Filter axis values in perf analysis by @gevtushenko in #1304
Get CCCL revision outside of git repo by @gevtushenko in #1305
[DOC]: Move ptx.md out of extended API by @ahendriksen in #1308
Implement ranges::{c}rend by @miscco in #1301
thrust/mr: fix the case of reusing a block for a smaller alloc. by @griwes in #1232
Allow offloading samples by @gevtushenko in #1316
[DOC]: Fix documentation links by @ahendriksen in #1311
Separate windows and Linux CI matrix by @jrhemstad in #1206
Revert "Separate windows and Linux CI matrix " by @jrhemstad in #1324
Introduce CUB ForEach algorithms by @gevtushenko in #1302
Cleanup transitive includes of <cuda/std/functional> by @miscco in #1253
Implement ranges::{c}data by @miscco in #1313
Remove stale comments from README by @jrhemstad in #1328
Ports cub::DeviceMergeSort tests to Catch2 by @elstehle in #1319
Implement ranges::size and ranges::ssize by @miscco in #1330
PTX: Add helper functions for dsmem by @ahendriksen in #1336
Remove double "ignore" in discard_iterator.h docs by @gonidelis in #1342
PTX: Add cuda::ptx::fence by @ahendriksen in #1341
Replace deprecated _VSTD macro with std by @rupprecht in #1331
PTX: Add cuda::ptx::mapa and cuda::ptx::getctarank by @ahendriksen in #1345
Cleanup our __cccl_config by @miscco in #1322
Update to devcontainers 24.04 by @jrhemstad in #1357
♻️📝 Update mode example to use thrust::unique_count by @codereport in #1354
Switch to NV runners for Windows. by @wmaxey in #1356
Implement ranges::empty by @miscco in #1338
PTX: Add cuda::ptx::get_sreg by @ahendriksen in #1351
Fix godbolt link. by @jrhemstad in #1369
Implement ranges concepts by @miscco in #1364
Print helpful error message in test scripts when no GPU is found by @jrhemstad in #1362
Implement ranges::dangling by @miscco in #1371
Ensure that thrust fancy iterators are trivially_copy_constructible when possible by @miscco in #1368
Improve compiler detection defines by @Yaraslaut in #1320
Use relative includes for our public headers by @miscco in #1325
Implement ranges::view_interface by @miscco in #1377
Use checked allocators in CUB catch2 tests by @alliepiper in #1271
small update to docs for CTK by @ZelboK in #1378
Fix order of system_header suppression and includes by @miscco in #1323
Hide API accepting kernel pointers by @gevtushenko in #1395
Refactors ChooseOffsetT to use ::cuda::std and introduces alias template choose_offset_t by @elstehle in #1405
Cleanup our delegated constructor workaround by @miscco in #1404
Implement ranges::subrange by @miscco in #1387
Test large arrays in device radix sort by @alliepiper in #1349
CMake support absolute CMAKE_INSTALL_LIBDIR values by @robertmaynard in #1393
Fixes integer overflows in index computation when indexes approach numeric_limits<OffsetT>::max() by @elstehle in #1419
Fix ptx usage to account for PTX ISA availability by @miscco in #1359
Refactors thrust::copy_if to use cub::DeviceSelect by @elstehle in #1379
Fix include of <thrust/random.h> with NVC++ by @dkolsen-pgi in #1417
Do not use VLAs in cp_async_bulk_tensor_* tests by @miscco in #1423
Add support for sm_90a in <nv/target> API by @ahendriksen in #1411
Add additional build job for sm90 by @jrhemstad in #1428
Rework <span> to be latest revision by @miscco in #1415
PTX: Add cuda::ptx::cp_async_bulk_* by @ahendriksen in #1403
Prepare namespace ranges::views by @miscco in #1434
PTX: Add cuda::ptx::barrier_cluster_{arrive,wait} by @ahendriksen in #1366
Refactor thrust::[stable_]partition[_copy] to use cub::DevicePartition by @elstehle in #1435
Fix common_reference of pair by @miscco in #1438
Properly check whether a string is alphanumeric by @miscco in #1443
Remove cuda::ptx::mapa by @ahendriksen in #1442
Add cuda::ptx::tensormap_{replace,cp_fenceproxy} by @ahendriksen in #1441
Enable more algorithms for internal use by @miscco in #1432
Cleanup diagnostic handling by @miscco in #1420
Create patch 2.4.0 by @wmaxey in #1455
Address various issues from internal CI by @miscco in #1462
Extend gcc miscompilation workaround for replace.cu by @miscco in #1461
Fix CUB docs image fetcher by @gevtushenko in #1466
Add cuda::ptx::cp_reduce_async_bulk by @ahendriksen in #1445
Restore disabling benchmarks from ci scripts (removed in #493) by @wmaxey in #1458
Add test coverage for SM90 without PTX ISA 8.0 by @miscco in #1468
Ensure that we can use std::ignore on device by @miscco in #1470
Move .multicast tests out into their own file by @miscco in #1478
Ensure that we can test libcu++ against architectures < 70 by @miscco in #1475
Reduce number of instantiations in set_symmetric_difference tests by @miscco in #1476
Fix test issues against gcc-6 by @miscco in #1477
Improve code block CSS in libcu++ docs by @Nyrio in #1483
Address issues with MSVC2017 by @miscco in #1479
Remove libcxx tests by @miscco in #1480
Separate CUB's catch2 test binaries by default for CI. by @alliepiper in #1482
Add Dev Containers guide for WSL by @gonidelis in #1394
PTX: add cuda::mbarrier_init by @ahendriksen in #1491
Remove legacy Thrust/CUB CI files. by @bdice in #1504
Fix issues with ambiguous calls to addressof in thrust::optional by @miscco in #1499
Ensure that we play nicely with std::iterators by @miscco in #1511
Try harder to unwrap nested thrust::tuple_of_iterator_references by @miscco in #1469
Match_any testing single bit by fusing into single LOP3 instruction by @IlyaGrebnov in #1372
Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1497
Removes arch filtering of sm 90 for rdc builds by @elstehle in #1506
Adds test for cub::PtxVersion by @elstehle in #1521
Fix tuple backwards compatibility by @miscco in #1522
[FEA] Split ptx.h by @ahendriksen in #1520
Make libcudacxx's codegen part of CI and add it to the project. by @wmaxey in #1526
Ensure that we can run reduce_by_key with const inputs by @miscco in #1528
Disallow float offset type in cub::segmented_reduce by @gonidelis in #1430
cuda::std::complex specializations for half and bfloat by @griwes in #1140
Rebase 2.4.x with main. by @wmaxey in #1472
[BACKPORT]: Provide backfills for missing __half functionality by @miscco in #1544
[BACKPORT] Fix usage of naked array with 0 elements in sm90 barrier tests. (#1546) by @wmaxey in #1549
[BACKPORT] Fix unused variable warning for __can_use_complete_tx (#1547) by @wmaxey in #1550

New Contributors

@djns99 made their first contribution in #1074
@so298 made their first contribution in #1220
@valgur made their first contribution in #1259
@PointKernel made their first contribution in #1238
@rupprecht made their first contribution in #1331
@codereport made their first contribution in #1354
@Yaraslaut made their first contribution in #1320
@Nyrio made their first contribution in #1483
@IlyaGrebnov made their first contribution in #1372

Full Changelog: v2.3.2...v2.4.0
