Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sysroot 2.17 on CUDA < 11.8. #745

Merged

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Jan 22, 2025

Description

Nightly CI was failing on CUDA 11.4 due to #741. This fixes it.
Old versions of nvcc for CUDA 11 do not support sysroot 2.28 until CUDA 11.8.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The cmake-format.json is up to date with these changes.
  • I have added new files under rapids-cmake/
    • I have added include guards (include_guard(GLOBAL))
    • I have added the associated docs/ rst file and update the api.rst

@bdice bdice requested a review from a team as a code owner January 22, 2025 19:48
@bdice bdice requested review from jameslamb and removed request for a team January 22, 2025 19:48
@vyasr vyasr added bug Something isn't working non-breaking Introduces a non-breaking change labels Jan 22, 2025
@vyasr
Copy link
Contributor

vyasr commented Jan 22, 2025

/merge

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bradley! 🙏

Think CUDA 11 ARM still needs 2.28. Have added a suggestion below

As the CI failure was only on x86_64, think this would be ok

Comment on lines +64 to +68
- matrix:
arch: aarch64
cuda: "11.[2456]"
packages:
- sysroot_linux-aarch64==2.17
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARM is not compatible with GLIBC 2.17

$ strings /opt/conda/envs/cuda11/lib/libcurand.so | grep GLIBC
GLIBC_2.17
GLIBC_2.27
GLIBC_2.17

So think we should stick to GLIBC 2.28 here

Suggested change
- matrix:
arch: aarch64
cuda: "11.[2456]"
packages:
- sysroot_linux-aarch64==2.17
- matrix:
arch: aarch64
cuda: "11.[2456]"
packages:
- sysroot_linux-aarch64==2.28

Copy link
Contributor Author

@bdice bdice Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. We can equivalently just erase this and let it use the matrix entry below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that looks cleaner. Let's do it 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops: this is not correct. https://anaconda.org/conda-forge/nvcc_linux-aarch64/files?version=11.4

The aarch64 packages for nvcc 11.4 (up to 11.7) require sysroot 2.17. The current state of the PR (using sysroot 2.17) is correct for all nvcc versions < 11.8, and is the same on x86 and ARM.

Copy link
Contributor Author

@bdice bdice Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that CUDA libraries require glibc versions newer than 2.17 -- but the nvcc compiler metapackage for CUDA 11 is only installable with sysroot 2.17 until we get to recent builds of nvcc 11.8 that permit newer sysroots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakirkham I am going to merge this as-is so that we can unblock nightly CI.

I think there's nothing else we should do here, but if you want to lift the restrictions for old CUDA compilers, a follow-up PR would be welcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I had looked into this recently. Here is what I observed:

Typically that is enough to justify a repodata patch and it makes sense

Additionally noticed that nvcc_{{ target_platform }}'s pinnings could be handled more systematically, which is now fixed

Following this submitted a repodata patch

The repodata patch landed before last night's nightly run, which passed in CUDA 11.4 (the affected job)

Note there is a CUDA 12.5 failure in that nightly, but that is due to actual test failures

The following tests FAILED:
	570 - install_relocatable-with-gtest_discover_tests-ninja_configure (Failed)
	571 - install_relocatable-with-gtest_discover_tests-ninja (Not Run) ninja test
Errors while running CTest

So AFAICT this issue is already fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful. I apologize, I didn't see those PRs, just the end of the Slack conversation, and thought this might have been the best we could do. I appreciate you taking the extra steps, which now allow us to simplify things.

I filed a follow-up to simplify the dependencies.yaml and use sysroot 2.28 everywhere in #749.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry as well. I meant to update this thread with those details and just hadn't had time to do so. The CUDA 12.8 rollout started very soon after

If you are feeling stuck, please feel free to DM me. Am here to help

@bdice bdice requested a review from jakirkham January 23, 2025 03:16
@bdice bdice dismissed jakirkham’s stale review January 25, 2025 00:36

Merging to unblock nightly CI, a follow-up may be created later.

@rapids-bot rapids-bot bot merged commit e84fd34 into rapidsai:branch-25.02 Jan 25, 2025
17 checks passed
@bdice bdice mentioned this pull request Jan 25, 2025
7 tasks
rapids-bot bot pushed a commit that referenced this pull request Jan 26, 2025
This removes some special cases added in #745 around sysroot pinnings. Those were needed for CUDA 11.4 until a recent repodata patch. Thanks @jakirkham!

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/jakirkham

URL: #749
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants