

XCP-D v0.10.0rc1 and running into crashes on the HPC but not my local linux box #1288

Closed
dkp opened this issue Oct 13, 2024 · 5 comments
Labels
bug Issues noting problems and PRs fixing those problems.


@dkp

dkp commented Oct 13, 2024

Summary

I am trying XCP-D v0.10.0rc1 and running into crashes on the HPC but not on my local Linux box. Version xcp-d_v0.8.3.sif runs without issue in both environments on the same fMRIPrep dataset (though 0.8.3 requires --file-format cifti --warp-surfaces-native2std).

See attached slurm log and crash report

Additional details

  • xcp_d version: XCP-D v0.10.0rc1
  • Apptainer version, local Linux box: apptainer version 1.3.3
  • Apptainer version, HPC: apptainer version 1.3.2-1.el7

OS's compared:

Local Linux box: Ubuntu 24.04.1 LTS
HPC: CentOS Linux 7 (Core)

Input data:

fmriprep 24.1.1 (run as follows):

apptainer run --cleanenv --bind ${MRIS}/data:/data:ro --bind ${APP_DERIV_DIR}:/outputs --bind ${WORK_DIR}:/work ${APP} /data /outputs participant --participant_label ${Subject} --fs-license-file ${HOME}/license.txt -w /work --stop-on-first-crash --ignore slicetiming --cifti-output 91k --output-spaces fsLR fsnative fsaverage MNI152NLin6Asym:res-2

What were you trying to do?

XCP-D command (same on both systems):

# Minimal linc run for XCPD-10.1

apptainer run --cleanenv --bind ${FMRIPREP_DERIV_DIR}:/fmriprep:ro --bind ${WORK_DIR}:/work --bind ${XCPD_DERIV_DIR}:/out ${APP} /fmriprep /out participant --participant_label ${Subject} --fs-license-file ${HOME}/license.txt --mode linc --stop-on-first-crash --head_radius 50 -w /work

What did you expect to happen?

I expected the 0.10.0 pipeline to run in both environments, just like the 0.8.3 pipeline before it.

What actually happened?

The 0.10.0 pipeline ran correctly on the local Linux box but failed on the HPC with the same call and the same data.

Reproducing the bug

This seems to be specific to some interaction with the HPC that changed between XCP-D versions 0.8.3 and 0.10.0. I have not tested intermediate versions.

crash-20241013-113252-dkp-surface_sphere_project_unproject-636e1e22-8be1-435f-b46f-0e610f0d122a.txt

slurm-xcpdfail.txt

@dkp dkp added the bug Issues noting problems and PRs fixing those problems. label Oct 13, 2024
@mattcieslak
Contributor

Hi @dkp!

I ran into this exact same issue with qsiprep a while back. It's the ABI tags in the libQt5 library. It produces a very tricky error message that says the library isn't there when it is: the host system just can't load it because of those tags, which declare a minimum Linux kernel version that older host kernels (such as the one on CentOS 7) don't meet.

@tsalo here is where the tags get stripped out. Does this happen in the xcpd build?
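To confirm this diagnosis on a particular host, one can compare the minimum kernel version recorded in a library's `.note.ABI-tag` ELF note with the kernel that is actually running. The script below is a minimal sketch, not part of XCP-D or qsiprep; it assumes GNU binutils `readelf` and GNU `sort -V` are available, and the function names `abi_tag_of`, `version_ge`, and `check_lib` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of XCP-D): check whether the running kernel
# satisfies the ELF ABI tag of one or more shared libraries.
# Assumes GNU binutils (readelf) and GNU coreutils (sort -V).
set -u

# Print the minimum kernel version stored in the library's ABI-tag note,
# e.g. "3.17.0"; prints nothing if there is no such note or readelf fails.
abi_tag_of() {
  readelf --notes "$1" 2>/dev/null | awk '/ABI:/ {print $NF}'
}

# Succeed if version $1 >= version $2, using version-aware sorting.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Report whether the running kernel meets the library's ABI tag.
check_lib() {
  local lib=$1 required kernel
  required=$(abi_tag_of "$lib")
  kernel=$(uname -r)
  kernel=${kernel%%-*}   # e.g. "3.10.0-1160.el7.x86_64" -> "3.10.0"
  if [ -z "$required" ]; then
    echo "$lib: no ABI tag"
  elif version_ge "$kernel" "$required"; then
    echo "$lib: OK (needs kernel >= $required, host has $kernel)"
  else
    echo "$lib: cannot load (needs kernel >= $required, host has $kernel)"
    return 1
  fi
}

# Check every library path passed on the command line (none: do nothing).
for lib in "$@"; do
  check_lib "$lib"
done
```

Because `apptainer exec` runs on the host kernel, running this inside the container (for example against the image's libQt5Core.so.5, wherever it is installed) would report the host kernel version from `uname -r`; the library path is an assumption that depends on the image.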

@tsalo
Member

tsalo commented Oct 15, 2024

It does not, but I can add it. Thanks!

@tsalo
Member

tsalo commented Oct 15, 2024

I just merged #1293, which should hopefully fix the problem. @dkp once pennlinc/xcp_d:unstable updates on DockerHub (should happen in ~2 hours), would you be willing to try it out on your HPC?

@dkp
Author

dkp commented Oct 15, 2024

Thank you! I will try it ASAP (hopefully today) and will let you know as soon as I have results.

@dkp dkp closed this as completed Oct 16, 2024
@dkp
Author

dkp commented Oct 16, 2024

I ran it a couple of ways, but the most recent was a clean run with no work directory and no previous derivatives.
It worked! The output looks appropriate and complete (from skimming it), and Slurm reports success. Yay!! Thank you.
