Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for potential integer under/overflow issue with reading very large files #631

Merged
merged 1 commit into from
Feb 20, 2025

Conversation

briandobbins
Copy link
Contributor

@briandobbins briandobbins commented Feb 15, 2025

Fix overflow/underflow issues during qsort for the SUBSET
rearranger.

The comparator function given to the qsort call in the rearranger
returns a 32-bit integer, but the operands can be 64-bit. Depending
on the difference in value, you can get an overflow or underflow,
which means you read an incorrect amount of data.

@dqwu
Copy link
Contributor

dqwu commented Feb 17, 2025

@briandobbins
compare_offsets is passed to qsort in the subset rearranger:

int subset_rearrange_create(iosystem_desc_t *ios, int maplen, PIO_Offset *compmap,
                            const int *gdimlen, int ndims, io_desc_t *iodesc)
{
    ...
    /* sort the mapping, this will transpose the data into IO order */
    qsort(map, iodesc->llen, sizeof(mapsort), compare_offsets);
    ...
}

E3SM/SCREAM uses the box rearranger by default, which might explain why you did not reproduce this issue on the ne1024/128-level grids.

To test the subset rearranger in E3SM/SCREAM, you need to explicitly change PIO_REARRANGER (default value is 1) to 2:
./xmlchange PIO_REARRANGER=2

@jayeshkrishna
Copy link
Contributor

Original PR text

Hi all,

This is a fix we recently made to PIO which addresses a bug when using 64-bit NetCDF mode which can result in reading an incorrect amount of data on very large grids. Basically, the comparator function given to the qsort call in the rearranger returns a 32-bit integer, but the operands can be 64-bit. Depending on the difference in value, you can get an overflow or underflow, which means you read an incorrect amount of data.

For us this only occurs in our 42M column CAM-MPAS runs with our 58-level config or higher -- that puts us at 2.43B cells. I figure SCREAM must be reading data differently (by column maybe?) since otherwise I'd have thought it'd hit there, too, on the ne1024 / 128-level grids, but evidently not! If you have insight there, I'm all ears.

Technically, this does add a few instructions to the function, but it hasn't had a measurable impact on performance on the whole.

Cheers,

Brian

@jayeshkrishna jayeshkrishna added bug Next Release Enhancements slated for the upcoming (next) release labels Feb 17, 2025
@jayeshkrishna
Copy link
Contributor

@dqwu : Can you also add a simple C unit test for this case (testing qsort with simple arrays containing large PIO offset values)?

@dqwu
Copy link
Contributor

dqwu commented Feb 17, 2025

@dqwu : Can you also add a simple C unit test for this case (testing qsort with simple arrays containing large PIO offset values)?

Yes, we can create another PR to add a simple C unit test that reproduces the integer underflow/overflow issue, which is expected to fail without the patch in this PR.

dqwu added a commit that referenced this pull request Feb 17, 2025
This PR adds additional test cases to ensure the internal comparison
function compare_offsets() correctly handles the differences between
64-bit offset values, including edge cases with extremely large or
small values.

The new test cases would fail with the current master branch, and
they are expected to be fixed by PR #631.

* dqwu/test_compare_offsets:
  Adding 64-bit offset comparison test cases to test_compare_offsets()
@dqwu
Copy link
Contributor

dqwu commented Feb 17, 2025

The potential integer under/overflow issue is now reproducible with an updated C unit test in PR #632 (to be fixed by this PR).

jayeshkrishna added a commit that referenced this pull request Feb 18, 2025
Fix overflow/underflow issues during qsort for the SUBSET
rearranger.

The comparator function given to the qsort call in the rearranger
returns a 32-bit integer, but the operands can be 64-bit. Depending
on the difference in value, you can get an overflow or underflow,
which means you read an incorrect amount of data.

* briandobbins/compare_offsets_fix:
  Fixes integer under/overflow issue when casting 64-bit offsets to 32-bit qsort comparator return value
dqwu added a commit that referenced this pull request Feb 20, 2025
This PR adds additional test cases to ensure the internal comparison
function compare_offsets() correctly handles the differences between
64-bit offset values, including edge cases with extremely large or
small values.

The new test cases would fail with the current master branch, and
they are expected to be fixed by PR #631.
@jayeshkrishna jayeshkrishna merged commit 7967c28 into E3SM-Project:master Feb 20, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Next Release Enhancements slated for the upcoming (next) release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants