Silent truncation of records in Tabix range retrieval after networking failure from S3 bucket #1851

Closed
ChristopherWilks opened this issue Oct 15, 2024 · 3 comments

@ChristopherWilks

Hi,

First, thanks for the great tools. I use Tabix/Bgzip extensively in my work and am very grateful to you folks for continuing to improve them (especially the extension of S3/GCS support)!

I think this may be related to #1037. If it is, or if it is part of another issue I missed in my brief search of the issues list, feel free to close this one or merge it in there. Relatedly, @daviesrob may be interested in this ticket.

I noticed recently that when running many concurrent tabix queries (using GNU parallel with -j80) against a small set of bgzipped/indexed files on an S3 bucket, from an EC2 instance in the same AWS region, a few of them returned empty results when records should have been retrieved. No errors were reported (return status was 0 for all queries).

I am using bash with set -exo pipefail, so I found this odd. [I'm fine with a minority of errors cropping up as long as they're reported---I'll just re-run those queries.]

My working hypothesis is that I'm overloading the networking stack (probably a receive buffer somewhere) on the system and that libcurl is reporting errors for a few of the concurrent jobs, but these aren't being fully caught and reported by Tabix. That said, libcurl may be the culprit, but I'm assuming it's not in this case.

I'm using htslib version 1.20, but the section of the code where I think this issue is (below) doesn't appear to differ between 1.20 and the current development branch.

I went back and added a manual debug fprintf to hfile_libcurl.c where I think the problem may be occurring, just before this line:

return got;

I compiled without optimizations to get full debugging info (not shown here, but I also ran a number of straces). The added statement was:

fprintf(stderr,"in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: %ld,%d,%d,%ld,%d\n",got,fp->finished,fp->final_result, to_skip, errno);

The one test instance where I saw something relevant was here:

in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 18882,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 32193,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
    Command being timed: "htslib-1.20/tabix -D s3://S3_PATH_TO_BUCKET/allpairs.byfeature.gz chr12:11456460-11457010"
....
Exit status: 0

That range has records in the bgzipped file on S3, but the output was empty. I noticed that got was 0 in the last call, which libcurl_read(...) does not treat as an error in this case.

My quick and dirty solution was to simply add:

if(got == 0) { return -1; }

and that seemed to fix it (in the sense of reporting an error when this happens, which is all I want) though I haven't run extensive tests.

I'm not claiming this fixes all the issues, but it does seem to get at a potential gap in the error checking in that file.
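
For context on the warning in the log above: BGZF files end with a fixed 28-byte empty block, and the "EOF marker is absent" message refers to that block. To rule out the stored object itself being truncated, a local copy could be checked with something like the sketch below. The function name has_bgzf_eof is hypothetical and not part of htslib; the marker bytes are the ones listed in the BGZF section of the SAM specification.

```c
#include <stdio.h>
#include <string.h>

/* The 28-byte empty BGZF block that marks end-of-file (per the SAM/BGZF spec). */
static const unsigned char BGZF_EOF[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper: returns 1 if the file ends with the marker,
 * 0 if it does not, and -1 on an I/O error. */
int has_bgzf_eof(const char *path) {
    unsigned char tail[28];
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return -1; }
    if (size < 28) { fclose(f); return 0; }   /* too short to hold the marker */
    if (fseek(f, -28L, SEEK_END) != 0) { fclose(f); return -1; }
    size_t n = fread(tail, 1, sizeof tail, f);
    fclose(f);
    if (n != sizeof tail) return -1;
    return memcmp(tail, BGZF_EOF, sizeof tail) == 0;
}
```

A return of 1 on a downloaded copy would suggest the stored file is intact and that any truncation happened in transit.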

Thanks,
Chris

@whitwham
Contributor

Using your fprintf statement I get this:
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0
at the end of every download from s3. It looks like a normal part of the process.

Can you check if it appears on your working tabixes?
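
Because a zero-byte read is also how a stream signals its normal end, treating every got == 0 as an error would flag successful downloads too. One way to tell the two cases apart, sketched here under the assumption that the expected length is known up front (for example from a Content-Length header), is to compare the bytes delivered against that length. The mock_stream type and checked_read function are hypothetical illustrations, not htslib code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical stand-in for a remote transfer; not an htslib type. */
typedef struct {
    const char *data;    /* bytes that actually arrived */
    size_t arrived;      /* how many bytes arrived before the transfer ended */
    size_t expected;     /* length promised up front (assumed to be known) */
    size_t pos;          /* bytes already handed to the caller */
} mock_stream;

/* Returns the number of bytes copied, 0 on a clean end of stream, or -1
 * with errno set to EIO if the transfer ended before `expected` bytes
 * were delivered. */
ssize_t checked_read(mock_stream *s, void *buf, size_t n) {
    size_t left = s->arrived - s->pos;
    if (left == 0) {
        if (s->pos < s->expected) {   /* ended early: report, don't return 0 */
            errno = EIO;
            return -1;
        }
        return 0;                     /* everything promised was delivered */
    }
    if (n > left) n = left;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (ssize_t) n;
}
```

With this shape, the trailing zero-byte read seen at the end of every successful download still returns 0, while a transfer that stops short surfaces as an error the caller can retry.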

@whitwham
Contributor

whitwham commented Nov 5, 2024

@ChristopherWilks, did you have a chance to do more checks?

@whitwham
Contributor

Closing because of lack of response.
