Add htsget support for SAGE #541

Closed
wants to merge 3 commits into from

@brainstorm brainstorm commented Apr 8, 2024

This PR adds support for GA4GH's htsget protocol. To test it, I used htsget-rs, our own htsget server at @umccr, like so:

$ docker run --platform linux/amd64 -p 8081:8081 -p 8080:8080 -v $HOME/dev/umccr/sage-data/sample_data:/data/bam ghcr.io/umccr/htsget-rs:latest

I then ran SAGE with the following command line:

java -Xmx15G -jar software/sage_v3.4.jar \
-reference MDX230458 \
-reference_bam htsget://localhost:8080/reads/data/bam/MDX230458_normal.sliced \
-tumor MDX230466 \
-tumor_bam htsget://localhost:8080/reads/data/bam/MDX230466_tumor.sliced \
-ref_genome ../sage-data/reference_data/genome/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-ref_genome_version 38 \
-hotspots ../sage-data/reference_data/sage/KnownHotspots.somatic.38.vcf.gz \
-panel_bed ../sage-data/reference_data/sage/ActionableCodingPanel.38.bed.gz \
-coverage_bed ../sage-data/reference_data/sage/CoverageCodingPanel.38.bed.gz \
-high_confidence_bed ../sage-data/reference_data/sage/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed.gz \
-ensembl_data_dir ../sage-data/reference_data/sage/ensembl_data \
-write_bqr_plot \
-vis_variants chr17:7674220:C:T \
-threads 16 \
-output_vcf ../sage-data/output/MDX230466.sage.somatic.vcf.gz

Please note the htsget:// URIs in -reference_bam and -tumor_bam. That is the addition in this pull request: the ability to read inputs from a remote server rather than from a local filesystem. This change targets SAGE, but I see no reason it couldn't be applied to most (all?) of the other tools in this repo, extending the distributed storage benefit to your whole toolchain (and oncoanalyser).
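For readers new to htsget, the following is a minimal sketch of how an htsget:// URI like the ones above could map to the ticket request a client sends. It is not code from this PR. The helper name htsget_ticket_url is hypothetical, and the sketch assumes the ticket server speaks plain HTTP on the port given in the URI. The query parameters (format, referenceName, start, end) follow the htsget convention of 0-based, half-open coordinates.

```shell
#!/bin/sh
# Sketch only: build the htsget ticket URL that corresponds to an htsget:// URI.
# Assumptions (not from the PR): the ticket server speaks plain HTTP, and the
# helper name htsget_ticket_url is hypothetical.
htsget_ticket_url() {
  uri=$1    # e.g. htsget://localhost:8080/reads/data/bam/MDX230466_tumor.sliced
  ref=$2    # reference name, e.g. chr17
  start=$3  # 0-based, inclusive (htsget convention)
  end=$4    # 0-based, exclusive (htsget convention)
  # Swap the htsget:// scheme for http:// and append the query parameters.
  printf 'http://%s?format=BAM&referenceName=%s&start=%s&end=%s\n' \
    "${uri#htsget://}" "$ref" "$start" "$end"
}

# The 1-based position chr17:7674220 becomes the half-open range [7674219, 7674220).
htsget_ticket_url htsget://localhost:8080/reads/data/bam/MDX230466_tumor.sliced chr17 7674219 7674220
```

Passing the printed URL to an HTTP client such as curl should return a JSON ticket listing the data URLs to fetch, provided the server is running and the id resolves. With htsjdk, SAGE does not need to build these URLs itself.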

The command line arguments above (and the -v argument to the docker container) assume large, private data stored in sage-data that I cannot share publicly, but I hope you can reproduce this with your own data on your premises. I found it hard to put together a minimal integration test since it involves quite big files. On the unit/functional side, I'm assuming that htsjdk already has enough test coverage for htsget, but I'd be happy to take guidance on any tests you find lacking in this PR.

It would be preferable to extend this htsget support to VCF files as well as BAM files, but unfortunately the htsjdk library has no support for that at the time of writing, /cc @lbergelson, @cmnbroad.

Thanks @scwatts @ohofmann @reisingerf @mmalenic for making this possible!

@brainstorm brainstorm marked this pull request as ready for review April 8, 2024 10:26
@brainstorm
Author

A successful run's output should look like this:

20:31:40.419 [INFO ] Sage version 6.6.6
20:31:41.330 [INFO ] read 7384 coverage entries from bed file: ../sage-data/reference_data/sage/CoverageCodingPanel.38.bed.gz
20:31:41.338 [INFO ] read 5976 panel entries from bed file: ../sage-data/reference_data/sage/ActionableCodingPanel.38.bed.gz
20:31:41.382 [INFO ] read 9434 hotspots from vcf: ../sage-data/reference_data/sage/KnownHotspots.somatic.38.vcf.gz
20:31:41.595 [INFO ] read 438100 high-confidence entries from bed file: ../sage-data/reference_data/sage/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed.gz
20:31:41.610 [INFO ] writing to file: ../sage-data/output/MDX230466.sage.somatic.vcf.gz
WARNING	2024-04-08 20:31:42	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:42	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:43	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:43	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
20:31:43.926 [INFO ] base quality recalibration cache generated
WARNING	2024-04-08 20:31:43	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
20:31:44.014 [INFO ] chromosome(chr17) executing 1 regions
WARNING	2024-04-08 20:31:44	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:44	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:44	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:44	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
WARNING	2024-04-08 20:31:44	HtsgetRequest	Supported htsget protocol version: vnd.ga4gh.htsget.v1.2.0may not be compatible with received content type: application/json
20:31:46.051 [INFO ] chromosome(chr17) analysis complete
20:31:46.171 [INFO ] Sage complete, mins(0.076)

Please note that the WARNINGs should be addressed upstream by the htsjdk team: our server now uses protocol version 1.3.0, and htsjdk needs to catch up with it.

@brainstorm
Author

I chose to add the --no-release flag to hmftools-build.py since I didn't want to break your current CI/CD scripts. That said, I think hmftools.py [build|release] or similar would suit third-party developers better than the current "deploy by default" setup.
