fix: CLIN-3706 support more input gvcf file extensions #52
b599930 to e1cd340
I have a single comment, about a function name.
workflows/postprocessing.nf
def tbi_input = input_channel.filter{meta, vcf -> !file(vcf + ".tbi").exists()}
def tbi_output = initial_tabix(tbi_input)
def with_generated_tbi = tbi_input.join(tbi_output).map{meta, vcf, tbi -> [meta, [vcf, tbi]]}
def exclude_mnps(input_channel, do_exclude_mnps) {
I think we now have 3 entities named exclude_mnps: a parameter, a function and a process. Maybe the function could be renamed?
Good point. I will rename it to handle_mnps.
Done
…l scenarios
There was a problem in the GenotypeGVCFs process for solo families when skipping the excludeMnps step. This PR addresses the problem and makes the pipeline more robust to variations in input file extensions. It also automatically creates the .tbi file if it is missing.
e1cd340 to a6e9c27
This pull request introduces a new step at the beginning of the workflow (BCFTOOLS_VIEW) to standardize the input VCF files.
The current implementation fails when the input file has a .gvcf extension, the family is solo, and the exclude mnps step is skipped. The failure comes from the GENOTYPEGVCFS module (from nf-core), which makes an incorrect assumption about file extensions: it treats the input as a gendb URL (a GenomicsDB workspace) whenever the extension is not .vcf.gz.
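The assumption described above can be illustrated with a minimal shell sketch. It is not the module's actual code: `classify_input` is a hypothetical helper that only mirrors the rule as stated here (anything not ending in .vcf.gz is taken to be a gendb URL), and the file names are made up.

```shell
# Hypothetical helper mirroring the extension rule described above.
# Not the nf-core module's code; illustration only.
classify_input() {
    case "$1" in
        *.vcf.gz) echo "vcf" ;;    # recognized as a plain VCF
        *)        echo "gendb" ;;  # everything else is assumed to be a gendb URL
    esac
}

classify_input sample.g.vcf.gz   # prints: vcf
classify_input sample.gvcf.gz    # prints: gendb
classify_input sample.gvcf       # prints: gendb
```

Under this rule, sample.g.vcf.gz passes because it still ends in .vcf.gz, while sample.gvcf.gz and sample.gvcf do not.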
To fix this problem and prevent other compatibility issues with file extensions and index files, we introduce a new step at the beginning of the workflow (BCFTOOLS_VIEW) that standardizes the input VCF files. The VCF files are saved with a .g.vcf.gz extension, which is standard in nf-core processes. If the index file is not present, it is generated.
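As a rough sketch of what this standardization amounts to outside Nextflow, the shell functions below rename any supported extension to .g.vcf.gz and index the result if needed. The names `standardized_name` and `standardize_vcf`, the list of accepted suffixes, and the exact bcftools options are assumptions made for illustration; the arguments the BCFTOOLS_VIEW process actually uses are not shown in this PR description.

```shell
# Hypothetical helper: map an input file name to the standardized
# .g.vcf.gz name. The suffix list is an assumption; longer suffixes
# are tried first so that .g.vcf.gz is not mistaken for .vcf.gz.
standardized_name() {
    base=$(basename "$1")
    for suffix in .g.vcf.gz .gvcf.gz .vcf.gz .g.vcf .gvcf .vcf; do
        case "$base" in
            *"$suffix") base=${base%"$suffix"}; break ;;
        esac
    done
    echo "${base}.g.vcf.gz"
}

# Hypothetical wrapper: rewrite the file as bgzip-compressed VCF and
# create a .tbi index if none exists. Requires bcftools on the PATH.
standardize_vcf() {
    out=$(standardized_name "$1")
    bcftools view -O z -o "$out" "$1"
    [ -f "${out}.tbi" ] || bcftools index -t "$out"
}

standardized_name /data/sample1.gvcf.gz   # prints: sample1.g.vcf.gz
```

Only `standardized_name` is called here; `standardize_vcf` is defined but left uncalled so the snippet runs without bcftools installed.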
Tests
Run the pipeline locally:
nextflow run main.nf -profile test,docker
-Check that the exclude mnps step is executed
-Check the output files of the BCFTOOLS_VIEW process
-Check that the bodies of the final output files are identical to those obtained from the main branch
Reproduce the file extension bug locally using the main branch. Then switch to the PR branch and check that the pipeline runs successfully.
-Compare the results with those from the initial dataset (with .g.vcf.gz extensions) and check that the body of the splitMultiAllelics output is identical.
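One way to compare the bodies of two compressed VCF files, as the checks above require, is to drop the `##` meta-information lines (which can legitimately differ between runs, for example in recorded command lines) and compare what remains. The sketch below assumes gzip-compatible (bgzip) output files; `vcf_body` and `same_body` are hypothetical helper names, and the paths in the usage comment are placeholders.

```shell
# Hypothetical helper: print a compressed VCF without its '##' header lines.
vcf_body() {
    gzip -dc "$1" | grep -v '^##'
}

# Hypothetical helper: succeed when two compressed VCFs have identical bodies.
same_body() {
    [ "$(vcf_body "$1" | cksum)" = "$(vcf_body "$2" | cksum)" ]
}

# Usage (placeholder paths):
#   same_body results_main/sample.vcf.gz results_pr/sample.vcf.gz && echo "identical"
```

The `#CHROM` column line is kept in the comparison, so a change in sample names or column order would also be reported as a difference.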
Tests in juno
Run the pipeline normally
-Check that the exclude mnps step is executed
Run the pipeline with .gvcf.gz files instead of .g.vcf.gz files, with no .tbi files, and with exclude_mnps set to false
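Inputs for this last scenario can be prepared by copying an existing dataset under the alternative extension and leaving the index files behind. The sketch below is only one possible way to do it; `make_gvcf_test_inputs` and both directory arguments are hypothetical, and the sample sheet would still need to be pointed at the copied files.

```shell
# Hypothetical helper: copy every *.g.vcf.gz file from a source directory
# to a destination directory, renamed to *.gvcf.gz. Index files (.tbi)
# are not matched by the glob, so they are not copied.
make_gvcf_test_inputs() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    for f in "$src_dir"/*.g.vcf.gz; do
        [ -e "$f" ] || continue            # no matching files: nothing to do
        name=$(basename "$f" .g.vcf.gz)
        cp "$f" "$dest_dir/${name}.gvcf.gz"
    done
}

# Usage (placeholder paths):
#   make_gvcf_test_inputs data/original data/gvcf_no_index
```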
PR checklist
-nf-core lint
-nextflow run . -profile test,docker --outdir <OUTDIR>
-nextflow run . -profile debug,test,docker --outdir <OUTDIR>
-docs/usage.md is updated.
-docs/output.md is updated.
-docs/reference_data.md is updated.
-CHANGELOG.md is updated.
-README.md is updated (including new tool citations and authors/contributors).