Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: CLIN-3706 support more input gvcf file extensions #52

Merged

Conversation

LysianeBouchard
Copy link
Contributor

@LysianeBouchard LysianeBouchard commented Dec 18, 2024

This pull request introduces a new step at the beginning of the workflow (BCFTOOLS VIEW) to standardize the input VCF files.

The current implementation will fail if the file extension is .gvcf in a solo family and if the exclude mnps step is skipped. This issue is caused by an imperfection in the GENOTYPEGVCFS module (from nf-core) that makes an incorrect assumption about file extensions. Specifically, it incorrectly assumes that the VCF file is a gendb url when the extension is not .vcf.gz.

To fix this problem and prevent other data compatibility issues with file extensions and index files, we introduce a new step at the beginning of the workflow (BCFTOOLS_VIEW) to standardize the input VCF files. The vcf files will be saved with a .g.vcf.gz extension, which is standard in nf-core processes. If the index file is not present, it will be generated.

Tests

  1. Run the pipeline locally:
    nextflow run main.nf -profile test,docker
    -The exclude mnps step should be executed
    -Check that the output files of the process BCFTOOLS_VIEW
    -Check that the bodies of the final output files are identical to those from obtain via the main branch

  2. Reproduce the file extension bug locally using the main branch. Then switch to the PR branch and check that the pipeline runs successfully.

  • The exclude mnp step should be skipped
    -Compare the results with initial dataset (with .g.vcf.gz extensions) and check that the body of the splitMultiAllelics output is identical.
  1. Remove the .tbi files and check that the pipeline still runs successfully without exclude mnps

Tests in juno

Run the pipeline normally
-Check that exclude mnp step is executed

  • the pipeline should be executed successfully

Run the pipeline with .gvcf.gz files instead .g.vcf.gz files and no .tbi files, exclude_mnps set to false

  • the exclude mnps process should be skipped
  • the pipeline should be executed successfully

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • Reference Data Documentation in docs/reference_data.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@LysianeBouchard LysianeBouchard marked this pull request as draft December 18, 2024 14:15
@LysianeBouchard LysianeBouchard force-pushed the fix/CLIN-3706-support-more-file-extensions-for-raw-gvcf branch 5 times, most recently from b599930 to e1cd340 Compare December 18, 2024 18:03
@LysianeBouchard LysianeBouchard marked this pull request as ready for review December 18, 2024 18:56
Copy link
Contributor

@FelixAntoineLeSieur FelixAntoineLeSieur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a single comment on function name

def tbi_input = input_channel.filter{meta, vcf -> !file(vcf + ".tbi").exists()}
def tbi_output = initial_tabix(tbi_input)
def with_generated_tbi = tbi_input.join(tbi_output).map{meta, vcf, tbi -> [meta, [vcf, tbi]]}
def exclude_mnps(input_channel, do_exclude_mnps) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we now have 3 entities named exlude_mnps: a parameter, a function and a process. Maybe this could be renamed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will rename to handle_mnps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…l scenarios

There was a problem in the GenotypeGVFCs process for solo families when skipping the excludeMnps step
This PR address the problem and add more robustness to the pipeline regarding input file extensions.
It will also have the benefit to automatically create the .tbi file if it is missing.
@LysianeBouchard LysianeBouchard force-pushed the fix/CLIN-3706-support-more-file-extensions-for-raw-gvcf branch from e1cd340 to a6e9c27 Compare December 18, 2024 19:39
@LysianeBouchard LysianeBouchard merged commit ea8a954 into main Dec 18, 2024
3 checks passed
@LysianeBouchard LysianeBouchard deleted the fix/CLIN-3706-support-more-file-extensions-for-raw-gvcf branch December 18, 2024 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants