Duplicate sequencing #62

Open
ericblanc20 opened this issue Jul 15, 2021 · 2 comments

@ericblanc20
Contributor

How to work with duplicate sequencing?

Duplicate sequencing

In the DKTK Master programme, patients are sometimes sequenced several times, apparently from the same tumor sample (according to DKFZ sample ids), several weeks or months apart.

When the second sequencing of the same sample comes in, I would increase the extract or library number, depending on the value of the DKFZ sample id. I would then have a library with an id that looks like one of the two below:

<donor_id>-<sample_id>-DNA2-WES1
<donor_id>-<sample_id>-DNA1-WES2

Current situation

  • SODAR allows any sample, extract & library id, in particular <donor_id>-<sample_id>-DNA2-WES1 in its ISATAB description & in the directory structure.
  • cubi-tk sodar pull-raw-data downloads only one library per sample and per assay. It means that if both <donor_id>-<sample_id>-DNA1-WES1 and <donor_id>-<sample_id>-DNA2-WES1 are present in SODAR, only one of them will be downloaded (I am not sure which one).
  • snappy silently assigns the sample id to <donor_id>-<sample_id>-DNA1-WES1, regardless of the input folder name.
  • Because of the behaviour above, results obtained from <donor_id>-<sample_id>-DNA2-WES1 could be uploaded to <donor_id>-<sample_id>-DNA1-WES1 by cubi-tk.
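
One way to highlight these cases would be to scan the library names of a project and report every sample that has more than one extract or library number. The sketch below is only an illustration and is not part of cubi-tk or snappy: the function names `split_library_name` and `find_resequenced` are hypothetical, and it assumes that library names follow the `<donor_id>-<sample_id>-<extract><n>-<library><n>` pattern described above, with no hyphen inside the last three fields.

```python
import re
from collections import defaultdict

# Splits a field such as "RNA2" or "mRNA_seq1" into its kind and trailing number.
_NUMBERED = re.compile(r"^(?P<kind>.*?)(?P<number>\d+)$")


def split_library_name(name):
    """Return (donor, sample, extract kind, library kind) for a library name.

    Assumes the naming pattern <donor>-<sample>-<extract><n>-<library><n>;
    the name is split from the right so that the donor id may contain hyphens.
    """
    donor, sample, extract, library = name.rsplit("-", 3)
    extract_match = _NUMBERED.match(extract)
    library_match = _NUMBERED.match(library)
    if extract_match is None or library_match is None:
        raise ValueError(f"unexpected library name: {name!r}")
    return donor, sample, extract_match["kind"], library_match["kind"]


def find_resequenced(names):
    """Group library names that differ only by extract or library number."""
    groups = defaultdict(list)
    for name in names:
        groups[split_library_name(name)].append(name)
    # Keep only the groups that contain more than one library.
    return {key: sorted(libs) for key, libs in groups.items() if len(libs) > 1}
```

Called on the library names of an assay, the returned dictionary lists the samples for which the user would have to choose a library explicitly.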

Points for discussion

  • The current situation can lead to mistakes (although such cases are rare). What is the best way to highlight potential problems to the user?
  • There is a case for keeping both sequencing datasets in SODAR, because results from the first sequencing may have gone to the clinicians, and I think it is important to keep track of them. Besides, duplicate sequencing can be useful when benchmarking methods.
  • Am I overlooking anything, either in snappy or cubi-tk, that would address the issue?
  • Guidelines to deal with these problems?
@ericblanc20
Contributor Author

Example

  • SODAR project UUID: 1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6
  • Assay: transcriptome profiling, UUID c7e98062-8f07-4427-b3d3-a780c7226e8f

In this project, the first tumor sample of donor 10_LR has been re-sequenced. Libraries 10_LR-T1-RNA1-mRNA_seq1 & 10_LR-T1-RNA2-mRNA_seq1 are both present in SODAR.

cubi-tk download

Download command

cubi-tk sodar pull-raw-data \
    --assay c7e98062-8f07-4427-b3d3-a780c7226e8f \
    1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6 \
    downloaded_from_SODAR

List of files for tumor samples of donor 10_LR:

(sak) [blance_c@med0624 cubi-tk_tst]$ ls -d downloaded_from_SODAR/10_LR-T*
downloaded_from_SODAR/10_LR-T1-RNA2-mRNA_seq1  downloaded_from_SODAR/10_LR-T2-RNA1-mRNA_seq1  downloaded_from_SODAR/10_LR-T3-RNA1-mRNA_seq1

The files for sample 10_LR-T1-RNA1-mRNA_seq1 are missing, and the fastq files for 10_LR-T1 are indeed those for 10_LR-T1-RNA2-mRNA_seq1 (no mixup between RNA1 & RNA2 at this level).
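
A missing library like this one could be detected after the download by comparing the folders on disk with the library names expected from SODAR. The sketch below is an assumption-laden illustration: `missing_libraries` is a hypothetical helper, not a cubi-tk function, and the list of expected library names has to be supplied by the caller (for example, copied from the ISA-tab assay table).

```python
from pathlib import Path


def missing_libraries(expected_names, download_dir):
    """Return the expected library names that have no folder in download_dir."""
    present = {path.name for path in Path(download_dir).iterdir() if path.is_dir()}
    return sorted(set(expected_names) - present)
```

With the example above, passing both `10_LR-T1-RNA1-mRNA_seq1` and `10_LR-T1-RNA2-mRNA_seq1` as expected names and `downloaded_from_SODAR` as the directory would report the RNA1 library as missing.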

Running the snappy pipeline

Using the cancer sample sheet below:

[Metadata]				
schema	matched_cancer			
schema_version	v1			
title	Re-sequencing
description	Example of re-sequencing same data (different extracts)
				
[Data]				
patientName	sampleName	isTumor	libraryType	folderName
10_LR	T1	Y	mRNA_seq	10_LR-T1-RNA2-mRNA_seq1
10_LR	T2	Y	mRNA_seq	10_LR-T2-RNA1-mRNA_seq1
10_LR	T3	Y	mRNA_seq	10_LR-T3-RNA1-mRNA_seq1

snappy renames library 10_LR-T1-RNA2-mRNA_seq1 to 10_LR-T1-RNA1-mRNA_seq1 (here shown for the hla_typing step, but an identical behaviour is observed with ngs_mapping):

(snappy) [blance_c@med0624 hla_typing]$ snappy-snake --cores 4 > /dev/null 2>&1
(snappy) [blance_c@med0624 hla_typing]$ ls output
optitype.10_LR-T1-RNA1-mRNA_seq1  optitype.10_LR-T2-RNA1-mRNA_seq1  optitype.10_LR-T3-RNA1-mRNA_seq1

The results for sample 10_LR-T1 are named after library 10_LR-T1-RNA1-mRNA_seq1, but they are actually obtained from library 10_LR-T1-RNA2-mRNA_seq1.

Note that the problem doesn't appear to show up when biomedsheet's generic schema is used.
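
Until this is addressed in the tools, a sample sheet could be screened before running the pipeline for folder names whose extract or library number is not 1, since those are the rows affected by the renaming observed above. This is only a sketch based on that observation, not on snappy's actual naming code: `read_data_rows` and `folders_with_nondefault_numbers` are hypothetical helpers, and the check assumes the tab-separated matched_cancer layout shown in the example, with `folderName` of the form `<patientName>-<sampleName>-<extract><n>-<library><n>`.

```python
import csv
import io
import re


def read_data_rows(sheet_text):
    """Parse the [Data] section of a tab-separated sample sheet into dicts."""
    lines = sheet_text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip() == "[Data]")
    data_lines = [line for line in lines[start + 1:] if line.strip()]
    return list(csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t"))


def folders_with_nondefault_numbers(rows):
    """Return folder names whose extract or library number differs from 1.

    Folder names that do not start with <patientName>-<sampleName>- or that
    do not have exactly two trailing fields are flagged as well.
    """
    flagged = []
    for row in rows:
        prefix = f"{row['patientName']}-{row['sampleName']}-"
        folder = row["folderName"]
        if not folder.startswith(prefix):
            flagged.append(folder)
            continue
        parts = folder[len(prefix):].split("-")
        numbers = [re.search(r"(\d+)$", part) for part in parts]
        if len(parts) != 2 or any(m is None or m.group(1) != "1" for m in numbers):
            flagged.append(folder)
    return flagged
```

Applied to the sheet above, the check would flag only 10_LR-T1-RNA2-mRNA_seq1, which is the library whose results end up under the RNA1 name.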

@eudesbarbosa
Member

Changes in snappy pull-raw-data (#118) partially address this, as they clarify which files are being transferred (see below). The catch is that this would require a snappy-project and #120 to be fixed:

Command:

cubi-tk snappy pull-raw-data \
           --assay-uuid c7e98062-8f07-4427-b3d3-a780c7226e8f \
           1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6

Directory structure:

`-- <SAMPLE_ID>-T1-RNA1-mRNA_seq1
    |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-02-14
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R2.fastq.gz
