Duplicate sequencing #62

ericblanc20 · 2021-07-15T09:34:30Z

How to work with duplicate sequencing?

Duplicate sequencing

In the DKTK Master programme, patients are sometimes sequenced several times, apparently from the same tumor sample (according to DKFZ sample ids), several weeks or months apart.

When the second sequencing of the same sample comes in, I would increase the extract or library number, depending on the value of the DKFZ sample id. I would then have a library with an id that looks like one of the two below:

<donor_id>-<sample_id>-DNA2-WES1
<donor_id>-<sample_id>-DNA1-WES2
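
To make the naming scheme concrete, here is a minimal sketch that splits such an id into its parts. It assumes the four fields are separated by hyphens and that the extract and library fields end in a counter; the names `LibraryId` and `parse_library_id` and the donor id `P001` are made up for illustration and are not part of SODAR, cubi-tk or snappy.

```python
import re
from typing import NamedTuple, Tuple


class LibraryId(NamedTuple):
    donor: str
    sample: str
    extract_type: str
    extract_no: int
    library_type: str
    library_no: int


def _split_counter(field: str) -> Tuple[str, int]:
    """Split 'DNA2' into ('DNA', 2); fail if there is no trailing number."""
    match = re.fullmatch(r"(.*?)(\d+)", field)
    if match is None:
        raise ValueError(f"no trailing counter in {field!r}")
    return match.group(1), int(match.group(2))


def parse_library_id(name: str) -> LibraryId:
    """Parse '<donor_id>-<sample_id>-<extract><n>-<library><n>' (hypothetical helper)."""
    # Split from the right, so a donor id containing hyphens stays in one piece.
    parts = name.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 hyphen-separated fields in {name!r}")
    donor, sample, extract, library = parts
    extract_type, extract_no = _split_counter(extract)
    library_type, library_no = _split_counter(library)
    return LibraryId(donor, sample, extract_type, extract_no, library_type, library_no)


# 'P001' is a made-up donor id.
print(parse_library_id("P001-T1-DNA2-WES1"))
print(parse_library_id("P001-T1-DNA1-WES2"))
```

With the parts separated, a new extract (DNA2-WES1) can be told apart from a new library on the same extract (DNA1-WES2) by comparing `extract_no` and `library_no`.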

Current situation

SODAR allows any sample, extract & library id, in particular <donor_id>-<sample_id>-DNA2-WES1 in its ISATAB description & in the directory structure.
cubi-tk sodar pull-raw-data downloads only one library per sample and per assay. It means that if both <donor_id>-<sample_id>-DNA1-WES1 and <donor_id>-<sample_id>-DNA2-WES1 are present in SODAR, only one of them will be downloaded (I am not sure which one).
snappy silently assigns the sample id to <donor_id>-<sample_id>-DNA1-WES1, regardless of the input folder name.
Because of the behaviour above, results obtained from <donor_id>-<sample_id>-DNA2-WES1 could be uploaded to <donor_id>-<sample_id>-DNA1-WES1 by cubi-tk.

Points for discussion

The current situation can lead to mistakes (although the number of such cases is quite marginal). What is the best way to highlight potential problems to the user?
There is a case for keeping the data from both sequencing runs in SODAR, because results from the first sequencing may have gone to the clinicians, and I think it is important to keep track of them. Besides, duplicate sequencing can be useful when benchmarking methods.
Am I overlooking anything, either in snappy or cubi-tk, that would address the issue?
Guidelines to deal with these problems?
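
One possible way to highlight the problem to the user would be to warn whenever several libraries differ only in their extract or library counters. The sketch below is not existing cubi-tk or snappy functionality; `sample_key` and `find_resequenced` are hypothetical helpers, `P001` is a made-up donor id, and the code assumes the `<donor_id>-<sample_id>-<extract><n>-<library><n>` naming described above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_key(library_name: str) -> str:
    """Drop the extract and library counters: 'D-T1-DNA2-WES1' -> 'D-T1-DNA-WES'."""
    head, extract, library = library_name.rsplit("-", 2)
    return "-".join(
        [head, re.sub(r"\d+$", "", extract), re.sub(r"\d+$", "", library)]
    )


def find_resequenced(library_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group libraries that differ only in their counters; keep groups of 2 or more."""
    groups = defaultdict(list)
    for name in library_names:
        groups[sample_key(name)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


libraries = [
    "P001-T1-DNA1-WES1",
    "P001-T1-DNA2-WES1",  # second extract of the same sample
    "P001-N1-DNA1-WES1",
]
for key, names in find_resequenced(libraries).items():
    print(f"WARNING: {key} has several libraries: {', '.join(names)}")
```

A check of this kind could run before download or upload, so that the user has to pick a library explicitly instead of getting one silently.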


ericblanc20 · 2021-09-28T16:24:10Z

Example

SODAR project UUID: 1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6
Assay: transcriptome profiling, UUID c7e98062-8f07-4427-b3d3-a780c7226e8f

In this project, the first tumor sample of donor 10_LR has been re-sequenced. Libraries 10_LR-T1-RNA1-mRNA_seq1 & 10_LR-T1-RNA2-mRNA_seq1 are both present in SODAR.

`cubi-tk` download

Download command

cubi-tk sodar pull-raw-data \
    --assay c7e98062-8f07-4427-b3d3-a780c7226e8f \
    1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6 \
    downloaded_from_SODAR

List of files for tumor samples of donor 10_LR:

(sak) [blance_c@med0624 cubi-tk_tst]$ ls -d downloaded_from_SODAR/10_LR-T*
downloaded_from_SODAR/10_LR-T1-RNA2-mRNA_seq1  downloaded_from_SODAR/10_LR-T2-RNA1-mRNA_seq1  downloaded_from_SODAR/10_LR-T3-RNA1-mRNA_seq1

The files for sample 10_LR-T1-RNA1-mRNA_seq1 are missing, and the fastq files for 10_LR-T1 are indeed those for 10_LR-T1-RNA2-mRNA_seq1 (no mixup between RNA1 & RNA2 at this level).

Running the `snappy` pipeline

Using the cancer sample sheet below:

[Metadata]				
schema	matched_cancer			
schema_version	v1			
title	Re-sequencing
description	Example of re-sequencing same data (different extracts)
				
[Data]				
patientName	sampleName	isTumor	libraryType	folderName
10_LR	T1	Y	mRNA_seq	10_LR-T1-RNA2-mRNA_seq1
10_LR	T2	Y	mRNA_seq	10_LR-T2-RNA1-mRNA_seq1
10_LR	T3	Y	mRNA_seq	10_LR-T3-RNA1-mRNA_seq1

snappy renames library 10_LR-T1-RNA2-mRNA_seq1 to 10_LR-T1-RNA1-mRNA_seq1 (here shown for the hla_typing step, but an identical behaviour is observed with ngs_mapping):

(snappy) [blance_c@med0624 hla_typing]$ snappy-snake --cores 4 > /dev/null 2>&1
(snappy) [blance_c@med0624 hla_typing]$ ls output
optitype.10_LR-T1-RNA1-mRNA_seq1  optitype.10_LR-T2-RNA1-mRNA_seq1  optitype.10_LR-T3-RNA1-mRNA_seq1

The results for sample 10_LR-T1 are named after library 10_LR-T1-RNA1-mRNA_seq1, but they were actually obtained from library 10_LR-T1-RNA2-mRNA_seq1.

Note that the problem doesn't appear to show up when the generic biomedsheet schema is used.
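
The renaming could be spotted before running the pipeline with a check along these lines. This is only a sketch: the rule "extract and library counters are reset to 1" is inferred from the output observed above, not taken from the snappy code, and `reset_counters` and `renamed_rows` are hypothetical helpers. The sheet text is the [Data] section of the sample sheet above.

```python
import csv
import re
from typing import List, Tuple

SHEET = """\
[Data]
patientName\tsampleName\tisTumor\tlibraryType\tfolderName
10_LR\tT1\tY\tmRNA_seq\t10_LR-T1-RNA2-mRNA_seq1
10_LR\tT2\tY\tmRNA_seq\t10_LR-T2-RNA1-mRNA_seq1
10_LR\tT3\tY\tmRNA_seq\t10_LR-T3-RNA1-mRNA_seq1
"""


def reset_counters(folder_name: str) -> str:
    """Set the extract and library counters to 1 (assumed renaming rule)."""
    head, extract, library = folder_name.rsplit("-", 2)
    return "-".join(
        [head, re.sub(r"\d+$", "1", extract), re.sub(r"\d+$", "1", library)]
    )


def renamed_rows(sheet_text: str) -> List[Tuple[str, str]]:
    """Return (folderName, assumed output name) for rows where the two differ."""
    lines = sheet_text.splitlines()
    # The table starts on the line after the [Data] marker.
    start = next(i for i, line in enumerate(lines) if line.strip() == "[Data]") + 1
    reader = csv.DictReader(lines[start:], delimiter="\t")
    mismatches = []
    for row in reader:
        folder = row["folderName"]
        assigned = reset_counters(folder)
        if assigned != folder:
            mismatches.append((folder, assigned))
    return mismatches


for folder, assigned in renamed_rows(SHEET):
    print(f"WARNING: input folder {folder} would be reported as {assigned}")
```

For the sheet above, only the 10_LR T1 row is flagged, which matches the library whose output name differs from its input folder.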

eudesbarbosa · 2022-09-09T08:38:59Z

Changes in snappy pull-raw-data (#118) partially address this, as they clarify which files are being transferred (see below). The catch is that it requires a snappy project and #120 to be fixed:

Command:

cubi-tk snappy pull-raw-data \
           --assay-uuid c7e98062-8f07-4427-b3d3-a780c7226e8f \
           1139b0ad-c6e4-4cc3-9d78-5f347f5e4bb6

Directory structure:

`-- <SAMPLE_ID>-T1-RNA1-mRNA_seq1
    |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N1-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N2-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-01-28
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R2.fastq.gz
    |           `-- <SAMPLE_ID>-N3-RNA1-mRNA_seq1_002_R2.fastq.gz.md5
    |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1
    |   `-- raw_data
    |       `-- 2020-02-14
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R2.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_001_R2.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R1.fastq.gz
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R1.fastq.gz.md5
    |           |-- <SAMPLE_ID>-N4-RNA1-mRNA_seq1_002_R2.fastq.gz

ericblanc20 assigned holtgrewe, mbenary, ericblanc20 and eudesbarbosa Jul 15, 2021

This was referenced Sep 8, 2022

Restrict snappy pull-raw-data to FASTQ files #118

Merged

Error while parsing optional columns in snappy pull-sheets #120

Open
