The implementation of compactors used in Tabula Sapiens Smart-Seq 2 analysis.
Compactors are seed-based contigs constrained to the FASTQ read length. NOMAD-called anchors are taken as input, and one or more compactors are generated per anchor to represent all sequence diversity downstream of that anchor.
This script takes as input the results of the Salzman Lab's Nextflow implementation of NOMAD. The user specifies:
- A path to the samplesheet used as input to NOMAD.
- A path to the NOMAD results directory.
Pull this repository to your machine, make the above changes, and call the script with a single argument that specifies the run name; for example, submit it with sbatch and pass 'test_compactors' as the run name.
It is recommended to verify that the column names referenced on lines 55 and 56 agree with the relevant column names in the native NOMAD summary.tsv output.
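The column-name check above can be done by hand, or with a few lines of Python. The following is a minimal sketch only: the function name missing_columns and the column names in the example are hypothetical, and the real names must be taken from the script and from your own summary.tsv.

```python
import csv
import io


def missing_columns(tsv_handle, expected_columns):
    """Return the expected column names that are absent from the TSV header."""
    # Read only the first row (the header) of the tab-separated file.
    header = next(csv.reader(tsv_handle, delimiter="\t"), [])
    return [name for name in expected_columns if name not in header]


# Example with an in-memory file; the column names here are placeholders.
example = io.StringIO("anchor\tscore\nAAAC\t1\n")
print(missing_columns(example, ["anchor", "pvalue"]))
```

To check a real file, open summary.tsv with open(path, newline="") and pass the handle in place of the in-memory example. An empty list means every expected column is present.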
This code requires the following:
- Python 3.9
- NumPy 1.20.3
- Pandas 1.3.1
- BioPython 1.79

fastqs_to_process_in_parallel is used to parallelize jobs parsing samplesheet FASTQs for reads containing NOMAD-called anchors. If you have a few very large FASTQs (tens of GB), it is recommended to set fastqs_to_process_in_parallel to 1. If you have many FASTQs (hundreds, as with the Tabula Sapiens SS2 data), it is recommended to set fastqs_to_process_in_parallel to 10. This parameter can be found on line 10 of the relevant script.

anchors_to_process_in_parallel is used to parallelize jobs generating compactors from anchor-specific intermediate files (sets of reads containing an anchor). We recommend setting anchors_to_process_in_parallel to a value such that {total anchors input to compactor generation} / anchors_to_process_in_parallel < 1500. This parameter can be found on line 10 of the relevant script.

Two parameters control the number of reads collected per anchor in each FASTQ. By default, we take 200 anchor-reads per FASTQ; this default can be modified on line 10 of the relevant script. By default, we also permit 1 million reads to be selected per FASTQ; this can be modified on line 30 of the relevant script.
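
The guideline for anchors_to_process_in_parallel and the two read limits can be sketched in Python as follows. This is an illustration, not the repository's code: the function names, the plain substring matching of anchors, and the reading of the 1 million limit as a cap on the total reads selected per FASTQ are all assumptions.

```python
def min_anchors_in_parallel(total_anchors, max_anchors_per_job=1500):
    """Smallest integer p such that total_anchors / p < max_anchors_per_job."""
    # p must be strictly greater than total_anchors / max_anchors_per_job.
    return total_anchors // max_anchors_per_job + 1


def collect_anchor_reads(reads, anchors,
                         max_reads_per_anchor=200,
                         max_reads_per_fastq=1_000_000):
    """Collect reads containing each anchor from one FASTQ, subject to two caps.

    reads   : iterable of read sequences (strings) from a single FASTQ
    anchors : list of anchor sequences (strings)
    """
    collected = {anchor: [] for anchor in anchors}
    total_selected = 0
    for read in reads:
        for anchor in anchors:
            # Assumed interpretation: stop once the per-FASTQ total is reached.
            if total_selected >= max_reads_per_fastq:
                return collected
            # Skip anchors that already hit their per-FASTQ limit.
            if len(collected[anchor]) >= max_reads_per_anchor:
                continue
            if anchor in read:
                collected[anchor].append(read)
                total_selected += 1
    return collected


# Example: 3000 anchors need at least 3 parallel groups to stay under 1500 each.
print(min_anchors_in_parallel(3000))
```

With real data, the read sequences could come from Bio.SeqIO.parse(handle, "fastq"), passing str(record.seq) for each record.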