Compactor Generation

The implementation of compactors used in Tabula Sapiens Smart-Seq 2 analysis.

Compactors are seed-based contigs constrained to FASTQ read-length. NOMAD-called anchors are taken as input and >= 1 compactors are generated to represent all sequence diversity downstream of the anchor.

Usage

This script takes as input the results of the Salzman Lab's NextFlow implementation of NOMAD. In aspen.sh, the user specifies:

A path to the samplesheet used input to NOMAD.
A path to the NOMAD results directory.

Pull this repository to your machine, make the above changes to aspen.sh, and call aspen.sh with a single argument to specify the run name: we might do sbatch aspen.sh 'test_compactors'.

It is recommended to verify that the column names on lines 55 and 56 of submit_parse.py agree with the relevant column names in the native NOMAD summary.tsv output.

Dependencies

This code requires the following:

Python 3.9
NumPy 1.20.3
Pandas 1.3.1
BioPython 1.79

Parallelization tuning

submit_parse.py is used to parallize jobs parsing samplesheet FASTQs for reads containing NOMAD-called anchors. In cases where the user has a few very large FASTQs (10s of GB), it is recommended to set fastqs_to_process_in_parallel to 1. In cases where the user has many FASTQs (100s, as in the case with Tabula Sapiens SS2 data), it is recommended to set fastqs_to_process_in_parallel to 10. This parameter can be found on line 10 of submit_parse.py.

parallel_generate.py is used to parallelize jobs generating compactors from anchor-specific intermediate files (sets of reads containing an anchor). We recommend that the user set anchors_to_process_in_parallel to a value such that {total anchors input to compactor generation} / anchors_to_process_in_parallel < 1500. This parameter can be found on line 10 of parallel_generate.py.

FASTQ downsampling

make_intermediates.py uses 2 parameters to control the amount of reads we collect per anchor in each FASTQ. By default, we take 200 anchor-reads per FASTQ; this default can be modified on line 10 of make_intermediates.py. By default, we also permit 1 million reads to be selected per FASTQ; this can be modified on line 30 of make_intermediates.py.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
LICENSE		LICENSE
README.md		README.md
aspen.py		aspen.py
aspen.sh		aspen.sh
make_intermediates.py		make_intermediates.py
parallel_generate.py		parallel_generate.py
submit_parse.py		submit_parse.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Compactor Generation

Usage

Dependencies

Parallelization tuning

FASTQ downsampling

About

Releases

Packages

Languages

License

salzman-lab/compactor

Folders and files

Latest commit

History

Repository files navigation

Compactor Generation

Usage

Dependencies

Parallelization tuning

FASTQ downsampling

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages