6. Input and output data
These options are always mandatory: --contigs and --reference. --contigs should point to the path of a FASTA file with one or more DNA sequences and --reference should point to the path of a FASTA file with one or more amino acid sequences. It doesn't matter whether the sequences were aligned or not since gap characters (-) are removed before the alignment step.
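To illustrate what gap removal means for an aligned input, here is a minimal sketch that strips gap characters from sequence lines while leaving identifier lines untouched. The function name strip_gaps is hypothetical, and this is not Patchwork's own code: Patchwork performs this step internally, so you do not need to run anything like this yourself.

```shell
# Print FILE with every "-" removed from the sequence lines.
# Identifier lines (those starting with ">") are printed unchanged,
# because a hyphen in a sequence name is not a gap.
# Usage: strip_gaps FILE
strip_gaps() {
    awk '/^>/ { print; next } { gsub(/-/, ""); print }' "$1"
}
```

Because the result is written to standard output, the input file itself is left unmodified.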
Each sequence identifier line should be annotated with a species name in the following format: <species><delimiter><rest>. The default species delimiter is an at sign (@), so the format becomes <species>@<rest>. For example, D_melanogaster@16S is a valid sequence identifier.
To add the species name D_melanogaster to a set of sequences in a file named contigs.fa, you can use sed:
Tip: remove -i to show the results before committing to the replacement.
sed -i 's/^>/>D_melanogaster@/' contigs.fa
Reverting to the unannotated format is as easy as typing the following into your command line:
sed -i 's/^>D_melanogaster@/>/' contigs.fa
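Before running Patchwork, it can be useful to confirm that every identifier line actually contains the species delimiter. The following is a minimal sketch of such a check. The function name find_unannotated_headers is hypothetical, and the check only tests whether the delimiter is present somewhere in the identifier line, which is a simplification of the <species><delimiter><rest> format.

```shell
# Print every identifier line in FILE that lacks the species delimiter.
# DELIMITER defaults to "@". Nothing is printed when all identifier
# lines contain it.
# Usage: find_unannotated_headers FILE [DELIMITER]
find_unannotated_headers() {
    # First grep keeps identifier lines only; second grep drops those
    # that contain the delimiter (matched as a fixed string, not a regex).
    grep '^>' "$1" | grep -v -F -- "${2:-@}" || true
}
```

Running find_unannotated_headers contigs.fa on a fully annotated file prints nothing; any line it does print still needs a species name.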
If you provide a directory as an input, Patchwork will look for all of the FASTA files within that directory (but not in the subdirectories). These are the filetype extensions that are recognized by Patchwork: .fa, .fas, .fasta, .fna, .faa, .fsa, .ffn, and .frn.
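To preview which files in a directory carry one of the extensions listed above, something like the following sketch can be used. The function name list_fasta_files is hypothetical, and the sketch only mirrors the documented behavior (top level only, fixed extension list). It does not call Patchwork, and whether Patchwork matches extensions case-sensitively is not stated here, so the sketch assumes lowercase extensions.

```shell
# List regular files directly inside DIR (no recursion) whose names end
# in one of the FASTA extensions listed above, sorted by path.
# Usage: list_fasta_files DIR
list_fasta_files() {
    find "$1" -maxdepth 1 -type f \( \
        -name '*.fa' -o -name '*.fas' -o -name '*.fasta' -o -name '*.fna' -o \
        -name '*.faa' -o -name '*.fsa' -o -name '*.ffn' -o -name '*.frn' \
    \) | sort
}
```

A file such as sub/extra.fa inside a subdirectory is not listed, which matches the note above that subdirectories are skipped.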
If you do not specify an output directory using the --output-dir flag, the output files will be saved to a folder called patchwork_output. Here is an overview of the output directory:
patchwork_output/
├── database.dmnd
├── diamond_blastx.log
├── diamond_makedb.log
├── diamond_out
│ ├── ...
│ └── SEQUENCE_NAME.tsv
├── dna_query_sequences
│ ├── ...
│ └── SEQUENCE_NAME.fas
├── plots
│ ├── percent_identity.png
│ └── query_coverage.png
├── query_sequences
│ ├── ...
│ └── SEQUENCE_NAME.fas
├── sequence_stats
│ ├── average.csv
│ └── statistics.csv
├── trimmed_alignments.txt
└── untrimmed_alignments.txt
Here is a short description of each of these files and directories:
- database.dmnd is the database file that was created by DIAMOND
- diamond_blastx.log contains the DIAMOND output in plain text
- diamond_makedb.log contains the DIAMOND output for the database construction in plain text
- diamond_out contains the result for each individual reference sequence
- dna_query_sequences contains all the merged, non-translated (i.e. DNA) query sequences in FASTA format
- plots contains basic visualizations of result quality (plots/percent_identity.png and plots/query_coverage.png)
- query_sequences contains all the merged and translated query sequences in FASTA format
- sequence_stats contains basic statistics for each individual search (sequence_stats/statistics.csv) and for everything together (sequence_stats/average.csv)
- trimmed_alignments.txt and untrimmed_alignments.txt contain a visual representation of each alignment (trimmed and untrimmed, respectively)
The following is an example of the structure of a [trimmed_|untrimmed_]alignments.txt file:
1. -----------------------------------------------------------------------------
Reference ID: Helobdella_robusta@366936at33208_6412_0:004149
Reference Length: 294
Query Length: 294
Contigs: 320
Matches: 193
Mismatches: 101
Deletions: 0
Occupancy: 1.0
seq: 1 VEEYEKLERIGEGTYGVVYKAKNVKTNTLVALKRGRFDNEEEGVPGTAIREISLLEALEH 60
||||| ||||||| | |||| | ||||| | | |||| | ||| || | |
ref: 1 MQKYEKLEKIGEGTYGTVFKAKNRETQEIVALKRVRLDDDDEGVPSSALREICLLKELNH 60
seq: 61 PNIVTLQDVIETEKKIYLVFEYLTMDLKKYMDALNGELPPDTVKTFLFQLLRGLAYCHAR 120
||| | || ||| ||||| ||||| | ||| ||||| | ||| ||| || |
ref: 61 KNIVRLCDVLHSEKKLTLVFEYSDQDLKKYFDSCNGEIDPDTVKSFMYQLLKGLAFCHGR 120
seq: 121 RILHRDLKPQNLLINKNGELKLADFGLARAFGVPVRCYTHEVVTLWYRAPEVLLQDKLYT 180
|||||||||||||||||||||||||||||| ||||| |||||||| | || |||
ref: 121 NVLHRDLKPQNLLINKNGELKLADFGLARAFGIPVRCYSAEVVTLWYRPPDVLFGAKLYS 180
seq: 181 TSIDLWSVGCIFGELANAGRPLWPGNDISDECKNIIKLLGTPTDDTWPEGYQLSQLKPYP 240
|||| || |||| ||||||||| |||| | | | ||||||| |||| || ||||
ref: 181 TSIDMWSAGCIFAELANAGRPLFPGNDVDDQLKRIFKLLGTPTEDTWPGFTQLPEYKPYP 240
seq: 241 LFESLTEKLQIVPFIENNFTSFLLRLLTYNPQKRITASDALNHPYFSELNANVK 294
| | | ||||| || || || | | | | |||| ||| |
ref: 241 LYPSSTNWLQIVPKLNSKGRDLLLSLLVCNPSQRMGADDSMKHSYFSEMNANLK 294
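The summary fields at the top of each record lend themselves to quick inspection with standard tools. The sketch below prints, for each record, the reference ID and the fraction of aligned positions that are matches, computed as Matches / (Matches + Mismatches + Deletions). Both the function name match_fractions and this particular ratio are assumptions made for illustration; the ratio is not necessarily how Patchwork computes the percent identity it reports. The sketch also assumes that the fields appear in the order shown in the example above.

```shell
# For every record in an alignments file, print "<reference ID><TAB><fraction>".
# Usage: match_fractions FILE
match_fractions() {
    # LC_ALL=C keeps the decimal separator a period regardless of locale.
    LC_ALL=C awk '
        /^[ \t]*Reference ID:/ {
            # The ID itself may contain ":", so strip the label instead
            # of splitting the line on colons.
            id = $0
            sub(/^[ \t]*Reference ID:[ \t]*/, "", id)
        }
        /^[ \t]*Matches:/    { n_match = $2 }
        /^[ \t]*Mismatches:/ { n_mismatch = $2 }
        /^[ \t]*Deletions:/  {
            # Deletions is the last of the three counts in each record.
            total = n_match + n_mismatch + $2
            if (total > 0) printf "%s\t%.3f\n", id, n_match / total
        }
    ' "$1"
}
```

For the record shown above, this gives 193 / (193 + 101 + 0) = 193 / 294, which is printed as 0.656.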