6. Input and output data

Clara Köhne edited this page Jan 9, 2023 · 15 revisions

Input data

Two options are always required: --contigs and --reference. --contigs should be the path to a FASTA file with one or more DNA sequences, and --reference should be the path to a FASTA file with one or more amino acid sequences. It does not matter whether the sequences are aligned, since gap characters (-) are removed before the alignment step.
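
As an illustration of why pre-aligned input is acceptable, the following minimal Python sketch strips gap characters from a sequence. The function name remove_gaps is hypothetical and this is not Patchwork's own code; it only mirrors the behavior described above.

```python
def remove_gaps(sequence: str, gap: str = "-") -> str:
    """Return the sequence with every gap character removed."""
    return sequence.replace(gap, "")


aligned = "ATG--CGT-A"
print(remove_gaps(aligned))  # ATGCGTA
```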

Species annotation

Each sequence identifier line should be annotated with a species name in the following format: <species><delimiter><rest>. The default species delimiter is an at sign @, so the format becomes <species>@<rest>. For example, D_melanogaster@16S is a valid sequence identifier format.
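
The identifier format can be sketched in Python as follows. The helper split_identifier is hypothetical, and splitting at the first occurrence of the delimiter is an assumption; Patchwork's actual parsing may differ.

```python
from typing import Tuple


def split_identifier(identifier: str, delimiter: str = "@") -> Tuple[str, str]:
    """Split '<species><delimiter><rest>' into (species, rest)."""
    # Drop a leading '>' so that raw FASTA header lines are accepted too.
    identifier = identifier.lstrip(">").strip()
    # Assumption: the species name ends at the FIRST delimiter.
    species, found, rest = identifier.partition(delimiter)
    if not found:
        raise ValueError(f"no {delimiter!r} delimiter in {identifier!r}")
    return species, rest


print(split_identifier("D_melanogaster@16S"))  # ('D_melanogaster', '16S')
```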

To add the species name D_melanogaster to every sequence in a file named contigs.fa, you can use sed:

Tip: omit -i to preview the result before modifying the file in place.

sed -i 's/^>/>D_melanogaster@/' contigs.fa

To revert to the unannotated format, run the following command:

sed -i 's/^>D_melanogaster@/>/' contigs.fa
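
The same two operations can be sketched in Python, which may be convenient where sed -i behaves differently (BSD and macOS versions of sed expect a backup suffix after -i). The function names annotate_headers and strip_annotation are hypothetical and not part of Patchwork; the example works on a list of lines rather than on a file.

```python
def annotate_headers(lines, species, delimiter="@"):
    """Prefix every FASTA header line with '<species><delimiter>'."""
    prefix = ">" + species + delimiter
    return [prefix + line[1:] if line.startswith(">") else line for line in lines]


def strip_annotation(lines, species, delimiter="@"):
    """Undo annotate_headers() for the given species."""
    prefix = ">" + species + delimiter
    return [">" + line[len(prefix):] if line.startswith(prefix) else line for line in lines]


records = [">contig_1", "ATGCGT", ">contig_2", "TTGACA"]
annotated = annotate_headers(records, "D_melanogaster")
print(annotated[0])  # >D_melanogaster@contig_1
```

Applying strip_annotation to the annotated lines with the same species name returns the original records.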

Query contigs

Reference sequences

Providing a directory

If you provide a directory as input, Patchwork looks for all FASTA files directly within that directory (subdirectories are not searched). Patchwork recognizes the following file extensions: .fa, .fas, .fasta, .fna, .faa, .fsa, .ffn, and .frn.
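
To preview which files such a directory search would likely pick up, here is a minimal Python sketch. The function find_fasta_files is hypothetical, and the case-sensitive extension match is an assumption; Patchwork's own lookup may behave differently.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions listed above; matched case-sensitively (an assumption).
FASTA_EXTENSIONS = {".fa", ".fas", ".fasta", ".fna", ".faa", ".fsa", ".ffn", ".frn"}


def find_fasta_files(directory) -> List[Path]:
    """Return FASTA files directly inside `directory`, ignoring subdirectories."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in FASTA_EXTENSIONS
    )


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "contigs.fa").write_text(">a\nATG\n")
    (root / "notes.txt").write_text("not a FASTA file\n")
    (root / "nested").mkdir()
    (root / "nested" / "more.fasta").write_text(">b\nATG\n")
    print([p.name for p in find_fasta_files(root)])  # ['contigs.fa']
```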

Output data

If you do not specify an output directory with the --output-dir flag, the output files are saved to a folder called patchwork_output. Here is an overview of the output directory:

patchwork_output/
├── database.dmnd
├── diamond_blastx.log
├── diamond_makedb.log
├── diamond_out
│   ├── ...
│   └── SEQUENCE_NAME.tsv
├── dna_query_sequences
│   ├── ...
│   └── SEQUENCE_NAME.fas
├── plots
│   ├── percent_identity.png
│   └── query_coverage.png
├── query_sequences
│   ├── ...
│   └── SEQUENCE_NAME.fas
├── sequence_stats
│   ├── average.csv
│   └── statistics.csv
├── trimmed_alignments.txt
└── untrimmed_alignments.txt

Here is a short description of each of these files and directories:

  • database.dmnd is the database file that was created by DIAMOND
  • diamond_blastx.log contains the DIAMOND output in plain text
  • diamond_makedb.log contains the DIAMOND output for the database construction in plain text
  • diamond_out contains the result for each individual reference sequence
  • dna_query_sequences contains all the merged, non-translated (i.e. DNA) query sequences in FASTA format
  • plots contains basic visualizations of result quality (plots/percent_identity.png and plots/query_coverage.png)
  • query_sequences contains all the merged and translated query sequences in FASTA format
  • sequence_stats contains basic statistics for each individual search (sequence_stats/statistics.csv) and for all searches combined (sequence_stats/average.csv)
  • trimmed_alignments.txt and untrimmed_alignments.txt contain a visual representation of each alignment (trimmed and untrimmed, respectively)
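
After a run, a quick check that the entries listed above are present can be sketched in Python. The function missing_entries is hypothetical, and the list only covers the fixed names from the overview (the per-sequence files inside diamond_out, dna_query_sequences, and query_sequences are not checked).

```python
from pathlib import Path

# Fixed entries taken from the directory overview above.
EXPECTED_ENTRIES = [
    "database.dmnd",
    "diamond_blastx.log",
    "diamond_makedb.log",
    "diamond_out",
    "dna_query_sequences",
    "plots/percent_identity.png",
    "plots/query_coverage.png",
    "query_sequences",
    "sequence_stats/average.csv",
    "sequence_stats/statistics.csv",
    "trimmed_alignments.txt",
    "untrimmed_alignments.txt",
]


def missing_entries(output_dir="patchwork_output"):
    """Return the expected entries that do not exist under `output_dir`."""
    root = Path(output_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# An empty list means that every expected entry was found.
print(missing_entries("patchwork_output"))
```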

The following is an example of the structure of a [trimmed_|untrimmed_]alignments.txt file:

1. -----------------------------------------------------------------------------

Reference ID:        Helobdella_robusta@366936at33208_6412_0:004149
Reference Length:    294
Query Length:        294
Contigs:             320
Matches:             193
Mismatches:          101
Deletions:           0
Occupancy:           1.0

  seq:   1 VEEYEKLERIGEGTYGVVYKAKNVKTNTLVALKRGRFDNEEEGVPGTAIREISLLEALEH  60
              ||||| ||||||| | ||||  |   ||||| | |   ||||  | ||| ||  | |
  ref:   1 MQKYEKLEKIGEGTYGTVFKAKNRETQEIVALKRVRLDDDDEGVPSSALREICLLKELNH  60

  seq:  61 PNIVTLQDVIETEKKIYLVFEYLTMDLKKYMDALNGELPPDTVKTFLFQLLRGLAYCHAR 120
            ||| | ||   |||  |||||   ||||| |  |||  ||||| |  ||| ||| || |
  ref:  61 KNIVRLCDVLHSEKKLTLVFEYSDQDLKKYFDSCNGEIDPDTVKSFMYQLLKGLAFCHGR 120

  seq: 121 RILHRDLKPQNLLINKNGELKLADFGLARAFGVPVRCYTHEVVTLWYRAPEVLLQDKLYT 180
             |||||||||||||||||||||||||||||| |||||  |||||||| | ||   ||| 
  ref: 121 NVLHRDLKPQNLLINKNGELKLADFGLARAFGIPVRCYSAEVVTLWYRPPDVLFGAKLYS 180

  seq: 181 TSIDLWSVGCIFGELANAGRPLWPGNDISDECKNIIKLLGTPTDDTWPEGYQLSQLKPYP 240
           |||| || |||| ||||||||| ||||  |  | | ||||||| ||||   ||   ||||
  ref: 181 TSIDMWSAGCIFAELANAGRPLFPGNDVDDQLKRIFKLLGTPTEDTWPGFTQLPEYKPYP 240

  seq: 241 LFESLTEKLQIVPFIENNFTSFLLRLLTYNPQKRITASDALNHPYFSELNANVK 294
           |  | |  |||||         || ||  ||  |  | |   | |||| ||| |
  ref: 241 LYPSSTNWLQIVPKLNSKGRDLLLSLLVCNPSQRMGADDSMKHSYFSEMNANLK 294
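
The summary lines at the top of each alignment entry follow a simple "Key: value" layout, so they can be read programmatically. The following Python sketch is one possible way to do that, based only on the example above; parse_summary is a hypothetical helper, and the ratio computed at the end is an illustrative figure rather than a statistic defined by Patchwork.

```python
import re
from typing import Dict

# A summary line starts at column 0 with a label, a colon, and a single value.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*?):\s+(\S+)\s*$")


def parse_summary(block: str) -> Dict[str, str]:
    """Collect the 'Key: value' summary lines of one alignment entry."""
    summary = {}
    for line in block.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            summary[match.group(1)] = match.group(2)
    return summary


example = """\
Reference ID:        Helobdella_robusta@366936at33208_6412_0:004149
Reference Length:    294
Query Length:        294
Contigs:             320
Matches:             193
Mismatches:          101
Deletions:           0
Occupancy:           1.0

  seq:   1 VEEYEKLERIGEGTYGVVYKAKNVKTNTLVALKRGRFDNEEEGVPGTAIREISLLEALEH  60
"""

summary = parse_summary(example)
matches = int(summary["Matches"])
mismatches = int(summary["Mismatches"])
# Share of compared positions that are identical (illustrative only).
match_ratio = matches / (matches + mismatches)
print(summary["Reference ID"], round(match_ratio, 3))
```

The indented seq: and ref: lines are skipped because they do not start at the beginning of the line.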