File formats
It is strongly recommended to avoid sequence names containing any of the following characters: ( ) : , [ ] ;
These characters are part of the NEWICK syntax and will cause errors in phylogenetic analyses. Some programs also reject hyphens (-) or spaces in names.
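The recommendation above can be checked automatically before a file is written. The following is a minimal sketch; the function name `problematic_characters` and the `strict` switch are illustrative assumptions, not part of any particular program.

```python
# Characters reserved by the NEWICK syntax, as listed above.
NEWICK_RESERVED = set("(:,[]);")


def problematic_characters(name, strict=False):
    """Return the sorted characters in `name` that may cause trouble.

    With strict=True, hyphens and whitespace are flagged as well,
    since some programs reject them (hypothetical option).
    """
    flagged = {ch for ch in name if ch in NEWICK_RESERVED}
    if strict:
        flagged |= {ch for ch in name if ch == "-" or ch.isspace()}
    return sorted(flagged)


print(problematic_characters("seq1"))
print(problematic_characters("seq(1):a"))
print(problematic_characters("seq-1 a", strict=True))
```

An empty list means the name passed the check.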
Names must start with >
>seq1
ATCG
ACCC
>seq2
TCAT
AAAA
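As an illustration of the `>`-prefixed layout above (the familiar FASTA convention), here is a minimal reader sketch. The function name `parse_fasta` is an assumption made for this example, and the sketch simply joins all lines that follow a name until the next `>` line.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each name to its full sequence."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts; everything after '>' is the name.
            name = line[1:].strip()
            sequences[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            sequences[name].append(line)
    return {n: "".join(chunks) for n, chunks in sequences.items()}


example = ">seq1\nATCG\nACCC\n>seq2\nTCAT\nAAAA\n"
alignment = parse_fasta(example)
print(alignment)
```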
Sequences can be either interleaved or sequential. Unlike in the original PHYLIP format, sequence names are not restricted in length. The first line of the file must contain the number of sequences followed by a space and the alignment length. Optionally, the layout can be specified by appending a space and i (for interleaved) or s (for sequential). If no layout is specified, the file is assumed to be interleaved.
Interleaved format:
2 8 i
seq1 ATCG
seq2 TCAT
ACCC
AAAA
Sequential format:
2 8 s
seq1 ATCG
ACCC
seq2 TCAT
AAAA
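The two layouts above can be read with one small routine. This is a sketch under stated assumptions: the header is read as number of sequences, then alignment length, then an optional i or s flag (matching the examples), names contain no whitespace, and `parse_phylip` is a hypothetical name.

```python
def _compact(fragment):
    """Remove all whitespace from a sequence fragment."""
    return "".join(fragment.split())


def _split_name(line):
    """Split 'name sequence' on the first run of whitespace."""
    parts = line.split(None, 1)
    fragment = _compact(parts[1]) if len(parts) > 1 else ""
    return parts[0], fragment


def parse_phylip(text):
    """Parse interleaved or sequential PHYLIP-like text into a dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    ntax, nchar = int(header[0]), int(header[1])
    # Interleaved is the default when no layout flag is given.
    layout = header[2].lower() if len(header) > 2 else "i"
    names, seqs = [], []
    for row, line in enumerate(lines[1:]):
        if layout == "i":
            # Only the first block carries names; later rows cycle over taxa.
            starts_record = row < ntax
            target = row % ntax
        else:
            # A new record starts once the previous sequence is complete.
            starts_record = not seqs or len(seqs[-1]) >= nchar
            target = -1
        if starts_record:
            name, fragment = _split_name(line)
            names.append(name)
            seqs.append(fragment)
        else:
            seqs[target] += _compact(line)
    if len(names) != ntax or any(len(s) != nchar for s in seqs):
        raise ValueError("alignment does not match the header dimensions")
    return dict(zip(names, seqs))


interleaved = "2 8 i\nseq1 ATCG\nseq2 TCAT\nACCC\nAAAA\n"
sequential = "2 8 s\nseq1 ATCG\nACCC\nseq2 TCAT\nAAAA\n"
print(parse_phylip(interleaved))
print(parse_phylip(sequential))
```

Both calls should yield the same alignment, which is a convenient sanity check when converting between the two layouts.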
Sequences can be either interleaved or sequential. In the interleaved layout, sequence names cannot contain spaces.
Interleaved format:
#mega
Title: My interleaved alignment
#seq1 ATCG
#seq2 TCAT
#seq1 ACCC
#seq2 AAAA
Sequential format:
#mega
Title: My sequential alignment
#seq1
ATCG
ACCC
#seq2
TCAT
AAAA
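Because every name line starts with #, a single reader can cope with both MEGA layouts shown above. The sketch below makes several assumptions: names contain no spaces in either layout, every line between #mega and the first name line is header text and is ignored, and `parse_mega` is a hypothetical name.

```python
def parse_mega(text):
    """Parse interleaved or sequential MEGA-like text into a dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].lower() != "#mega":
        raise ValueError("the file is expected to start with #mega")
    sequences = {}
    current = None
    for line in lines[1:]:
        if line.startswith("#"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue  # a lone '#' carries no name
            current = parts[0]
            sequences.setdefault(current, [])
            if len(parts) > 1:
                # Interleaved layout: sequence data follows the name.
                sequences[current].append(parts[1])
        elif current is not None:
            # Sequential layout: data lines belong to the last name seen.
            sequences[current].append(line)
        # Lines before the first name (such as the title) are skipped.
    return {n: "".join("".join(c).split()) for n, c in sequences.items()}


interleaved = "#mega\nTitle: demo\n#seq1 ATCG\n#seq2 TCAT\n#seq1 ACCC\n#seq2 AAAA\n"
sequential = "#mega\nTitle: demo\n#seq1\nATCG\nACCC\n#seq2\nTCAT\nAAAA\n"
print(parse_mega(interleaved))
print(parse_mega(sequential))
```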
#seq1
ATCG
ACCC
#seq2
TCAT
AAAA
Names cannot contain spaces.
#seq1 ATCG 4
#seq2 TCAT 4
#seq1 ACCC 8
#seq2 AAAA 8
Each entry starts with > followed by a two-character sequence type, a semicolon, and the sequence name. The next line is the sequence description.
Each sequence ends with an asterisk (*).
Sequence type:
- P1 - Protein (complete)
- F1 - Protein (fragment)
- D1 - DNA (e.g. EMBOSS seqret output)
- DL - DNA (linear)
- DC - DNA (circular)
- RL - RNA (linear)
- RC - RNA (circular)
- N3 - tRNA
- N1 - Other functional RNA
- XX - Unknown
>DL;seq1
Sequence 1 description
ATCG
ACCC
*
>DL;seq2
Sequence 2 description
TCAT
AAAA
*
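The record structure described above (type and name line, description line, sequence lines, closing asterisk) could be read as follows. This is only a sketch: `parse_pir` is a hypothetical name, and the type codes are copied from the list above rather than from any official specification.

```python
# Two-character sequence types taken from the list above.
KNOWN_TYPES = {"P1", "F1", "D1", "DL", "DC", "RL", "RC", "N3", "N1", "XX"}


def parse_pir(text):
    """Parse PIR-like text into a list of record dicts."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not line.startswith(">"):
            raise ValueError("expected a line starting with '>'")
        seq_type, sep, name = line[1:].partition(";")
        if not sep or seq_type not in KNOWN_TYPES:
            raise ValueError(f"unrecognised header line: {line!r}")
        # The line right after the header is the description.
        description = next(lines, "").strip()
        chunks = []
        for seq_line in lines:
            seq_line = seq_line.strip()
            if seq_line.endswith("*"):
                # The asterisk terminates the sequence.
                chunks.append(seq_line[:-1])
                break
            chunks.append(seq_line)
        else:
            raise ValueError(f"sequence {name!r} has no terminating '*'")
        records.append({
            "type": seq_type,
            "name": name.strip(),
            "description": description,
            "sequence": "".join("".join(chunks).split()),
        })
    return records


example = (
    ">DL;seq1\nSequence 1 description\nATCG\nACCC\n*\n"
    ">DL;seq2\nSequence 2 description\nTCAT\nAAAA\n*\n"
)
records = parse_pir(example)
print(records)
```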
Names cannot contain spaces.
# STOCKHOLM 1.0
seq1 ATCG
seq2 TCAT
seq1 ACCC
seq2 AAAA
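In the Stockholm example above, each sequence is spread over several lines that repeat its name. A minimal reader therefore only needs to concatenate fragments per name. The sketch below assumes every data line has the form name, whitespace, fragment; it skips lines starting with # and the // terminator, and `parse_stockholm` is a hypothetical name.

```python
def parse_stockholm(text):
    """Concatenate the fragments of each named sequence."""
    sequences = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the header and markup lines, and the end marker.
        if not line or line.startswith("#") or line == "//":
            continue
        name, fragment = line.split(None, 1)
        sequences[name] = sequences.get(name, "") + "".join(fragment.split())
    return sequences


example = "# STOCKHOLM 1.0\nseq1 ATCG\nseq2 TCAT\nseq1 ACCC\nseq2 AAAA\n//\n"
print(parse_stockholm(example))
```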
Comments can be inserted using square brackets. Sequence names can contain spaces as long as they are between single or double quotes. The file must contain ntax (number of sequences) and nchar (alignment length) as specified below.
More information is available in this paper
#NEXUS
[That's a comment]
Begin taxa;
dimensions ntax=2;
taxlabels
'seq 1'
'seq 2'
;
end;
Begin characters;
dimensions nchar=8;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;
The following example is also valid:
#NEXUS
[That's a comment]
Begin data;
dimensions nchar=8 ntax=2;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;
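Both NEXUS variants above declare ntax and nchar, possibly in different blocks and in either order. The following sketch extracts them after stripping square-bracket comments. It is deliberately simplistic: it does not handle brackets inside quoted names, and `nexus_dimensions` is a hypothetical name.

```python
import re


def nexus_dimensions(text):
    """Return the ntax and nchar values declared in NEXUS text."""
    # Drop [comments] first so that they cannot be mistaken for settings.
    no_comments = re.sub(r"\[[^\]]*\]", "", text)
    dims = {}
    for key in ("ntax", "nchar"):
        match = re.search(rf"\b{key}\s*=\s*(\d+)", no_comments, re.IGNORECASE)
        if match:
            dims[key] = int(match.group(1))
    return dims


data_block = """#NEXUS
[That's a comment]
Begin data;
dimensions nchar=8 ntax=2;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;
"""
print(nexus_dimensions(data_block))
```

A file that lacks either value yields a dict without that key, which makes the missing declaration easy to report.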
Sequence names can contain spaces as long as they are between single or double quotes.
Sequence names cannot contain any of the following characters, as they are part of the NEWICK syntax: ( ) : , [ ] ;
Example
((taxon1:0.1,taxon2:0.2),taxon3:0.3);
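To see why the reserved characters matter, it helps to look at how leaf names are recognised in a NEWICK string. The sketch below is a simplification that ignores quoted names: it treats any label that follows an opening parenthesis or a comma as a leaf. The name `leaf_names` is hypothetical.

```python
import re

# A leaf label follows '(' or ',' and stops at any reserved character.
LEAF = re.compile(r"[(,]\s*([^(),:;\[\]\s]+)")


def leaf_names(newick):
    """Return the unquoted leaf names of a NEWICK string, in order."""
    # Remove [comments] so that their content is never read as a label.
    newick = re.sub(r"\[[^\]]*\]", "", newick)
    return LEAF.findall(newick)


print(leaf_names("((taxon1:0.1,taxon2:0.2),taxon3:0.3);"))
print(leaf_names("(seq,1:0.1,seq2:0.2);"))
```

In the second call, a name written as seq,1 is split into two leaves, seq and 1, because the comma is read as a separator.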
Sequence names can contain spaces as long as they are between single or double quotes.
Sequence names cannot contain any of the following characters, as they are part of the NEWICK syntax: ( ) : , [ ] ;
Comments can be inserted using square brackets.
More information is available in this paper
#NEXUS
[That's a comment]
Begin trees;
TREE tree1 = ((taxon1:[&rate=0.003]0.1,taxon2:[&rate=0.003]0.2),taxon3:[&rate=0.003]0.3);
end;
Another example using a translate block:
#NEXUS
[That's a comment]
Begin trees;
Translate
1 taxon1,
2 taxon2,
3 taxon3
;
TREE tree1 = ((1:[&rate=0.003]0.1,2:[&rate=0.003]0.2),3:[&rate=0.003]0.3);
end;
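A translate block maps short keys to full taxon names, so a reader has to substitute the keys back into the tree string. The following sketch shows one way to do this. It assumes unquoted names without commas, and the helper names `parse_translate` and `apply_translate` are hypothetical.

```python
import re

# A label follows '(' or ',' and stops at any reserved character.
LABEL = re.compile(r"([(,])([^(),:;\[\]\s]+)")


def parse_translate(block):
    """Turn the body of a translate block into a key-to-name dict."""
    table = {}
    for entry in block.strip().rstrip(";").split(","):
        key, name = entry.split(None, 1)
        table[key] = name.strip()
    return table


def apply_translate(tree, table):
    """Replace every label found in `table` with its full taxon name."""
    def swap(match):
        label = match.group(2)
        # Labels that are not keys of the table are left untouched.
        return match.group(1) + table.get(label, label)
    return LABEL.sub(swap, tree)


table = parse_translate("1 taxon1,\n2 taxon2,\n3 taxon3\n;")
tree = "((1:[&rate=0.003]0.1,2:[&rate=0.003]0.2),3:[&rate=0.003]0.3);"
print(apply_translate(tree, table))
```

Annotations in square brackets are left as they are, since the substitution only touches labels that directly follow an opening parenthesis or a comma.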