-
Notifications
You must be signed in to change notification settings - Fork 2
Exploring potentially novel transcripts
After combining sample assemblies with gffcompare (thousands at a time), the resulting combined.gtf
files can be explored for potentially novel transcripts.
Filtering combined.gtf
to keep only multi-exon transcripts found in at least 10 samples,
and having any of the class codes k,m,n,j,o,i,y,u,r (potentially novel)
gtf_flt -n10 -c kmnjoiyur dlpfc.combined.gtf > dlpfc.n10.kmnjoiyur.gtf
Map all these potentially novel transcripts to the reference transcript to isolate the novel junctions (last column):
trmap -J ~/work/ref/gencode.v32.annotation.gtf dlpfc.n10.kmnjoiyur.gtf | sort -k1,1 > dlpfc.n10.kmnjoiyur.jtable
The output columns are: 1) tID, 2) chr:strand:exons, 3) code|ref_transcripts, 4) num_ref_genes, 5)novel_junctions
Generate a transcript info table with gffread:
gffread --table @id,num_samples,@geneid,@numexons,@chr,@strand,@start,@end \
dlpfc.n10.kmnjoiyur.gtf| sort -k1,1 > dlpfc.n10.tinfo.tab
Columns of interests are: 1) tid, 2) num_samples, 3) XLOC, 4) num_exons, 5-8: genome location
Join the above output of gffread with the output of trmap -J, keeping only column 2,3,4 from the tinfo.tab file and columns 5, 2-4 from the jtable file:
echo "xtid,num_samples,xloc,num_ref_genes,chr,strand,num_exons,novel_juncs,ref_transcripts" | tr ',' '\t' | \
cat - <(join --check-order -t $'\t' -o '1.1 1.2 1.3 2.4 1.5 1.6 1.4 2.5 2.3' \
dlpfc.n10.tinfo.tab dlpfc.n10.kmnjoiyur.jtable) > dlpfc.novel_s10.tab