Skip to content

Exploring potentially novel transcripts

Geo Pertea edited this page Jan 28, 2021 · 2 revisions

Exploring potentially novel transcripts

After combining sample assemblies with gffcompare (thousands at a time), the resulting combined.gtf files can be explored for potentially novel transcripts.

Filtering combined.gtf to keep only multi-exon transcripts found in at least 10 samples, and having any of the class codes k,m,n,j,o,i,y,u,r (potentially novel)

gtf_flt -n10 -c kmnjoiyur dlpfc.combined.gtf > dlpfc.n10.kmnjoiyur.gtf

Map all these potentially novel transcripts to the reference transcript to isolate the novel junctions (last column):

trmap -J ~/work/ref/gencode.v32.annotation.gtf dlpfc.n10.kmnjoiyur.gtf | sort -k1,1 > dlpfc.n10.kmnjoiyur.jtable

The output columns are: 1) tID, 2) chr:strand:exons, 3) code|ref_transcripts, 4) num_ref_genes, 5)novel_junctions

Generate a transcript info table with gffread:

gffread --table @id,num_samples,@geneid,@numexons,@chr,@strand,@start,@end \
 dlpfc.n10.kmnjoiyur.gtf| sort -k1,1 > dlpfc.n10.tinfo.tab

Columns of interests are: 1) tid, 2) num_samples, 3) XLOC, 4) num_exons, 5-8: genome location

Join the above output of gffread with the output of trmap -J, keeping only column 2,3,4 from the tinfo.tab file and columns 5, 2-4 from the jtable file:

echo "xtid,num_samples,xloc,num_ref_genes,chr,strand,num_exons,novel_juncs,ref_transcripts" | tr ',' '\t' | \
cat  - <(join --check-order -t $'\t' -o '1.1 1.2 1.3 2.4 1.5 1.6 1.4 2.5 2.3' \
  dlpfc.n10.tinfo.tab dlpfc.n10.kmnjoiyur.jtable) > dlpfc.novel_s10.tab
Clone this wiki locally