You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on May 3, 2024. It is now read-only.
I've encountered and fixed a couple of bugs with make_file_for_subsampling_from_collapsed.py, and figured I could document them here since similar GitHub issues have helped me.
Bug 1: Overestimating gene counts for saturation with SQANTI classifications
I found that saturation curves I built using SQANTI classifications (--by refgene or refisoform) were largely overestimating gene counts compared to curves without them (--by pbgene or pbid). It turns out that for novel isoforms in novel genes, SQANTI is assigning unique gene labels regardless of whether they belong to the same pbgene. I'm not certain if it's an issue with SQANTI, my data, or maybe an undocumented change in one of the PacBio output files, but Cupcake uses the associated_gene column in the SQANTI classification file and ends up treating each novel isoform as its own gene during subsampling.
Solution
To fix it, I made a small change to the SQANTI classification parsing (for loop starting at line 63) to only use the associated_gene column for non-novel genes. For novel genes, it sets refgene to a format that matches the refisoform category more closely using the gene ID assigned by PacBio collapse; e.g. for refisoform "novel_PB.2.5" we'll have refgene "novel_PB.2":
forrinDictReader(open(sqanti_class), delimiter='\t'):
ifr['associated_transcript'] =='novel':
refisoform='novel_'+r['isoform']
else:
refisoform=r['associated_transcript']
# for novel genes, sets refgene to a sensible gene IDifr['associated_gene'].startswith("novelGene"):
# trim isoform number off isoform, use that to make novel refgenerefgene="novel_"+r['isoform'].rsplit(".",1)[0]
else: # if it's not novel, use the 'associated_gene' column to set refgenerefgene=r['associated_gene']
match_dict[r['isoform']] = {'refgene': refgene, # use refgene var instead of always using 'associated_gene''refisoform': refisoform,
'category': r['structural_category']}
Bug 2: Updates to isoseq collapse caused make_file_for_subsampling_from_collapse.py to report very low counts
Sometime after isoseq collapse v3.5.0 the format of the *.abundance.txt file was changed. The count_fl column is still present, but appears to have a different meaning than it used to. Instead, there's a new column named fl_assoc that looks like it contains counts that count_fl used to have. I couldn't find documentation on this change, but I did find that the *.abundance.txt headers have some ambiguous descriptions:
old count_fl: "Number of associated FL reads"
new count_fl: "Number of input FL reads associated with isoform"
fl_assoc: "Total number of FL reads associated with isosform"
Regardless of the new meaning, the new count_fl has much lower values including a lot of 1's, so the default --min_fl_count 2 when subsampling filters out most of the isoforms.
Solution
My solution is to check if *.abundance.txt has the fl_assoc column; if it does, it uses fl_assoc. Otherwise, it uses count_fl. This allows for backwards compatibility with files collapsed using older versions of isoseq. This change is in addition to the tweak from #111 (swapping the order of the "or" statement to fix an error when using the --include_single_exons flag). Here's the changes:
count_col_name=""# initialize count_col_nameforrinDictReader(f, delimiter='\t'):
ifcount_col_name=="": # if it hasn't been set yet, set itcount_col_name='fl_assoc'if'fl_assoc'inr.keys() else'count_fl'ifinclude_single_exonsorr['pbid'] ingood_ids:
to_write['all'][r['pbid']] =r[count_col_name]
I can't guarantee these will fix these problems indefinitely or in all cases, but this is what solved them for me. Hope this helps someone!
The text was updated successfully, but these errors were encountered:
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
I've encountered and fixed a couple of bugs with make_file_for_subsampling_from_collapsed.py, and figured I could document them here since similar GitHub issues have helped me.
Bug 1: Overestimating gene counts for saturation with SQANTI classifications
I found that saturation curves I built using SQANTI classifications (
--by refgene
orrefisoform
) were largely overestimating gene counts compared to curves without them (--by pbgene
orpbid
). It turns out that for novel isoforms in novel genes, SQANTI is assigning unique gene labels regardless of whether they belong to the samepbgene
. I'm not certain if it's an issue with SQANTI, my data, or maybe an undocumented change in one of the PacBio output files, but Cupcake uses theassociated_gene
column in the SQANTI classification file and ends up treating each novel isoform as its own gene during subsampling.Solution
To fix it, I made a small change to the SQANTI classification parsing (for loop starting at line 63) to only use the
associated_gene
column for non-novel genes. For novel genes, it sets refgene to a format that matches the refisoform category more closely using the gene ID assigned by PacBio collapse; e.g. for refisoform "novel_PB.2.5" we'll have refgene "novel_PB.2":Bug 2: Updates to
isoseq collapse
causedmake_file_for_subsampling_from_collapse.py
to report very low countsSometime after
isoseq collapse v3.5.0
the format of the*.abundance.txt
file was changed. Thecount_fl
column is still present, but appears to have a different meaning than it used to. Instead, there's a new column namedfl_assoc
that looks like it contains counts thatcount_fl
used to have. I couldn't find documentation on this change, but I did find that the*.abundance.txt
headers have some ambiguous descriptions:count_fl
: "Number of associated FL reads"count_fl
: "Number of input FL reads associated with isoform"fl_assoc
: "Total number of FL reads associated with isosform"Regardless of the new meaning, the new
count_fl
has much lower values including a lot of 1's, so the default--min_fl_count 2
when subsampling filters out most of the isoforms.Solution
My solution is to check if
*.abundance.txt
has thefl_assoc
column; if it does, it usesfl_assoc
. Otherwise, it usescount_fl
. This allows for backwards compatibility with files collapsed using older versions of isoseq. This change is in addition to the tweak from #111 (swapping the order of the "or" statement to fix an error when using the --include_single_exons flag). Here's the changes:Original (lines 85-87):
Fixed:
I can't guarantee these will fix these problems indefinitely or in all cases, but this is what solved them for me. Hope this helps someone!
The text was updated successfully, but these errors were encountered: