Hi, I'm trying to use SICILIAN on single-cell data for a project I am working on, and I'm trying to start with the latest Gencode annotation. I ran into an issue when creating the annotator files. I am running scripts/create_annotator.py on the Gencode Comprehensive gene annotation (ALL) GTF file like this:
python3 scripts/create_annotator.py -g ~/hg38_all/gtf/gencode.v45.chr_patch_hapl_scaff.annotation.gtf -a test
Very rapidly, I'm getting the following error:
Traceback (most recent call last):
File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 137, in <module>
main()
File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 122, in main
splices = get_splices(gtf_df)
File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 82, in get_splices
gtf_df["exon_number"] = gtf_df.apply(get_exon_number, axis=1)
File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
return op.apply().__finalize__(self, method="apply")
File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
return self.apply_standard()
File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
results, res_index = self.apply_series_generator()
File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
results[i] = self.func(v, *self.args, **self.kwargs)
File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 34, in get_exon_number
return int(
ValueError: invalid literal for int() with base 10: 'ENSE00002234944.1'
I dug a bit into create_annotator.py and found the source of the bug. Maybe the attributes column of the Gencode GFF is formatted differently from what you are expecting, but line 20 breaks on that GTF. Again, I'm not sure if this is a new feature of recent GTFs, but I can suggest a workaround that might be more portable in the future.
I'm not sure who owns the current official spec for GFF2 (Ensembl references GMOD via this dead link: http://gmod.org/wiki/GFF2...). From what I can gather, the attributes column is a semicolon-delimited list of key-value pairs. So the get_exon_number() function could be written by splitting on the semicolon and then converting to a dict, like this (OK, Gencode separates by a semicolon followed by a space, but that could be generalized later if other sources use only a semicolon):
get_exon_number()
def get_exon_number(row):
    try:
        if "exon_number" in row["attribute"]:
            return int(
                {
                    sub.split(" ")[0]: sub.split(" ")[1]
                    for sub in row["attribute"].split("; ")
                }["exon_number"]
            )
    except ValueError:
        print("exon_number did not contain a number!")
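For the "generalized later" case, here is a rough sketch of what a more tolerant version might look like. It is only an illustration, not the project's code: `parse_attributes` is a hypothetical helper name, and it assumes that attribute values never contain a semicolon inside the quotes. It accepts both a bare value (`exon_number 3;`, as in the Gencode file) and a quoted one (`exon_number "3";`), with or without a space after the semicolon.

```python
def parse_attributes(attribute):
    """Turn a GTF attribute string into a dict (hypothetical helper).

    Assumes no value contains a ';' inside its quotes.
    """
    attrs = {}
    for field in attribute.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the trailing ';'
        key, _, value = field.partition(" ")
        # Keep the first occurrence of repeated keys such as "tag".
        attrs.setdefault(key, value.strip().strip('"'))
    return attrs


def get_exon_number(row):
    """Return the exon number as an int, or None if absent or non-numeric."""
    value = parse_attributes(row["attribute"]).get("exon_number")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None
```

Looking up the key in a dict, rather than searching the raw string, should avoid picking up a neighbouring field such as exon_id by accident.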
Let me know if this is acceptable; I could submit it as a PR.