Error processing Gencode GTF #23

Open
mblanche opened this issue Apr 22, 2024 · 0 comments

Hi, I'm trying to use SICILIAN on single-cell data for a project I am working on, and I want to start with the latest Gencode annotation. I ran into an issue while creating the annotator files. I am running scripts/create_annotator.py on the Gencode Comprehensive gene annotation (ALL) GTF file like this:

python3 scripts/create_annotator.py -g ~/hg38_all/gtf/gencode.v45.chr_patch_hapl_scaff.annotation.gtf -a test

Almost immediately, I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 137, in <module>
    main()
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 122, in main
    splices = get_splices(gtf_df)
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 82, in get_splices
    gtf_df["exon_number"] = gtf_df.apply(get_exon_number, axis=1)
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 34, in get_exon_number
    return int(
ValueError: invalid literal for int() with base 10: 'ENSE00002234944.1'

I dug a bit into create_annotator.py and found the source of the bug. Maybe the attribute column in the Gencode GTF is formatted differently from what you are expecting, but line 20 breaks on that GTF. Again, I'm not sure if this is a new feature of recent GTFs, but I can suggest a workaround that might be more portable in the future.

I'm not sure who owns the current official spec for GFF2 (Ensembl references GMOD via this dead link: http://gmod.org/wiki/GFF2...). From what I can gather, the attribute column is a semicolon-delimited list of key-value pairs. So the get_exon_number() function could be written by splitting on the semicolon and converting the result to a dict, like this (OK, Gencode separates with a semicolon followed by a space, but that could be generalized later if other sources use only a semicolon):

get_exon_number()

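def get_exon_number(row):
    # Build a key -> value dict from the attribute column, then look up
    # exon_number by name. Returns None if the key is absent or not an integer.
    try:
        if "exon_number" in row["attribute"]:
            return int(
                {
                    # strip('";') drops surrounding quotes and a trailing semicolon
                    sub.split(" ")[0]: sub.split(" ")[1].strip('";')
                    for sub in row["attribute"].split("; ")
                    if " " in sub  # skip empty or malformed fragments
                }["exon_number"]
            )
    except ValueError:
        print("exon_number did not contain a number!")
    return None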
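
For reference, here is a rough sketch of the more general version hinted at above, one that accepts both "; " and a bare ";" as separators and strips optional quotes around values. The helper names parse_attributes and exon_number_from_attributes are hypothetical (they are not in create_annotator.py), and the sample string is a made-up line that only mimics the Gencode layout; the exon_id value is the one from the traceback.

```python
def parse_attributes(attribute):
    """Turn a GTF attribute string into a dict of key -> value strings."""
    parsed = {}
    for field in attribute.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the final semicolon
        # The key is everything before the first space; the rest is the value.
        key, _, value = field.partition(" ")
        parsed[key] = value.strip().strip('"')
    return parsed


def exon_number_from_attributes(attribute):
    """Return exon_number as an int, or None if it is absent or non-numeric."""
    value = parse_attributes(attribute).get("exon_number")
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


# Illustrative attribute string (not copied from the real GTF).
sample = 'gene_id "GENE1"; transcript_id "TX1"; exon_number 3; exon_id "ENSE00002234944.1"; level 2;'
print(exon_number_from_attributes(sample))
```

Two limitations of this sketch: a quoted value that itself contains a semicolon would be split incorrectly, and repeated keys (such as multiple tag entries) keep only the last value. Neither should affect exon_number, but it would matter if the dict were reused for other attributes.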

Let me know if this is acceptable; I could submit it as a PR.
