Error processing Gencode GTF #23

mblanche · 2024-04-22T20:27:24Z

Hi, I'm trying to use SICILIAN on single-cell data for a project I am working on, and I'd like to start with the latest Gencode annotation. I ran into an issue while creating the annotator files. I am running scripts/create_annotator.py on the Gencode Comprehensive gene annotation (ALL) GTF file like this:

python3 scripts/create_annotator.py -g ~/hg38_all/gtf/gencode.v45.chr_patch_hapl_scaff.annotation.gtf -a test

Almost immediately, I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 137, in <module>
    main()
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 122, in main
    splices = get_splices(gtf_df)
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 82, in get_splices
    gtf_df["exon_number"] = gtf_df.apply(get_exon_number, axis=1)
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/ubuntu/work/SICILIAN/.venv/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/ubuntu/work/SICILIAN/scripts/create_annotator.py", line 34, in get_exon_number
    return int(
ValueError: invalid literal for int() with base 10: 'ENSE00002234944.1'

I dug a bit into create_annotator.py and found the source of the bug. Maybe the attribute column of the Gencode GTF is formatted differently from what you are expecting, but line 20 breaks on that GTF. Again, I'm not sure whether this is a new feature of recent GTFs, but I can suggest a workaround that might be more portable in the future.

I'm not sure who owns the current official spec for GFF2 (Ensembl references GMOD through this dead link: http://gmod.org/wiki/GFF2...). From what I can gather, the attribute column is a semicolon-delimited list of key-value pairs. So get_exon_number() could be rewritten to split on the semicolon and then convert the pairs to a dict, like this (OK, Gencode separates with a semicolon followed by a space, but that could be generalized later if other sources use only a semicolon):

get_exon_number()

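def get_exon_number(row):
    # Return exon_number as an int, or None if it is missing or not numeric.
    try:
        if "exon_number" in row["attribute"]:
            return int(
                {
                    # Each entry looks like 'key value'; keep both parts.
                    sub.split(" ")[0]: sub.split(" ")[1]
                    for sub in row["attribute"].split("; ")
                }["exon_number"]
            )
    except ValueError:
        print("exon_number did not contain a number!")
    return None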
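A slightly more general sketch of the same idea, in case it helps the discussion. It splits on a bare semicolon and strips whitespace, so it should accept both "; " and ";" separators as well as the trailing semicolon, and it removes surrounding double quotes from values. `parse_attributes` is a hypothetical helper name, not something that exists in SICILIAN, and the example row is made up apart from the exon ID taken from the traceback. It assumes no quoted value contains a semicolon, and that `row["attribute"]` is a string.

```python
def parse_attributes(attribute):
    """Parse a GTF attribute string into a dict.

    Hypothetical helper. For repeated keys (e.g. several "tag" entries),
    the last value wins.
    """
    attributes = {}
    for field in attribute.split(";"):
        field = field.strip()
        if not field:
            continue  # empty piece left over after the trailing semicolon
        # Split only on the first space so values containing spaces survive.
        key, _, value = field.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def get_exon_number(row):
    """Return exon_number as an int, or None if it is missing or not numeric."""
    value = parse_attributes(row["attribute"]).get("exon_number")
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError: key absent (value is None); ValueError: not a number.
        return None


# Illustrative row only; the gene_id value is a placeholder.
row = {
    "attribute": 'gene_id "GENE1"; exon_number 1; '
                 'exon_id "ENSE00002234944.1"; level 2;'
}
print(get_exon_number(row))  # 1
```

Returning None rather than printing keeps the function quiet when it is applied row by row to a large GTF; whether that is the behaviour the rest of create_annotator.py expects is something I have not checked.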

Let me know if this is acceptable; I could submit it as a PR.
