Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save the 'entrez_gene' columns in character mode #28

Open
jokergoo opened this issue Mar 28, 2023 · 3 comments
Open

Save the 'entrez_gene' columns in character mode #28

jokergoo opened this issue Mar 28, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jokergoo
Copy link

First thanks for this great package! Especially it directly outputs three different gene ID types, which saves a lot of time when switching between different gene ID types.

I have a small suggestion. Here in the output table, columns related to "entrez_gene" are stored as integers. I would suggest to change to characters, as what other Bioconducror annotation package does (e.g. org.Hs.eg.db).

gene_sets
# A tibble: 8,209 × 15
   gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
   <chr>  <chr>   <chr>   <chr>     <int> <chr>   <chr>     <int> <chr>   <chr> <chr>   <chr>  
 1 H      ""      HALLMA… ABCA1        19 ENSG00… ABCA1        19 ENSG00… M5905 267710… ""     
 2 H      ""      HALLMA… ABCB8     11194 ENSG00… ABCB8     11194 ENSG00… M5905 267710… ""     
 3 H      ""      HALLMA… ACAA2     10449 ENSG00… ACAA2     10449 ENSG00… M5905 267710… ""     
 4 H      ""      HALLMA… ACADL        33 ENSG00… ACADL        33 ENSG00… M5905 267710… ""     
 5 H      ""      HALLMA… ACADM        34 ENSG00… ACADM        34 ENSG00… M5905 267710… ""     
 6 H      ""      HALLMA… ACADS        35 ENSG00… ACADS        35 ENSG00… M5905 267710… ""     
 7 H      ""      HALLMA… ACLY         47 ENSG00… ACLY         47 ENSG00… M5905 267710… ""     
 8 H      ""      HALLMA… ACO2         50 ENSG00… ACO2         50 ENSG00… M5905 267710… ""     
 9 H      ""      HALLMA… ACOX1        51 ENSG00… ACOX1        51 ENSG00… M5905 267710… ""     
10 H      ""      HALLMA… ADCY6       112 ENSG00… ADCY6       112 ENSG00… M5905 267710… ""     
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
#   gs_description <chr>, and abbreviated variable names ¹​gs_subcat, ²​gene_symbol,
#   ³​entrez_gene, ⁴​ensembl_gene, ⁵​human_gene_symbol, ⁶​human_entrez_gene, ⁷​human_ensembl_gene,
#   ⁸​gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (map) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:

map[gene_sets$entrez_gene]

This causes the problem because gene_sets$entrez_gene are integers and it is actually treated as numeric indices for the map vector, while not to match to the names in map.

To do it correctly, we need to explicitly convert gene_sets$entrez_gene to characters:

map[as.character(gene_sets$entrez_gene)]

The more severe consequence is, if the maximal numeric value in gene_sets$entrez_gene is smaller than the length of map, executing map[gene_sets$entrez_gene] actually will not generate any warning or error message. And it would generate wrong results silently.

@igordot
Copy link
Owner

igordot commented Mar 28, 2023

Thank you for the suggestion. I don't usually work with Entrez IDs. I didn't notice that Bioconductor annotation packages treat them as characters. I assume it's to be consistent with all other ID types. Do you know if they are treated as integers in other packages?

Since Entrez IDs have always been integers in msigdbr, I hope the change doesn't wouldn't any unintended consequences for older users.

@igordot igordot added the enhancement New feature or request label Mar 28, 2023
@jokergoo
Copy link
Author

Yeah, they are digits and R by default reads them as numbers.

I just have checked:

These standard bioc packages store Entrez IDs as characters:

  • org.*.db, such as org.Hs.eg.db
  • TxDb.*, such as TxDb.Hsapiens.UCSC.hg19.knownGene
  • reactome.db

These standard bioc packages store them as integers:

  • biomaRt
  • Orthology.eg.db

I guess there won't be any conflict to other code. People won't use it for math calculations, e.g. id1 + id2 or mean(entrez_id) :)

@igordot
Copy link
Owner

igordot commented Mar 29, 2023

Thank you for looking into it. It's reassuring to know that there is not a clear standard, so neither option is "wrong". However, org.*.db and TxDb.* packages are probably more authoritative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants