You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
First thanks for this great package! Especially it directly outputs three different gene ID types, which saves a lot of time when switching between different gene ID types.
I have a small suggestion. Here in the output table, columns related to "entrez_gene" are stored as integers. I would suggest to change to characters, as what other Bioconducror annotation package does (e.g. org.Hs.eg.db).
gene_sets
# A tibble: 8,209 × 15
gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
<chr> <chr> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 H "" HALLMA… ABCA1 19 ENSG00… ABCA1 19 ENSG00… M5905 267710… ""
2 H "" HALLMA… ABCB8 11194 ENSG00… ABCB8 11194 ENSG00… M5905 267710… ""
3 H "" HALLMA… ACAA2 10449 ENSG00… ACAA2 10449 ENSG00… M5905 267710… ""
4 H "" HALLMA… ACADL 33 ENSG00… ACADL 33 ENSG00… M5905 267710… ""
5 H "" HALLMA… ACADM 34 ENSG00… ACADM 34 ENSG00… M5905 267710… ""
6 H "" HALLMA… ACADS 35 ENSG00… ACADS 35 ENSG00… M5905 267710… ""
7 H "" HALLMA… ACLY 47 ENSG00… ACLY 47 ENSG00… M5905 267710… ""
8 H "" HALLMA… ACO2 50 ENSG00… ACO2 50 ENSG00… M5905 267710… ""
9 H "" HALLMA… ACOX1 51 ENSG00… ACOX1 51 ENSG00… M5905 267710… ""
10 H "" HALLMA… ADCY6 112 ENSG00… ADCY6 112 ENSG00… M5905 267710… ""
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
# gs_description <chr>, and abbreviated variable names ¹gs_subcat, ²gene_symbol,
# ³entrez_gene, ⁴ensembl_gene, ⁵human_gene_symbol, ⁶human_entrez_gene, ⁷human_ensembl_gene,
# ⁸gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (map) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:
map[gene_sets$entrez_gene]
This causes the problem because gene_sets$entrez_gene are integers and it is actually treated as numeric indices for the map vector, while not to match to the names in map.
To do it correctly, we need to explicitly convert gene_sets$entrez_gene to characters:
map[as.character(gene_sets$entrez_gene)]
The more severe consequence is, if the maximal numeric value in gene_sets$entrez_gene is smaller than the length of map, executing map[gene_sets$entrez_gene] actually will not generate any warning or error message. And it would generate wrong results silently.
The text was updated successfully, but these errors were encountered:
Thank you for the suggestion. I don't usually work with Entrez IDs. I didn't notice that Bioconductor annotation packages treat them as characters. I assume it's to be consistent with all other ID types. Do you know if they are treated as integers in other packages?
Since Entrez IDs have always been integers in msigdbr, I hope the change doesn't wouldn't any unintended consequences for older users.
Thank you for looking into it. It's reassuring to know that there is not a clear standard, so neither option is "wrong". However, org.*.db and TxDb.* packages are probably more authoritative.
First thanks for this great package! Especially it directly outputs three different gene ID types, which saves a lot of time when switching between different gene ID types.
I have a small suggestion. Here in the output table, columns related to "entrez_gene" are stored as integers. I would suggest to change to characters, as what other Bioconducror annotation package does (e.g. org.Hs.eg.db).
Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (
map
) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:This causes the problem because
gene_sets$entrez_gene
are integers and it is actually treated as numeric indices for themap
vector, while not to match to the names inmap
.To do it correctly, we need to explicitly convert
gene_sets$entrez_gene
to characters:The more severe consequence is, if the maximal numeric value in
gene_sets$entrez_gene
is smaller than the length ofmap
, executingmap[gene_sets$entrez_gene]
actually will not generate any warning or error message. And it would generate wrong results silently.The text was updated successfully, but these errors were encountered: