Methodology details, and write.gmt helper functions? #18

Open
dereckmezquita opened this issue Apr 17, 2022 · 3 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@dereckmezquita

Hi, I came across your package, which could potentially save me a lot of work, so thank you.

Could you publish the details of your method for converting from human to X species? I need this information in order to cite you in my research.

Also, would you consider adding helper functions to convert the data.frame output to a structure that can easily be written as a .gmt pathway file?

@igordot
Owner

igordot commented Apr 18, 2022

Thank you for your interest. The gene conversion happens in a separate package, babelgene. The vignette includes some background info, but let me know if anything is unclear. The code for pre-processing the data is available as well if you really want to dive deep.

There are a few different GMT writer functions available, such as cmapR::write_gmt, pathwayPCA::write_gmt, immcp::write_gmt, and rWikiPathways::writeGMT. I have not tried any of them, but I am not sure another function would solve any new problems.

@igordot igordot added enhancement New feature or request question Further information is requested labels Apr 18, 2022
@dereckmezquita
Author

dereckmezquita commented Apr 18, 2022

Thank you for that; I will look into babelgene.

And thank you for pointing out those GMT writer functions.

I've written one myself in the past. I suppose what I was really asking for is helper functions that select a collection (for example, hallmark), extract each gene set's name, description URL, and genes (in their original order), and put them into a structure that can then be written to a file as a GMT.

For example, convert the HALLMARK collection to a list of character vectors (one list element per pathway/gene set). This should be a list of 50 elements, as HALLMARK has only 50 pathways. Each element holds a character vector with the pathway (gene set) name first, then the description URL as in the standard GMT distributed by Broad, and then the genes.

This object could then be written line by line, using \t as the separator.
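The structure described above can be sketched roughly as follows. This is a minimal illustration on a toy data frame, not the package's own method: the column names `gs_name`, `gs_url`, and `gene_symbol` are assumptions about the table layout, and the set names, URLs, and gene labels are made up for the example.

```r
# Toy stand-in for a gene set table. The column names (gs_name, gs_url,
# gene_symbol) are assumptions for this sketch; the values are invented.
gene_sets <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gs_url = c(rep("http://example.org/SET_A", 2),
             rep("http://example.org/SET_B", 3)),
  gene_symbol = c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5"),
  stringsAsFactors = FALSE
)

# Build a named list with one character vector per gene set:
# set name first, then the URL, then the genes in row order.
set_names <- unique(gene_sets$gs_name)
gmt_list <- lapply(set_names, function(nm) {
  rows <- gene_sets[gene_sets$gs_name == nm, ]
  c(nm, rows$gs_url[1], unique(rows$gene_symbol))
})
names(gmt_list) <- set_names

# Collapse each vector into one tab-separated line and write the file.
gmt_lines <- vapply(gmt_list, paste, character(1), collapse = "\t")
out_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, out_file)
```

Note that the genes come out in the row order of the input table, which is not necessarily the order used in the GMT files distributed by Broad.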


The tricky parts I am facing are extracting the elements belonging to specific gene set collections and recovering the original order of the genes in a given gene set.

Might you be able to give me some information on how I could recover the original order of the genes in a given gene set? As I understand it, the gene sets in GSEA gmt files are in a specific order, from most to least important. I don't see this ordering included in the datasets offered here; am I missing something?

As a proof of concept, I would like to be able to convert the Homo sapiens data back to separate gmt files that match those distributed by Broad. I don't know how I would get the gene order, though.

I am looking for a way to extract the genes relating to these 5 specific pathway collections:

  • msigdb.v7.5.1.symbols.gmt.txt
  • c2.cp.kegg.v7.5.1.symbols.gmt.txt
  • c2.cp.reactome.v7.5.1.symbols.gmt.txt
  • c5.go.bp.v7.5.1.symbols.gmt.txt
  • h.all.v7.5.1.symbols.gmt.txt
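One way to pull out these collections could look like the sketch below. It assumes the gene set table has `gs_cat` and `gs_subcat` columns with values such as `"C2"` and `"CP:KEGG"`; treat those names and values as assumptions to check against the actual data. The toy rows, the `pick_collection` helper, and the list keys are hypothetical and exist only for illustration.

```r
# Toy table; the gs_cat / gs_subcat columns and their values are assumed,
# and the set names and gene labels are invented for the example.
gene_sets <- data.frame(
  gs_cat = c("H", "C2", "C2", "C5", "C3"),
  gs_subcat = c("", "CP:KEGG", "CP:REACTOME", "GO:BP", "TFT:GTRD"),
  gs_name = c("HALLMARK_EXAMPLE", "KEGG_EXAMPLE", "REACTOME_EXAMPLE",
              "GOBP_EXAMPLE", "OTHER_EXAMPLE"),
  gene_symbol = c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5"),
  stringsAsFactors = FALSE
)

# One filter per target file; NA means "do not filter on this column".
collections <- list(
  "msigdb" = list(category = NA, subcategory = NA),
  "c2.cp.kegg" = list(category = "C2", subcategory = "CP:KEGG"),
  "c2.cp.reactome" = list(category = "C2", subcategory = "CP:REACTOME"),
  "c5.go.bp" = list(category = "C5", subcategory = "GO:BP"),
  "h.all" = list(category = "H", subcategory = NA)
)

# Hypothetical helper: keep the rows matching the category/subcategory.
pick_collection <- function(df, category = NA, subcategory = NA) {
  keep <- rep(TRUE, nrow(df))
  if (!is.na(category)) keep <- keep & df$gs_cat == category
  if (!is.na(subcategory)) keep <- keep & df$gs_subcat == subcategory
  df[keep, , drop = FALSE]
}

# One filtered table per collection, named after the target file prefix.
by_collection <- lapply(collections, function(f) {
  pick_collection(gene_sets, f$category, f$subcategory)
})
```

Each element of `by_collection` could then be turned into GMT lines separately, one file per collection.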

Finally, thank you again for the package; it is a lot of work. Matching human and X species gene names is not a trivial task.

@rLannes

rLannes commented Nov 22, 2022

Hi, I made a custom function.
It is very basic and assumes the output file does not already exist.
Feel free to change it, as I am very new to the R way.

```
# Write a GMT file to <out_file> from the data frame or tibble <data>,
# using the column <gene_id> as the gene identifier.
# <gene_id> must be one of the column names of <data>.
# Lines are appended, so <out_file> should not already exist.
to_gmt <- function(data, gene_id, out_file) {
  # One vector of genes per gene set, in row order.
  sets <- split(data[[gene_id]], data$gs_name)
  for (name_set in names(sets)) {
    description <- data$gs_description[data$gs_name == name_set][1]
    # One tab-separated line per set: name, description, then genes.
    line <- paste(c(name_set, description, sets[[name_set]]), collapse = "\t")
    cat(line, "\n", sep = "", file = out_file, append = TRUE)
  }
}
```

3 participants