-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathREADME.Rmd
executable file
·429 lines (299 loc) · 22 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
---
title: ""
author: "`r rworkflows::use_badges(add_doi = '10.1093/bioinformatics/btab658')`"
date: "<h5>README updated: <i>`r format( Sys.Date(), '%b-%d-%Y')`</i></h5>"
output:
github_document
---
```{r, echo=FALSE, include=FALSE}
pkg <- read.dcf("DESCRIPTION", fields = "Package")[1]
description <- read.dcf("DESCRIPTION", fields = "Description")[1]
```
## ``r pkg``: `r gsub("echoverse module: ","", description)`
### The *echoverse*
``r pkg`` is part of the [__*echoverse*__](https://github.com/topics/echoverse), a suite of R packages designed to facilitate different steps in genetic fine-mapping.
``r pkg`` calls each of these other packages (i.e. "modules") internally to create a unified pipeline. However, you can also use each module independently to create your own custom workflows.
#### __*echoverse*__ dependency graph
<img src="./images/echoverse.png" height="400px" style="border-radius: 20px;">
> Made with [`echodeps`](https://github.com/RajLabMSSM/echodeps), yet another __*echoverse*__ module. See [here for the interactive version](https://rajlabmssm.github.io/Fine_Mapping/echolocatoR.dep_graph.html) with package descriptions and links to each GitHub repo.
### Citation
If you use ``r pkg``, or any of the _*echoverse*_ modules, please cite:
> `r citation(pkg)$textVersion`
## Installation
```R
if(!require("remotes")) install.packages("remotes")
remotes::install_github("RajLabMSSM/`r pkg`")
library(`r pkg`)
```
#### Installation troubleshooting
<details>
- Because `echolocatoR` now relies on many subpackages that rely on one another,
sometimes errors can occur when R tries to update one R package before
updating its *echoverse* dependencies (and thus is unable to find new functions).
As *echoverse* stabilizes over time, this should happen less frequently. However,
in the meantime the solution is to simply rerun
`remotes::install_github("RajLabMSSM/echolocatoR")`
until all subpackages are fully updates.
- `susieR`: Sometimes an older version of `susieR` is installed from CRAN (e.g. 0.11.92), but `echofinemap` requires version >= 0.12.0. To get around this, you can install `susieR` directly from GitHub: `devtools::install_github("stephenslab/susieR")`
- System dependencies can sometimes cause issues when using different packages. I've tried to account for as many of these as possible automatically within the code, but using the
__Docker/Singularity__ provided below can further mitigate these issues.
- The R package `XML` (which some *echoverse* subpackages depend on) has some additional system dependencies that must be installed beforehand.
If `XML` does not install automatically, try installing `lbxml` on your system using `brew install libxml2` (MacOS),
`sudo apt-get install libxml2` (Linux) or
`conda install r-xml` if you are running `echolocatoR` from within a
conda environment.
</details>
### [Optional] [Docker/Singularity](https://rajlabmssm.github.io/echolocatoR/articles/docker)
`echolocatoR` now has its own dedicated Docker/Singularity container!
This greatly reduces issues related to system dependency conflicts and provides a containerized interface for Rstudio through your web browser. See [here for installation instructions](https://rajlabmssm.github.io/echolocatoR/articles/docker).
## Documentation
### [Website](https://rajlabmssm.github.io/`r pkg`)
### [Get started](https://rajlabmssm.github.io/`r pkg`/articles/`r pkg`)
### [Bugs/requests](https://github.com/RajLabMSSM/echolocatoR/issues)
Please report any bugs/requests on [GitHub Issues](https://github.com/RajLabMSSM/echolocatoR/issues).
[Contributions](https://github.com/RajLabMSSM/echolocatoR/pulls) are welcome!
### All *echoverse* vignettes
<details>
```{r, results="asis"}
echoverse <- c('echolocatoR','echodata','echotabix',
'echoannot','echoconda','echoLD',
'echoplot','catalogueR','downloadR',
'echofinemap','echodeps', # under construction
'echogithub')
toc <- echogithub::github_pages_vignettes(owner = "RajLabMSSM",
repo = echoverse,
as_toc = TRUE,
verbose = FALSE)
```
</details>
## Introduction
Fine-mapping methods are a powerful means of identifying causal variants
underlying a given phenotype, but are underutilized due to the technical
challenges of implementation. ``r pkg`` is an R package that
automates end-to-end genomics fine-mapping, annotation, and plotting in
order to identify the most probable causal variants associated with a
given phenotype.
It requires minimal input from users (a GWAS or QTL summary statistics
file), and includes a suite of statistical and functional fine-mapping
tools. It also includes extensive access to datasets (linkage
disequilibrium panels, epigenomic and genome-wide annotations, QTL).
The elimination of data gathering and preprocessing steps enables rapid
fine-mapping of many loci in any phenotype, complete with locus-specific
publication-ready figure generation. All results are merged into a
single per-SNP summary file for additional downstream analysis and
results sharing. Therefore ``r pkg`` drastically reduces the
barriers to identifying causal variants by making the entire
fine-mapping pipeline rapid, robust and scalable.
<img src="./images/echolocatoR_Fig1.png" style="border-radius: 10px;">
## Literature
### For applications of ``r pkg`` in the literature, please see:
> 1. E Navarro, E Udine, K de Paiva Lopes, M Parks, G Riboldi, BM Schilder…T Raj (2020) Dysregulation of mitochondrial and proteo-lysosomal genes in Parkinson's disease myeloid cells. Nature Genetics. https://doi.org/10.1101/2020.07.20.212407
> 2. BM Schilder, T Raj (2021) Fine-Mapping of Parkinson’s Disease Susceptibility Loci Identifies Putative Causal Variants. Human Molecular Genetics, ddab294, https://doi.org/10.1093/hmg/ddab294
> 3. K de Paiva Lopes, G JL Snijders, J Humphrey, A Allan, M Sneeboer, E Navarro, BM Schilder…T Raj (2022) Genetic analysis of the human microglial transcriptome across brain regions, aging and disease pathologies. Nature Genetics, https://doi.org/10.1038/s41588-021-00976-y
## `echolocatoR` v1.0 vs. v2.0
There have been a series of major updates between `echolocatoR` v1.0 and v2.0. Here are some of the most notable ones (see **Details**):
<details>
- __*echoverse* subpackages__: `echolocatoR` has been broken into separate subpackages, making it much easier to edit/debug each step of the full `finemap_loci` pipeline, and improving robustness throughout. It also provides greater flexibility for users to construct their own custom pipelines from these modules.
- __`GITHUB_TOKEN`__: GitHub now requires users to create Personal Authentication Tokens (PAT) to avoid download limits. This is essential for installing `echolocatoR` as many resources from GitHub need to be downloaded. See [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) for further instructions.
= `echodata::construct_colmap()`: Previously, users were required to input key column name mappings as separate arguments to `echolocatoR::finemap_loci`. This functionality has been deprecated and replaced with a single argument, `colmap=`. This allows users to save the `construct_colmap()` output as a single variable and reuse it later without having to write out each mapping argument again (and helps reduce an already crowded list of arguments).
- __`MungeSumstats`__: `finemap_loci` now accepts the output of [`MungeSumstats::format_sumstats`/`import_sumstats`](https://github.com/neurogenomics/MungeSumstats) as-is (without requiring `colmap=`, so long as `munged=TRUE`). Standardizing your GWAS/QTL summary stats this way greatly reduces (or eliminates) the time taken to do manual formatting.
- __`echolocatoR::finemap_loci` arguments__: Several arguments have been deprecated or had their names changed to be harmonized across all the subpackages and use a unified naming convention. See `?echolocatoR::finemap_loci` for details.
- __`echoconda`__: The *echoverse* subpackage `echoconda` now handles all conda environment creation/use internally and automatically, without the need for users to create the conda environment themselves as a separate step. Also, the default conda env `echoR` has been replaced by `echoR_mini`, which reduces the number of dependencies to just the bare minimum (thus greatly speeding up build time and reducing potential version conflicts).
- __`FINEMAP`__: More outputs from the tool `FINEMAP` are now recorded in the `echolocatoR` results (see `?echofinemap::FINEMAP` or [this Issue](https://github.com/RajLabMSSM/echofinemap/issues/7) for details). Also, a common dependency conflict between `FINEMAP`>=1.4 and MacOS has been resolved (see [this Issue](https://github.com/RajLabMSSM/echofinemap/issues/9) for details.
- __`echodata`__: All example data and data transformation functions have been moved to the *echoverse* subpackage [`echodata`](https://github.com/RajLabMSSM/echodata).
- __`LD_reference=`__: In addition to the *UKB*, *1KGphase1/3* LD reference panels,
`finemap_loci()` can now take custom LD panels by supplying
`finemap_loci(LD_reference=)` with a
list of paths to VCF files (.vcf / vcf.gz / vcf.bgz) or
pre-computed LD matrices with RSIDs as the row/col names
(.rda / .rds / .csv / .tsv. / .txt / .csv.gz / tsv.gz / txt.gz).
- __Expanded fine-mapping methods__: "ABF”, "COJO_conditional", "COJO_joint" "COJO_stepwise","FINEMAP","PAINTOR" (including multi-GWAS and multi-ancestry fine-mapping),"POLYFUN_FINEMAP" ,"POLYFUN_SUSIE","SUSIE"
- __`FINEMAP` fixed__: There were a number of issues with `FINEMAP` due to differing output formats across different versions, system dependency conflicts, and the fact that it can produce multiple Credible Sets. All of these have been fixed and the latest version of `FINEMAP` can be run on all OS platforms.
- __Debug mode__: Within `finemap_loci()` I use a `tryCatch()` when iterating across loci so that if one locus fails, the rest can continue. However this prevents using traceback feature in R, making debugging hard. Thus I now enabled debugging mode via a new argument: `use_tryCatch=FALSE`.
</details>
## Output descriptions
By default, `echolocatoR::finemap_loci()` returns a nested list containing grouped by locus names (e.g. `$BST1`, `$MEX3C`). The results of each locus contain the following elements:
<details>
- `finemap_dat`: Fine-mapping results from all selected methods merged with the original summary statistics (i.e. __Multi-finemap results__).
- `locus_plot`: A nested list containing one or more zoomed views of locus plots.
- `LD_matrix`: The post-processed LD matrix used for fine-mapping.
- `LD_plot`: An LD plot (if used).
- `locus_dir`: Locus directory results are saved in.
- `arguments`: A record of the arguments supplied to `finemap_loci`.
In addition, the following object summarizes the results from the locus-specific elements:
- `merged_dat`: A merged `data.table` with all fine-mapping
results from all loci.
### Multi-finemap results files
The main output of `echolocatoR` are the multi-finemap files (for
example, `echodata::BST1`). They are stored in the locus-specific
*Multi-finemap* subfolders.
#### Column descriptions
- **Standardized GWAS/QTL summary statistics**: e.g.
`SNP`,`CHR`,`POS`,`Effect`,`StdErr`. See `?finemap_loci()` for
descriptions of each.
- **leadSNP**: The designated proxy SNP per locus, which is the SNP
with the smallest p-value by default.
- **\<tool\>.CS**: The 95% probability Credible Set (CS) to which a
SNP belongs within a given fine-mapping tool's results. If a SNP is
not in any of the tool's CS, it is assigned `NA` (or `0` for the
purposes of plotting).
- **\<tool\>.PP**: The posterior probability that a SNP is causal for
a given GWAS/QTL trait.
- **Support**: The total number of fine-mapping tools that include the
SNP in its CS.
- **Consensus_SNP**: By default, defined as a SNP that is included in
the CS of more than `N` fine-mapping tool(s), i.e. `Support>1`
(default: `N=1`).
- **mean.PP**: The mean SNP-wise PP across all fine-mapping tools
used.
- **mean.CS**: If mean PP is greater than the 95% probability
threshold (`mean.PP>0.95`) then `mean.CS` is 1, else 0. This tends
to be a very stringent threshold as it requires a high degree of
agreement between fine-mapping tools.
### Notes
- Separate multi-finemap files are generated for each LD reference
panel used, which is included in the file name (e.g.
*UKB_LD.Multi-finemap.tsv.gz*).
- Each fine-mapping tool defines its CS and PP slightly differently,
so please refer to the associated original publications for the
exact details of how these are calculated (links provided above).
</details>
## Fine-mapping tools
Fine-mapping functions are now implemented via [`echofinemap`](https://github.com/RajLabMSSM/echofinemap):
<details>
- ``r pkg`` will automatically check whether you have the
necessary columns to run each tool you selected in
`echolocatoR::finemap_loci(finemap_methods=...)`.
It will remove any tools that for
which there are missing necessary columns, and produces a message
letting you know which columns are missing.
- Note that some columns (e.g.
`MAF`,`N`,`t-stat`) will be automatically inferred if missing.
- For easy reference, we list the necessary columns here as well.
See `?echodata::construct_colmap()` for descriptions of these columns.
All methods require the columns: `SNP`,`CHR`,`POS`,`Effect`,`StdErr`
</details>
<br>
```{r, fig.width=12}
fm_methods <- echofinemap::required_cols(add_versions = FALSE,
embed_links = TRUE,
verbose = FALSE)
knitr::kable(x = fm_methods)
```
## Datasets
Datasets are now stored/retrieved via the following **echoverse** subpackages:
- [`echodata`](https://github.com/RajLabMSSM/echodata): Pre-computed fine-mapping results. Also handles the semi-automated standardization of summary statistics.
- [`echoannot`](https://github.com/RajLabMSSM/echoannot): Annotates GWAS/QTL summary statistics using epigenomics, pre-compiled annotation matrices, and machine learning model predictions of variant-specific functional impacts.
- [`catalogueR`](https://github.com/RajLabMSSM/catalogueR): Large compendium of fully standardized e/s/t-QTL summary statistics.
For more detailed information about each dataset, use `?`:
```R
### Examples ###
library(echoannot)
?NOTT_2019.interactome # epigenomic annotations
library(echodata)
?BST1 # fine-mapping results
```
<details>
### [__`MungeSumstats`__](https://github.com/neurogenomics/MungeSumstats):
- You can search, import, and standardize any GWAS in the [*Open GWAS*](https://gwas.mrcieu.ac.uk/)
database via [`MungeSumstats`](https://github.com/neurogenomics/MungeSumstats), specifically the functions `find_sumstats` and `import_sumstats`.
### [`catalogueR`](https://github.com/RajLabMSSM/catalogueR): QTLs
#### [eQTL Catalogue](https://www.ebi.ac.uk/eqtl/): `catalogueR::eQTL_Catalogue.query()`
- API access to full summary statistics from many standardized
e/s/t-QTL datasets.
- Data access and colocalization tests facilitated through the
[`catalogueR`](https://github.com/RajLabMSSM/catalogueR) R package.
### [`echodata`](https://github.com/RajLabMSSM/catalogueR): fine-mapping results
#### [__*echolocatoR Fine-mapping Portal*__](https://rajlab.shinyapps.io/Fine_Mapping_Shiny): pre-computed fine-mapping results
- You can visit the *echolocatoR Fine-mapping Portal* to interactively visualize and download pre-computed fine-mapping results across a variety of phenotypes.
- This data can be searched and imported programmatically using `echodata::portal_query()`.
### [`echoannot`](https://github.com/RajLabMSSM/echoannot): Epigenomic & genome-wide annotations
#### [Nott et al. (2019)](https://science.sciencemag.org/content/366/6469/1134.abstract): `echoannot::NOTT2019_*()`
- Data from this publication contains results from cell type-specific
(neurons, oligodendrocytes, astrocytes, microglia, & peripheral
myeloid cells) epigenomic assays (H3K27ac, ATAC, H3K4me3) from *ex vivo* pediatric human brain tissue.
#### [Corces et al.2020](https://doi.org/10.1038/s41588-020-00721-x): `echoannot::CORCES2020_*()`
- Data from this publication contains results from single-cell and bulk chromatin accessibility assays ([sc]ATAC-seq) and chromatin interactions ( [`FitHiChIP`](https://ay-lab.github.io/FitHiChIP/)) from *postmortem* adult human brain tissue.
#### [XGR](http://xgr.r-forge.r-project.org): `echoannot::XGR_download_and_standardize()`
- API access to a diverse library of cell type/line-specific
epigenomic (e.g. __ENCODE__) and other genome-wide annotations.
#### [Roadmap](http://www.roadmapepigenomics.org): `echoannot::ROADMAP_query()`
- API access to cell type-specific epigenomic data.
#### [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html): `echoannot::annotate_snps()`
- API access to various genome-wide SNP annotations (e.g. missense,
nonsynonmous, intronic, enhancer).
#### [HaploR](https://cran.r-project.org/web/packages/haploR/vignettes/haplor-vignette.html): `echoannot::annotate_snps()`
- API access to known per-SNP QTL and epigenomic data hits.
</details>
## Enrichment tools
Annotation enrichment functions are now implemented via [`echoannot`](https://github.com/RajLabMSSM/echoannot):
<details>
### Implemented
#### [XGR](http://xgr.r-forge.r-project.org): `echoannot::XGR_enrichment()`
- Binomial enrichment tests between customisable foreground and
background SNPs.
#### [motifbreakR](https://github.com/Simon-Coetzee/motifBreakR): `echoannot::MOTIFBREAKR()`
- Identification of transcript factor binding motifs (TFBM) and
prediction of SNP disruption to said motifs.
- Includes a comprehensive list of TFBM databases via
[MotifDB](https://bioconductor.org/packages/release/bioc/html/MotifDb.html)
(9,900+ annotated position frequency matrices from 14 public
sources, for multiple organisms).
#### [regioneR](http://bioconductor.org/packages/release/bioc/html/regioneR.html): `echoannot::test_enrichment()`
- Iterative pairwise permutation testing of overlap between all combinations of two [`GRangesList`](https://biodatascience.github.io/compbio/bioc/GRL.html) objects.
### Under construction
#### [GARFIELD](https://www.bioconductor.org/packages/release/bioc/html/garfield.html)
- Genomic enrichment with LD-informed heuristics.
#### [GoShifter](https://github.com/immunogenomics/goshifter)
- LD-informed iterative enrichment analysis.
#### [S-LDSC](https://www.nature.com/articles/ng.3954)
- Genome-wide stratified LD score regression.
- Inlccles 187-annotation baseline model from [Gazal et al.
2018](https://www.nature.com/articles/s41588-018-0231-8).
- You can alternatively supply a custom annotations matrix.
</details>
## LD reference panels
LD reference panels are now queried/processed by [`echoLD`](https://github.com/RajLabMSSM/echoLD),
specifically the function `get_LD()`:
<details>
### [UK Biobank](https://www.ukbiobank.ac.uk)
### [1000 Genomes Phase 1](https://www.internationalgenome.org)
### [1000 Genomes Phase 3](https://www.internationalgenome.org)
### Custom LD panel:
- From user-supplied VCFs
### Custom LD panel
- From user-supplied precomputed LD matrices
</details>
## Plotting
Plotting functions are now implemented via:
- [`echoplot`](https://github.com/RajLabMSSM/echoplot): Multi-track locus plots with GWAS, fine-mapping results, and functional annotations (`plot_locus()`). Can also plot multi-GWAS/QTL and multi-ancestry results (`plot_locus_multi()`).
- [`echoannot`](https://github.com/RajLabMSSM/echoannot): Study-level summary plots showing aggregted info across many loci at once (`super_summary_plot()`).
- [`echoLD`](https://github.com/RajLabMSSM/echoLD): Plot an LD matrix using one of several differnt plotting methods (`plot_LD()`).
## Tabix queries
All queries of [`tabix`](http://www.htslib.org/doc/tabix.html)-indexed files (for rapid data subset extraction) are implemented via [`echotabix`](https://github.com/RajLabMSSM/echotabix).
<details>
- `echotabix::convert_and_query()` detects whether the GWAS summary statistics file you provided is already `tabix`-indexed, and it not, automatically performs all steps necessary to convert it (sorting, `bgzip`-compression, indexing) across a wide variety of scenarios.
- `echotabix::query()` contains many different methods for making tabix queries (e.g. `Rtracklayer`,`echoconda`,`VariantAnnotation`,`seqminer`), each of which fail in certain circumstances. To avoid this, `query()`
automatically selects the method that will work for the particular file being queried and your machine's particular versions of R/Bioconductor/OS, taking the guesswork and troubleshooting out of `tabix` queries.
</details>
## Downloads
Single- and multi-threaded downloads are now implemented via [`downloadR`](https://github.com/RajLabMSSM/downloadR).
<details>
- Multi-threaded downloading is performed using [`axel`](https://github.com/axel-download-accelerator/axel), and is particularly useful for speeding up downloads of large files.
- `axel` is installed via the official *echoverse* [conda](https://docs.conda.io/en/latest/) environment: "echoR_mini". This environment is automatically created by the function `echoconda::yaml_to_env()` when needed.
</details>
<hr>
# Developer
<a href="https://bschilder.github.io/BMSchilder/" target="_blank">Brian
M. Schilder, Bioinformatician II</a>
<a href="https://rajlab.org" target="_blank">Raj Lab</a>
<a href="https://icahn.mssm.edu/about/departments/neuroscience" target="_blank">Department
of Neuroscience, Icahn School of Medicine at Mount Sinai</a>
<hr>
# Session info
<details>
```{r Session Info, attr.output='style="max-height: 200px;"'}
utils::sessionInfo()
```
</details>
<br>