gwas-sumstats-harmoniser

GWAS Summary Statistics Data Harmonisation pipeline

The pipeline workflow is managed by snakemake, so it can be followed in the Snakefile.

This pipeline brings the variants to the desired genome assembly and then harmonises them. The harmonisation is performed by sumstat_harmoniser, which is used to a) find the orientation of the variants, b) resolve RSIDs from locations and alleles, and c) orientate the variants to the reference strand.

Installation

The following are required:

- python3
- HTSlib for tabix

Clone the repository and install the Python dependencies:

git clone --recurse-submodules https://github.com/EBISPOT/gwas-sumstats-harmoniser.git # clone this repo and submodules
cd gwas-sumstats-harmoniser
virtualenv --python=python3 .venv # create virtual environment
source .venv/bin/activate # activate virtual environment
pip install -r requirements.txt # install Python dependencies

Executing the pipeline

Environment

The recommended way to run the pipeline is on an HPC cluster. Follow the snakemake guidelines for setting up a profile for your cluster. Running locally is possible but not recommended; if you do, make sure you understand the memory and disk space requirements below. See here for an LSF (bsub) snakemake profile.

Memory

To optimise for speed, the pipeline allocates 28GB for the mapping rule. You could lower this (edit the Snakefile) to around 20GB, but expect failures with anything lower.

Disk space

Allow 20GB for the VCF reference files. If using a local synonyms table (see configuration) allow an additional 70GB (for Ensembl release 100).
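As a rough pre-flight check, the figures above can be compared against the free space on the target volume. This is only a sketch: the helper names are hypothetical and not part of the pipeline, and the 20GB and 70GB values are simply the estimates quoted in this README.

```python
import shutil

# Estimates quoted in this README, in GB. The helpers below are hypothetical.
VCF_REFERENCES_GB = 20
LOCAL_SYNONYMS_GB = 70  # figure given for Ensembl release 100


def required_gb(local_synonyms: bool) -> int:
    """Return the disk space (GB) this README suggests allowing."""
    return VCF_REFERENCES_GB + (LOCAL_SYNONYMS_GB if local_synonyms else 0)


def has_enough_space(path: str, local_synonyms: bool) -> bool:
    """Compare the free space at `path` with the README estimate."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(local_synonyms)


if __name__ == "__main__":
    print("GB needed with a local synonyms table:", required_gb(True))
    print("Enough space in current directory:", has_enough_space(".", True))
```

Pass the directory you intend to use for local_resources as `path`.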

Configuration

Edit config.yaml if you want to change any of the defaults. It is recommended to set an absolute path for local_resources. Set local_synonyms to False if you wish to check variant name synonyms against the Ensembl REST API instead of a local table (not recommended, but an option if you don't have 70GB of free space).

Execution

The pipeline takes .tsv summary statistics in this format.
The file name must follow the convention <any identifier><genome assembly number>.tsv, e.g. my_summary_stats_37.tsv (37 denotes that the genome assembly of the data in the file is hg19, also known as GRCh37; see the config for the assembly table). Note that this number is not the desired build, which is set in the config.
Assuming the pipeline will run on an LSF cluster and the file we want to harmonise is called /path/to/example37.tsv:
snakemake --configfile config.yaml --profile lsf /path/to/example37/harmonised.qc.tsv
See this for an idea of how to run the pipeline.
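The relationship between the input file name and the snakemake target can be sketched in a few lines of Python. The function names are hypothetical and the regular expression is an assumption about the naming convention: it treats all trailing digits as the assembly number, so an identifier that itself ends in a digit would be ambiguous.

```python
import re
from pathlib import PurePosixPath


def parse_assembly(tsv_path: str) -> int:
    """Extract the trailing assembly number from '<identifier><assembly>.tsv'."""
    name = PurePosixPath(tsv_path).name
    match = re.fullmatch(r".*?(\d+)\.tsv", name)
    if match is None:
        raise ValueError(f"{name!r} does not end in <assembly number>.tsv")
    return int(match.group(1))


def snakemake_target(tsv_path: str) -> str:
    """Build the target path: drop '.tsv' and append 'harmonised.qc.tsv'."""
    return str(PurePosixPath(tsv_path).with_suffix("") / "harmonised.qc.tsv")


if __name__ == "__main__":
    path = "/path/to/example37.tsv"
    print(parse_assembly(path))
    print(snakemake_target(path))
```

The second function reproduces the target used in the command above, so a wrapper script could generate one target per input file.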

Pipeline

First make sure your files are correctly formatted using the validator.
Understand the configuration and edit it to your requirements.

Steps

1. Get references

Fetch VCF files from Ensembl and convert to .parquet format

2. Map variants to desired genome assembly

Checks variant IDs against those in the references and updates locations. If not found, or no variant IDs are given, liftover is used.
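The lookup-then-liftover logic of this step might look roughly like the sketch below. It is not the pipeline's code: the reference is modelled as a plain dictionary, and `liftover` is a stand-in callable supplied by the caller rather than a real liftover library call.

```python
from typing import Callable, Dict, Optional, Tuple

Location = Tuple[str, int]  # (chromosome, base-pair position)


def map_variant(
    variant_id: Optional[str],
    location: Optional[Location],
    reference: Dict[str, Location],
    liftover: Callable[[Location], Optional[Location]],
) -> Optional[Location]:
    """Prefer the reference location of a known ID, else fall back to liftover."""
    if variant_id is not None and variant_id in reference:
        return reference[variant_id]
    if location is None:
        return None  # nothing to lift over
    return liftover(location)


if __name__ == "__main__":
    # Toy data, made up purely for illustration.
    toy_reference = {"rs1": ("1", 1000)}
    toy_liftover = lambda loc: (loc[0], loc[1] + 10)
    print(map_variant("rs1", ("1", 900), toy_reference, toy_liftover))
    print(map_variant(None, ("2", 500), toy_reference, toy_liftover))
```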

3. Determine strand consensus

Check the strand of non-palindromic variants by querying their positions and alleles against the references. If the proportion of variants on one strand is >= the threshold (default 0.99, set in the config), a consensus is made. The resulting value (forward, reverse or drop) is carried to the next step.
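One way to picture the consensus decision is the sketch below. It assumes each non-palindromic variant has already been labelled "forward" or "reverse", and it assumes that "drop" is the outcome when neither strand reaches the threshold; the function name and that reading are illustrative, not taken from the pipeline.

```python
from collections import Counter
from typing import Iterable


def strand_consensus(strands: Iterable[str], threshold: float = 0.99) -> str:
    """Return 'forward' or 'reverse' if one strand reaches the threshold, else 'drop'."""
    counts = Counter(strands)
    total = counts["forward"] + counts["reverse"]
    if total == 0:
        return "drop"  # no informative variants at all
    for strand in ("forward", "reverse"):
        if counts[strand] / total >= threshold:
            return strand
    return "drop"


if __name__ == "__main__":
    print(strand_consensus(["forward"] * 995 + ["reverse"] * 5))
    print(strand_consensus(["forward"] * 50 + ["reverse"] * 50))
```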

4. Harmonise associations to reference

See here for more details.
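Two small notions underlie the strand handling in steps 3 and 4: the reverse complement of an allele, and whether an allele pair is palindromic (and therefore uninformative about strand). The sketch below illustrates both for A/C/G/T alleles only; it is not the sumstat_harmoniser implementation, and the function names are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(allele: str) -> str:
    """Return the allele as read on the opposite strand."""
    return allele.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(allele_a: str, allele_b: str) -> bool:
    """True if the pair reads the same on both strands (e.g. A/T or C/G)."""
    return reverse_complement(allele_a) == allele_b.upper()


if __name__ == "__main__":
    print(reverse_complement("ACG"))
    print(is_palindromic("A", "T"), is_palindromic("A", "G"))
```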

5. QC

Update any missing variant IDs.
If given variant ID is different from the one inferred, check if it is a synonym - if not, drop it.
- The first time this runs, if you have specified to use a local synonyms table in the config, it will need to build that table.
Drop any records with missing mandatory fields.
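The QC rules above can be read as a per-record filter. The following is a minimal sketch under stated assumptions: records are plain dictionaries, the field name variant_id and the shape of the synonyms table are hypothetical, and a given ID that turns out to be a synonym is kept unchanged because the text does not say whether it is rewritten.

```python
from typing import Dict, Iterable, Optional, Set


def qc_record(
    record: Dict[str, Optional[str]],
    inferred_id: Optional[str],
    synonyms: Dict[str, Set[str]],
    mandatory: Iterable[str],
) -> Optional[Dict[str, Optional[str]]]:
    """Return the cleaned record, or None if the record should be dropped."""
    given = record.get("variant_id")
    if not given:
        # Fill in a missing variant ID with the inferred one.
        record = {**record, "variant_id": inferred_id}
    elif given != inferred_id and given not in synonyms.get(inferred_id or "", set()):
        return None  # given ID is neither the inferred ID nor a synonym of it
    if any(record.get(field) in (None, "") for field in mandatory):
        return None  # a mandatory field is missing
    return record


if __name__ == "__main__":
    table = {"rs1": {"rs1_old"}}  # toy synonyms table, made up for illustration
    fields = ["variant_id", "p_value"]
    print(qc_record({"variant_id": "", "p_value": "0.01"}, "rs1", table, fields))
    print(qc_record({"variant_id": "rs9", "p_value": "0.01"}, "rs1", table, fields))
```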
