GWAS Summary Statistics Data Harmonisation pipeline
The pipeline workflow is managed by snakemake
, so it can be followed in the Snakefile.
This pipeline, brings the variants to the desired genome assembly and then harmonises them. The harmonisation is performed by sumstat_harmoniser which is used to a) find the orientation of the variants, b) resolve RSIDs from locations and alleles and c) orientate the variants to the reference strand.
The following are required:
-
python3
-
HTSlib for tabix
-
git clone --recurse-submodules https://github.com/EBISPOT/gwas-sumstats-harmoniser.git
# clone this repo and submodules -
cd gwas-sumstats-harmoniser
-
virtualenv --python=python3 .venv
# create virtual environment -
source .venv/bin/activate
# activate virtual environment -
pip install -r requirements.txt
The recommended way to run the pipeline is on HPC. Follow the snakemake guidelines for setting up a profile for your cluster. Although not recommended, it is possible to run locally but understand the memory and disk space requirements. See here for LSF (bsub) snakemake profile.
To optimise for speed, the pipeline allocates 28GB for the mapping rule. You could lower this (edit the Snakefile) to around 20GB, but expect failures anywhere lower.
Allow 20GB for the VCF reference files. If using a local synonyms table (see configuration) allow an additional 70GB (for Ensembl release 100).
Edit the config.yaml if you want to change from any of the defaults. It's recommended to set an absoulte path for the local_resources
. Set local_synonyms
to False
if you wish to check variant name synonynms against the Ensembl REST API (not recommended, but if you don't have 70GB free space, you can do this).
-
The pipeline takes .tsv summary statistics in this format
-
The name must follow the convention
<any identifier><genome assembly number>.tsv
e.g. my_summary_stats_37.tsv (37 denotes the genome assembly of the data in the file is hg19 or GRCh37 - see config for the assembly table). Note that his number is not the desired build, that is set in the config. -
Assuming the pipeline will run on an LSF cluster and the file we want to harmonise is called
/path/to/example37.tsv
: -
snakemake --configfile config.yaml --profile lsf /path/to/example37/harmonised.qc.tsv
-
see this for an idea of how to run
- First make sure your files are correctly formatted using the validator.
- Understand the configuration and edit to you requirements.
- Fetch VCF files from Ensembl and convert to .parquet format
- Checks variant IDs against those in the references and updates locations. If not found, or no variant IDs are given, liftover is used.
- Check the the strand of non-palindromic variants by querying their position and alleles against the refereces. If the percentage is >= threshold (default set to 0.99 in config), a consensus is made. This value (forward, reverse or drop) is carried to the next step.
- see here for more details.
- Update any missing variant IDs.
- If given variant ID is different from the one inferred, check if it is a synonym - if not, drop it.
- The first time this runs, if you have specified to use a local synonyms table in the config, it will need to build that table.
- Drop any records with missing mandatory fields.