Important
- In June 2024 this repository was archived (made read-only) and the code here deprecated
- The code was refactored and migrated to a new set of Python packages contained in the pygscatalog repository
- The PGS Catalog Calculator v2-beta uses these new Python packages
- The Python package pgscatalog-utils is unaffected by this change, and we will continue to publish updates from the new repository
- If you experience problems with our Python tools, please create issues at the new repository
This repository is a collection of useful tools for downloading and working with scoring files from the
PGS Catalog. This is mostly used internally by the PGS Catalog Calculator (PGScatalog/pgsc_calc);
however, other users may find some of these tools helpful.
- download_scorefiles: Download scoring files by PGS ID (accession) in genome builds GRCh37 or GRCh38
- combine_scorefiles: Combine multiple scoring files into a single scoring file in 'long' format
- match_variants: Match target variants (bim or pvar files) against the output of combine_scorefiles to produce scoring files for plink 2
- ancestry_analysis: Use genetic PCA loadings to compare samples to population reference panels, and report PGS adjusted for these axes of genetic ancestry. The PCs will likely have been generated with FRAPOSA (PGS Catalog version)
- validate_scorefiles: Check/validate that the scoring files and harmonized scoring files match the PGS Catalog scoring file formats.
$ pip install pgscatalog-utils
$ download_scorefiles -i PGS000922 PGS001229 -o . -b GRCh37
$ combine_scorefiles -s PGS*.txt.gz -o combined.txt
$ match_variants -s combined.txt -t <example.pvar> --min_overlap 0.75 --outdir .
$ validate_scorefiles -t formatted --dir <scoringfiles_directory> --log_dir <logs_directory>
More details are available using the --help parameter.
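The download, combine, and match steps above can also be chained from Python. The following is a minimal sketch, not part of pgscatalog-utils: the helper names (download_cmd, combine_cmd, match_cmd, run_pipeline) are hypothetical, and the only flags used are the ones shown in the shell example. It assumes the three commands are on the PATH after installation and that downloaded files match PGS*.txt.gz.

```python
import subprocess
from pathlib import Path


def download_cmd(pgs_ids, outdir=".", build="GRCh37"):
    # Mirrors: download_scorefiles -i PGS000922 PGS001229 -o . -b GRCh37
    return ["download_scorefiles", "-i", *pgs_ids, "-o", str(outdir), "-b", build]


def combine_cmd(scorefiles, combined="combined.txt"):
    # Mirrors: combine_scorefiles -s PGS*.txt.gz -o combined.txt
    return ["combine_scorefiles", "-s", *map(str, scorefiles), "-o", str(combined)]


def match_cmd(combined, target, outdir=".", min_overlap=0.75):
    # Mirrors: match_variants -s combined.txt -t <example.pvar> --min_overlap 0.75 --outdir .
    return [
        "match_variants", "-s", str(combined), "-t", str(target),
        "--min_overlap", str(min_overlap), "--outdir", str(outdir),
    ]


def run_pipeline(pgs_ids, target, outdir=".", build="GRCh37"):
    """Run the three steps in order; check=True stops at the first failure."""
    outdir = Path(outdir)
    subprocess.run(download_cmd(pgs_ids, outdir, build), check=True)
    # Assumption: downloaded scoring files are named PGS*.txt.gz, as in the shell example
    scorefiles = sorted(outdir.glob("PGS*.txt.gz"))
    combined = outdir / "combined.txt"
    subprocess.run(combine_cmd(scorefiles, combined), check=True)
    subprocess.run(match_cmd(combined, target, outdir), check=True)
```

Calling run_pipeline(["PGS000922", "PGS001229"], "example.pvar") would run the same three commands as the shell example. Building each command as an argument list avoids shell quoting problems, and the builder functions can be inspected without executing anything.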
Requirements:
- python 3.10
- poetry
$ git clone https://github.com/PGScatalog/pgscatalog_utils.git
$ cd pgscatalog_utils
$ poetry install
$ poetry build
$ pip install --user dist/*.whl
The pgscatalog_utils package is developed as part of the Polygenic Score (PGS) Catalog
(www.PGSCatalog.org) project, a collaboration between the
University of Cambridge’s Department of Public Health and Primary Care (Michael Inouye, Samuel Lambert, Laurent Gil)
and the European Bioinformatics Institute (Helen Parkinson, Aoife McMahon, Ben Wingfield, Laura Harris).
If you use the tool, we ask you to cite our paper describing the software and updated PGS Catalog resource:
- Lambert, Wingfield et al. (2024) The Polygenic Score Catalog: new functionality and tools to enable FAIR research. medRxiv. doi:10.1101/2024.05.29.24307783.
This work has received funding from EMBL-EBI core funds, the Baker Institute, the University of Cambridge, Health Data Research UK (HDRUK), and the European Union's Horizon 2020 research and innovation programme under grant agreement No 101016775 INTERVENE.