FRAPOSA (Fast and Robust Ancestry Prediction by using Online singular value decomposition and Shrinkage Adjustment)
is a python script to predict the ancestry of study samples by the results of principal component analysis (PCA) on a
reference panel. The software originally accompanied
Zhang, Dey, and Lee. Bionformatics (2020), but is being adapted here
for integration with the Polygenic Score (PGS) Catalog pipeline for calculating PGS
(pgsc_calc
).
- Python 3 (see
pyproject.toml
for package requirements) - PLINK (v1.9) - only needed for data preparation
- Binary PLINK files for the reference set:
refpref_raw.{bed,bim,fam}
- Binary PLINK files for the study set:
stupref_raw.{bed,bim,fam}
- If no study set is given, FRAPOSA will only run PCA on the reference set and output the reference PC scores.
- Reference population membership:
refpref_raw.popu
- Without this file, the study PC scores will still be computed, but you will not be able to predict the population memberships for the study samples.
- Format
- Column 1 and 2: Family and individual IDs (same as in
refpref_raw.fam
) - Column 3: Population membership label
fraposa
:
- Reference PC scores:
{refpref}.pcs
- Study PC scores:
{stupref}.pcs
fraposa_pred
:
- Study ancestry memberships:
{stupref}.popu
fraposa_plot
:
- PC plot:
{stupref}.png
Tip
The preprocessing and analysis steps described below are automated by the PGS Catalog Calculator
The reference and study samples must have the same set of variants (i.e. the two .bim
files must be identical).
To extract the common variants between two datasets, you can use PLINK manually or use the included commvar.sh
script:
./commvar.sh refpref_raw stupref_raw refpref stupref
This command will find the common variants in refpref_raw.{bed,bim,fam}
and stupref_raw.{bed,bim,fam}
and then output the intersected datasets in refpref.{bed,bim,fam}
and stupref.{bed,bim,fam}
.
Important
commvar.sh
requires identical allele orientation between the two datasets.
In the pgsc_calc
pipeline we use a more flexible program pgscatalog-intersect
to identify common variants
FRAPOSA loads all the study samples into memory. If the study set is too large, its samples can be split into smaller
batches. Then FRAPOSA can be run on each batch sequentially or (embarrassingly) parallelly. This can be done by either
(1) only reading a subset of samples from the study stupref.{bed,bim,fam}
files, or (2) splitting the inputs.
You can determine unique chunks of samples to be read into memory. A {stupref}.fam
file can be split into sets of 1000
samples using the following command:
cut -f2 -d ' ' {stupref}.fam | split -l 1000 -a 4 - split_ids_
This will yield a list of files split_ids_*
that you can then run in series or in paraellel, for example:
fraposa {refpref} # run so that the *dat files are already made and consistent across splits of the study data
for x in split_ids_*; do
fraposa {refpref} --stu_filepref {stupref} --stu_filt_iid $x --out $x
done
This has the benefit of keeping space low - it's better to have pr
Just as for extracting the common variants, you can split the study samples manually using PLINK or run the included
script splitindiv.sh
:
./splitindiv.sh stupref n i stupref_batch_i
which divides the samples in stupref.{bed,bim,fam}
evenly into n
batches and saves the samples in batch i
into
stupref_batch_i.{bed,bim,fam}
. For example, if stupref.{bed,bim,fam}
has 100,000 samples, then
./splitindiv.sh stupref 100 12 stupref_batch_12
produces stupref_batch_12.{bed,bim,fam}
that contains sample 12,001 to 13,000. To generate all the 100 batches,
you can use
for i in `seq 1 100`; do
./splitindiv.sh stupref 100 $i stupref_batch_$i
done
To use FRAPOSA with the default settings, run
fraposa --stu_filepref stupref refpref
This will produce refpref.pcs
, which contains the IIDs and reference PC scores, and stupref.pcs
, which contains
the IIDs and the study PC scores. Some intermediate files (*.dat
) are produced to reduce the computation time for
projections (e.g. in paraellel) and checking that the variants match. The full set of options for the script are:
$ fraposa -h
usage: fraposa [-h] [--stu_filepref STU_FILEPREF] [--stu_filt_iid STU_FILT_IID] [--method METHOD] [--dim_ref DIM_REF] [--dim_stu DIM_STU] [--dim_online DIM_ONLINE] [--dim_rand DIM_RAND] [--dim_spikes DIM_SPIKES] [--dim_spikes_max DIM_SPIKES_MAX] [--out OUT] ref_filepref
positional arguments:
ref_filepref Prefix of the binary PLINK file for the reference samples.
options:
-h, --help show this help message and exit
--stu_filepref STU_FILEPREF
Prefix of the binary PLINK file for the study samples.
--stu_filt_iid STU_FILT_IID
File with list of IIDs to extract from the study file
--method METHOD The method for PCA prediction. oadp: most accurate. adp: accurate but slow. sp: fast but inaccurate. Default is odap.
--dim_ref DIM_REF Number of PCs you need.
--dim_stu DIM_STU Number of PCs predicted for the study samples before doing the Procrustes transformation. Only needed for the oadp and adp methods. Default is 2*dim_ref.
--dim_online DIM_ONLINE
Number of PCs to calculate in online SVD. Only needed for the oadp method. Default is 2*dim_stu
--dim_rand DIM_RAND Number of reference PCs to calculate when using randomized online SVD
--dim_spikes DIM_SPIKES
Number of PCs to adjust for shrinkage. Only needed for the ap method. If this argument is not set, dim_spikes_max will be used.
--dim_spikes_max DIM_SPIKES_MAX
The maximal number of PCs to adjust for shrinkage. Only needed for the ap method. This argument will be ignored if dim_spikes is set. Default is 4*dim_ref.
--out OUT Prefix of output file(s). Default is stu_filepref
FRAPOSA_PGSC includes 3 methods for PC score prediction.
-
OADP (default and recommended): This method is fast and provides robust PC score prediction by using the online SVD algorithm.
-
SP (fast but inaccurate): This method is similar to AP and is the standard method of PC prediction. It computes the PC loadings of the reference set and projects the (standardized) study samples onto them. Its speed is the same as AP but does not adjust for the shrinkage bias, which makes it inaccurate when the number of variants greatly exceeds the sample size.
-
ADP (accurate but slow): This method is similar to OADP but has a much higher computation complexity. While OADP only computes the top few PCs, ADP computes all the PCs (i.e. running a full eigendecomposition for every study sample). The results are very close to OADP's.
To change the analysis method, set the --method
option. For example,
./fraposa_runner.py --stu_filepref stupref --method sp refpref
The original package also implemented a fourth method that was deprecated to remove the external dependancy on R/hdpca:
AP (also recommended):This method is even faster and its results are close to OADP's.However, sometimes you may want to manually set the number of PCs to be adjusted for shrinkage (i.e. by settingRequired the python package (--dim_spikes
) if you believe that a shrunk PC has not been adjusted automatically.rpy2
), an installation of R and the R package (hdpca
).
Several PCA-related parameters can be changed. For example, to set the number of reference PCs to 20, run
./fraposa_runner.py --stu_filepref stupref --dim_ref 20 refpref
To learn all the options for FRAPOSA, run
./fraposa_runner.py --help
If you have run FRAPOSA previously by using
- the same reference set, and
- different parameter settings
(e.g. by changin
--dim_ref
or--dim_stu
),
then you need to delete all the intermediate .dat
files with the same prefix as this reference set.
FRAPOSA saves the intermediate files related to PCA on the reference set. Specifically, variants used ({refpref}_vars.dat
),
the mean and standard deviation of each variant ({refpref}_mnsd.dat
), singular values ({refpref}_s.dat
), reference
PC loadings ({refpref}_U.dat
), scaled ({refpref}.pcs
) and unscaled (refpref_V.dat
) reference PC scores are saved
and will be automatically loaded if the same reference set is used again. This avoids running PCA on the same reference
set multiple times, especially in the case when the study samples are split into batches and are analyzed with the same
reference set.
WARNING: FRAPOSA only checks whether the reference set file prefix is the same when deciding whether to load the intermediate files. It does not detect whether the parameters have been changed.
After predicting the study samples' PC scores, their ancestry memberships can also be predicted, if the reference
ancestry information refpref.popu
is provided. Running
./predstupopu.py refpref stupref
will produce stupref.popu
, which contains
- IDs (columns 1 and 2)
- the population that the study sample most likely belongs to (column 3)
- how likely the study sample belongs to the population in column 3 (column 4)
- The distance between the study sample and the nearest reference sample (column 5)
- How likely the study sample belongs to each of the refrence populations (columns 6 to 6+q-1, where q is the number of reference populations)
- Names of the refrence populations. This is the same in every row. (columns 6+q to 6+2q-1)
Population prediction is done by using the k-nearest neighbor algorithm to classify the study PC scores with respect to the reference PC scores. You can set the number of neighbors or the method for calculating the weights by using
./predstupopu.py --nneighbors 20 --weights uniform refpref stupref
A simple script for plotting the PC scores is included:
./plotpcs.py refpref stupref
The PC plot will be saved to refpref.png
.
An example data set can be found here, which includes
- Reference data (
thousandGenomes.{bed,bim,fam,popu}
)- This data set contains the 2,492 unrelated samples from the 1000 Genomes project with 637,177 SNPs that are included in the Human Genome Diversity Project. The population memberships of the samples are also provided.
- Study data (
exampleStudySamples.{bed,bim,fam}
)- This data set contains 500 samples with 145,282 SNPs. The ancestry information of these samples can be predicted by FRAPOSA.