Genome-wide association studies have linked millions of genetic variants to human phenotypes, but translating this information clinically has been challenged by a lack of biological understanding and widespread genetic interactions. With the advent of the Transformer deep learning architecture, new opportunities arise in creating predictive biological models that are both accurate and easily interpretable. Toward this goal we describe G2PT, a hierarchical Genotype-to-Phenotype Transformer that models bidirectional information flow among polymorphisms, genes, molecular systems, and phenotypes. G2PT effectively learns to predict metabolic traits in UK Biobank, including risk for diabetes and triglyceride-to-HDL cholesterol (TG/HDL) ratio, outperforming previous polygenic models.
A conda environment file, `environment.yml`, is provided. Create the environment with:
conda env create python==3.6 --name envname --file=environment.yml
To train a new model using a custom data set, first make sure that you have a proper virtual environment set up. Also make sure that you have all the required files to run the training scripts:
- Participant genotype files:
  - You can provide PLINK binary files.
    - The `--flip` argument swaps the ref. and alt. alleles. Using `--flip` is recommended, because it encodes homozygous alt. as 2.
  - Alternatively, you can provide a tab-delimited file containing per-individual genotype data to reduce memory usage.
    - The index indicates the sample ID. The `homozygous_a0`, `heterozygous`, and `homozygous_a1` columns contain the indices of the SNPs that carry that genotype.
- Example of tab-delimited genotype file
Sample ID | homozygous_a0 | heterozygous | homozygous_a1 |
---|---|---|---|
1000909 | 0,1,3,5,7,9 | 2,4,5 | 6,8 |
1000303 | 1,3,6,7,8,9 | 2,5 | 4 |
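
The tab-delimited genotype layout above can be expanded into a per-SNP dosage vector. The following is a minimal sketch, not G2PT's own loader: the function names, the `MISSING` marker, and the choice to encode `homozygous_a1` as 2 are assumptions made for illustration.

```python
import csv
import io

MISSING = -1  # hypothetical marker for SNP indices not listed in any column


def parse_index_list(cell):
    """Turn a cell such as '2,5' into [2, 5]; an empty cell gives []."""
    cell = cell.strip()
    return [int(token) for token in cell.split(",")] if cell else []


def row_to_dosages(row, n_snps):
    """Build a dosage vector (0, 1, 2 or MISSING) from one genotype row."""
    dosages = [MISSING] * n_snps
    # Assumption: homozygous_a0 -> 0, heterozygous -> 1, homozygous_a1 -> 2.
    encoding = (("homozygous_a0", 0), ("heterozygous", 1), ("homozygous_a1", 2))
    for column, dosage in encoding:
        for snp_index in parse_index_list(row[column]):
            dosages[snp_index] = dosage
    return dosages


def read_genotypes(handle, n_snps):
    """Map each sample ID (first column) to its dosage vector."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genotypes = {}
    for fields in reader:
        row = dict(zip(header[1:], fields[1:]))
        genotypes[fields[0]] = row_to_dosages(row, n_snps)
    return genotypes


# One sample taken from the example table, with 10 SNPs in total.
example = (
    "\thomozygous_a0\theterozygous\thomozygous_a1\n"
    "1000303\t1,3,6,7,8,9\t2,5\t4\n"
)
genotypes = read_genotypes(io.StringIO(example), n_snps=10)
print(genotypes["1000303"])
```

Storing only SNP indices per genotype class keeps the file sparse; the dense vector is built only when a sample is actually needed.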
- Covariates files
  - Files containing covariates and phenotypes.
  - The format is the same as `.cov` and `.pheno` in PLINK.
    - To use a subset of covariates, pass `--cov-ids` (e.g. with `--cov-ids SEX AGE`, the model uses only SEX and AGE as covariates).
  - If you provide PLINK bfiles without a `.cov` file, covariates are generated from the `.fam` file (sex only).
  - If you do not provide a `.pheno` file, you should include a `PHENOTYPE` column in the training and validation covariate files.
- Example of covariates file
FID | IID | PHENOTYPE | SEX | AGE | PC1 | PC2 | ... | PC10 |
---|---|---|---|---|---|---|---|---|
10008090 | 10008090 | 1.2 | 1 | 48 | 3 | 0.3 | ... | 0.5 |
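
The effect of `--cov-ids` on a covariates file like the one above can be sketched as follows. This is an illustrative reader, not the project's code: the function name `load_covariates` and the returned dictionary layout are assumptions, and the file is assumed to be tab-delimited with a header row.

```python
import csv
import io


def load_covariates(handle, cov_ids=None):
    """Return {IID: {"phenotype": float, "covariates": {name: float}}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    reserved = {"FID", "IID", "PHENOTYPE"}
    # Without cov_ids, every non-reserved column is used as a covariate.
    names = cov_ids or [c for c in reader.fieldnames if c not in reserved]
    table = {}
    for row in reader:
        table[row["IID"]] = {
            "phenotype": float(row["PHENOTYPE"]),
            "covariates": {name: float(row[name]) for name in names},
        }
    return table


example = (
    "FID\tIID\tPHENOTYPE\tSEX\tAGE\tPC1\tPC2\n"
    "10008090\t10008090\t1.2\t1\t48\t3\t0.3\n"
)
# Mirrors `--cov-ids SEX AGE`: only SEX and AGE are kept.
subset = load_covariates(io.StringIO(example), cov_ids=["SEX", "AGE"])
print(subset["10008090"])
```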
- Ontology (hierarchy) file:
  - `--onto`: A tab-delimited file that contains the ontology (hierarchy) defining the structure of the branch of the G2PT model that encodes the genotypes. The first column is always a term (subsystem or pathway), and the second column is a term or a gene. The third column should be "default" when the line represents a link between terms (if you have nested subtrees, you can use any name other than "gene"), and "gene" when the line represents an annotation link between a term and a gene. A sample hierarchy is shown in the example below.
  - `--subtree_order`: if the ontology has nested subtrees, you can set this option. The default is `['default']` (no nested subtree).
- Example of ontology file
parent | child | interaction_type |
---|---|---|
GO:0045834 | GO:0045923 | default |
GO:0045834 | GO:0043552 | default |
GO:0045923 | AKT2 | gene |
GO:0045923 | IL1B | gene |
GO:0043552 | PIK3R4 | gene |
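
To make the three-column format concrete, the sketch below splits the rows into term-to-term and term-to-gene links and collects the genes under a term. It is a hedged illustration only: `load_ontology` and `genes_under` are hypothetical helpers, the file is assumed to have no header row, and any third-column value other than "gene" is treated as a link between terms.

```python
import csv
import io
from collections import defaultdict


def load_ontology(handle):
    """Split ontology rows into term-to-term and term-to-gene links."""
    term_children = defaultdict(list)
    term_genes = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        parent, child, link_type = row[:3]
        if link_type == "gene":
            term_genes[parent].append(child)
        else:  # "default" or a custom subtree name
            term_children[parent].append(child)
    return dict(term_children), dict(term_genes)


def genes_under(term, term_children, term_genes):
    """Collect genes annotated to a term or to any of its descendants."""
    genes = set(term_genes.get(term, []))
    for child in term_children.get(term, []):
        genes |= genes_under(child, term_children, term_genes)
    return genes


example = (
    "GO:0045834\tGO:0045923\tdefault\n"
    "GO:0045834\tGO:0043552\tdefault\n"
    "GO:0045923\tAKT2\tgene\n"
    "GO:0045923\tIL1B\tgene\n"
    "GO:0043552\tPIK3R4\tgene\n"
)
term_children, term_genes = load_ontology(io.StringIO(example))
print(sorted(genes_under("GO:0045834", term_children, term_genes)))
```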
- `--snp2gene`: A tab-delimited file that maps SNPs to genes. The first column is the SNP, the second is the gene, and the third is the chromosome.
- Example of snp2gene file
SNP_ID | Gene | Chromosome |
---|---|---|
16:56995236:A:C | CETP | 16 |
8:126482077:G:A | TRIB1 | 8 |
19:45416178:T:G | APOC1 | 19 |
2:27752463:A:G | GCKR | 2 |
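
A snp2gene file in this layout can be grouped by gene with a few lines of Python. The sketch below is illustrative rather than the project's parser: the helper names are hypothetical, the file is assumed to have no header row, and the `chromosome:position:allele:allele` reading of the SNP ID is inferred from the example rows only.

```python
import csv
import io
from collections import defaultdict


def parse_snp_id(snp_id):
    """Split an ID such as '16:56995236:A:C' into its four parts."""
    chrom, pos, allele_1, allele_2 = snp_id.split(":")
    return {"chrom": chrom, "pos": int(pos), "alleles": (allele_1, allele_2)}


def load_snp2gene(handle):
    """Return {gene: [snp_id, ...]}, checking the chromosome column."""
    gene_to_snps = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        snp_id, gene, chrom = row[:3]
        if parse_snp_id(snp_id)["chrom"] != chrom:
            raise ValueError(f"chromosome mismatch for {snp_id}")
        gene_to_snps[gene].append(snp_id)
    return dict(gene_to_snps)


example = (
    "16:56995236:A:C\tCETP\t16\n"
    "8:126482077:G:A\tTRIB1\t8\n"
    "19:45416178:T:G\tAPOC1\t19\n"
    "2:27752463:A:G\tGCKR\t2\n"
)
gene_to_snps = load_snp2gene(io.StringIO(example))
print(gene_to_snps["CETP"])
```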
There are several optional parameters that you can provide in addition to the input files:
- Propagation option:
  - `--sys2env`: determines whether the model performs Sys2Env propagation
  - `--env2sys`: determines whether the model performs Env2Sys propagation
  - `--sys2gene`: determines whether the model performs Sys2Gene propagation
- Translation option:
  - `--sys2pheno`: updated system embeddings are used to predict the phenotype
  - `--gene2pheno`: updated gene embeddings are used to predict the phenotype
  - `--snp2pheno`: SNP embeddings are used to predict the phenotype
  - If you do not set any translation option, `sys2pheno` is set automatically.
- Model parameter:
  - `--hidden-dims`: embedding and hierarchical transformer dimension size
- Training parameters:
  - `--epochs`: the number of epochs to run during the training phase. The default is 256.
  - `--val-step`: validation step
  - `--batch-size`: the number of samples processed per batch. The default is 256.
  - `--z-weight`: for continuous phenotypes, individuals with a high absolute Z-score are sampled more often. If set to 0 (default), the whole population is sampled in one training epoch.
  - `--dropout`: dropout rate. The default is 0.2.
  - `--lr`: learning rate. The default is 0.001.
  - `--wd`: weight decay. The default is 0.001.
- GPU option:
  - Single GPU option
    - `--cuda`: the ID of the GPU to use for model training. The default is GPU 0.
  - Multi GPU option (multi-node will be supported)
    - `--multiprocessing-distributed`: determines whether the model is trained in a multi-GPU distributed setup
    - `--world-size`: world size, default is 1
    - `--rank`: rank, default is 0
    - `--local-rank`: local rank, default is 0
    - `--dist-url`: URL for distributed training, e.g. `tcp://127.0.0.1:2222`
    - `--dist-backend`: distributed backend, default is `nccl`
- Model input and output:
  - `--model`: if you have a trained model, the path to the trained model.
  - `--out`: the name of the directory where the trained models are stored.
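
The fallback rule for the translation options (no option given means `sys2pheno` is used) can be sketched with `argparse`. This is not the training script's actual parser; it assumes, for illustration, that each of the three options is a plain on/off switch.

```python
import argparse


def parse_translation_options(argv):
    """Parse the three translation flags and apply the documented default."""
    parser = argparse.ArgumentParser()
    # Assumption: each translation option is a simple boolean switch.
    for flag in ("--sys2pheno", "--gene2pheno", "--snp2pheno"):
        parser.add_argument(flag, action="store_true")
    args = parser.parse_args(argv)
    if not (args.sys2pheno or args.gene2pheno or args.snp2pheno):
        # No translation option was given: fall back to sys2pheno.
        args.sys2pheno = True
    return args


no_flags = parse_translation_options([])
gene_only = parse_translation_options(["--gene2pheno"])
print(no_flags.sys2pheno, gene_only.sys2pheno, gene_only.gene2pheno)
```

With this rule, passing `--gene2pheno` alone leaves `sys2pheno` off, so the default only applies when none of the three options is given.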
You can prune Gene Ontology (Biological Process) based on your GWAS summary statistics
Prune Gene Ontology based on Your GWAS results
You can use the ontology file created in step 1.
To train on a single GPU:

    python train_snp2p_model.py \
        --onto ONTO \
        --snp2gene SNP2Gene \
        --train-bfile TRAIN --train-cov TRAIN.cov --train-pheno TRAIN.pheno \
        --val-bfile VAL --val-cov VAL.cov --val-pheno VAL.pheno \
        --test-bfile TEST --test-cov TEST.cov --test-pheno TEST.pheno \
        --epochs EPOCHS \
        --lr LR \
        --wd WD \
        --batch-size BATCH_SIZE \
        --dropout DROPOUT \
        --val-step VAL_STEP \
        --jobs JOBS \
        --cuda 0 \
        --hidden-dims HIDDEN_DIMS \
        --out OUT

To train on multiple GPUs:

    python train_snp2p_model.py \
        --onto ONTO \
        --snp2gene SNP2Gene \
        --train-bfile TRAIN --train-cov TRAIN.cov --train-pheno TRAIN.pheno \
        --val-bfile VAL --val-cov VAL.cov --val-pheno VAL.pheno \
        --test-bfile TEST --test-cov TEST.cov --test-pheno TEST.pheno \
        --epochs EPOCHS \
        --lr LR \
        --wd WD \
        --batch-size BATCH_SIZE \
        --dropout DROPOUT \
        --val-step VAL_STEP \
        --jobs JOBS \
        --dist-backend 'nccl' \
        --dist-url 'tcp://127.0.0.1:2222' \
        --multiprocessing-distributed \
        --world-size 1 \
        --rank 0 \
        --hidden-dims HIDDEN_DIMS \
        --out OUT

To run prediction with a trained model:

    python predict_attention.py \
        --onto ONTO \
        --snp2gene SNP2Gene \
        --bfile BFILE_prefix --cov COVAR.cov --pheno PHENO.pheno \
        --model trained_model_dir \
        --out output_prefix \
        --batch-size BATCH_SIZE \
        --cpu N_cpu

This will generate:
- Prediction: `{output_prefix}.prediction.csv`, containing only predictions (good for performance evaluation!)
- Attention result: `{output_prefix}.attention.csv`, containing Nx(S+G) system and gene attention results for the whole population
- System importance score: `{output_prefix}.sys_corr.csv`, containing the correlation between system attention and prediction
- Gene importance score: `{output_prefix}.gene_corr.csv`, containing the correlation between gene attention and prediction

Adding the `--prediction-only` argument makes the script produce predictions only (no attention results).
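
The importance scores are described as correlations between attention and prediction. The sketch below shows one way such a score could be computed. It assumes Pearson correlation, which the text above does not specify, and it uses made-up toy values, not real model output. The function names and the dictionary layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def importance_scores(attention, predictions):
    """Correlate each attention column (one value per individual) with predictions."""
    return {name: pearson(values, predictions) for name, values in attention.items()}


# Toy values for four individuals, for illustration only.
attention = {
    "GO:0045923": [0.1, 0.2, 0.3, 0.4],
    "AKT2": [0.4, 0.3, 0.2, 0.1],
}
predictions = [1.0, 2.0, 3.0, 4.0]
scores = importance_scores(attention, predictions)
print(scores)
```

A score near 1 or -1 means the attention on that system or gene tracks the predicted phenotype closely across individuals; a score near 0 means it does not.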
You can visualize the attention flow of a trained G2PT model.
Draw Sankey from Model Attention
You can search for epistasis within a system, then visualize and analyze it. Please go through the example notebook.
Epistasis Search and Visualization Example
- Applying Differential Transformer to genetic factor translation
- Build data loader for `plink` binary file using `sgkit`
- Adding `.cov` and `.pheno` for input