A Genome-Wide-Association-Study Simulation Tool for Genotype Simulation, Phenotype Simulation, and Power Evaluation
You Tang and Xiaolei Liu
Contact: Xiaolei Liu ([email protected])
- Installation
- Data Preparation
- Genotype Simulation
- Phenotype Simulation
- Population Structure
- GWAS
- Method Evaluation
- FAQ and Hints
JDK 1.8 must be installed and the Java environment variables configured before using G2P (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
GUI
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/gG2P_win_64 and double click the .jar file
Pipeline
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/kG2P_win_64
GUI
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/gG2P_mac and double click the .jar file
Pipeline
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/kG2P_mac
permission setting
$ chmod 777 gemma oldplink plink
GUI
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/gG2P_linux_x86_64
and run
$ java -jar gG2P.jar
Pipeline
Download all files from https://github.com/XiaoleiLiuBio/G2P/tree/master/kG2P_linux_x86_64
permission setting
$ chmod 777 gemma oldplink plink
All files should be prepared with the same prefix
For details, see http://zzz.bwh.harvard.edu/plink/data.shtml#ped
Family ID | Individual ID | Father ID | Mother ID | Sex | Trait | marker 1 | marker 2 | marker 3 | marker 4 | marker 5 | marker 6 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 33-16 | 0 | 0 | 0 | 2 | 0 0 | A A | A A | A G | A G | A G |
1 | 38-11 | 0 | 0 | 0 | 2 | 0 0 | A G | A G | A A | A G | A G |
1 | 4226 | 0 | 0 | 0 | 2 | 0 0 | A G | A A | A A | A G | A G |
1 | 4722 | 0 | 0 | 0 | 2 | 0 0 | A G | A G | A A | A G | A G |
1 | A188 | 0 | 0 | 0 | 2 | 0 0 | A A | A A | A A | A G | A G |
1 | A214N | 0 | 0 | 0 | 2 | 0 0 | A G | A A | A G | A A | A G |
1 | A239 | 0 | 0 | 0 | 2 | 0 0 | A A | A A | A G | A G | A A |
Family ID | Individual ID | Father ID | Mother ID | Sex | Trait | marker 1 | marker 2 | marker 3 | marker 4 | marker 5 | marker 6 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 33-16 | 0 | 0 | 0 | 2 | 0 0 | 1 1 | 1 1 | 1 3 | 1 3 | 1 3 |
1 | 38-11 | 0 | 0 | 0 | 2 | 0 0 | 1 3 | 1 3 | 1 1 | 1 3 | 1 3 |
1 | 4226 | 0 | 0 | 0 | 2 | 0 0 | 1 3 | 1 1 | 1 1 | 1 3 | 1 3 |
1 | 4722 | 0 | 0 | 0 | 2 | 0 0 | 1 3 | 1 3 | 1 1 | 1 3 | 1 3 |
1 | A188 | 0 | 0 | 0 | 2 | 0 0 | 1 1 | 1 1 | 1 1 | 1 3 | 1 3 |
1 | A214N | 0 | 0 | 0 | 2 | 0 0 | 1 3 | 1 1 | 1 3 | 1 1 | 1 3 |
1 | A239 | 0 | 0 | 0 | 2 | 0 0 | 1 1 | 1 1 | 1 3 | 1 3 | 1 1 |
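The two tables show the same ped records, first with nucleotide allele codes and then with numeric ones. As a minimal sketch (not part of G2P), the Python below splits one whitespace-delimited ped line into its six leading fields and its allele pairs, then recodes alleles with the A=1, C=2, G=3, T=4 mapping that matches the tables. The helper names `parse_ped_line` and `recode_alleles` are hypothetical.

```python
# Hypothetical helpers for illustration only; they are not part of G2P.
ALLELE_CODES = {"A": "1", "C": "2", "G": "3", "T": "4", "0": "0"}  # "0" = missing


def parse_ped_line(line):
    """Split one ped line into the six leading fields and a list of allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    family, individual, father, mother, sex, trait = fields[:6]
    alleles = fields[6:]
    # Consecutive allele columns form one genotype per marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family": family, "individual": individual, "father": father,
        "mother": mother, "sex": sex, "trait": trait, "genotypes": genotypes,
    }


def recode_alleles(genotypes):
    """Translate nucleotide alleles to numeric codes, keeping "0" as missing."""
    return [(ALLELE_CODES[a], ALLELE_CODES[b]) for a, b in genotypes]


record = parse_ped_line("1 33-16 0 0 0 2 0 0 A A A A A G A G A G")
numeric = recode_alleles(record["genotypes"])
```

Here `numeric` reproduces the first row of the numeric table for sample 33-16.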
For details, see http://zzz.bwh.harvard.edu/plink/data.shtml#map
Chromosome ID | Marker ID | Genetic Distance | Physical Distance |
---|---|---|---|
1 | PZB00859.1 | 0 | 157104 |
1 | PZA01271.1 | 0 | 1947984 |
1 | PZA03613.2 | 0 | 2914066 |
1 | PZA03613.1 | 0 | 2914171 |
1 | PZA03614.2 | 0 | 2915078 |
1 | PZA03614.1 | 0 | 2915242 |
1 | PZA00258.3 | 0 | 2973508 |
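Because every marker in the map file corresponds to one allele pair in the ped file, a quick consistency check can catch mismatched inputs early. The following is a small sketch under the assumption that the map file is whitespace-delimited with four columns and no header; `parse_map_lines` and `expected_ped_columns` are hypothetical names, not G2P functions.

```python
def parse_map_lines(lines):
    """Parse map rows into (chromosome, marker_id, genetic_distance, physical_position)."""
    markers = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, marker_id, genetic, physical = line.split()
        markers.append((chrom, marker_id, float(genetic), int(physical)))
    return markers


def expected_ped_columns(markers):
    """A ped row has 6 leading columns plus two allele columns per marker."""
    return 6 + 2 * len(markers)


markers = parse_map_lines([
    "1 PZB00859.1 0 157104",
    "1 PZA01271.1 0 1947984",
    "1 PZA03613.2 0 2914066",
])
n_columns = expected_ped_columns(markers)
```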
New samples are generated from the samples within each sub-population.
Sample ID | sub-Population ID |
---|---|
33-16 | 1 |
38-11 | 1 |
4226 | 1 |
4722 | 2 |
A188 | 2 |
A214N | 2 |
A239 | 2 |
A272 | 2 |
A441-5 | 2 |
A554 | 3 |
A556 | 3 |
A6 | 3 |
A619 | 3 |
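To make the role of the pop file concrete, here is a minimal sketch that groups sample IDs by sub-population ID, so that each group could serve as the source pool for its own simulated samples. It assumes a whitespace-delimited file with two columns and no header line; `group_by_subpopulation` is a hypothetical helper, not part of G2P.

```python
from collections import defaultdict


def group_by_subpopulation(lines):
    """Map each sub-population ID to the list of sample IDs assigned to it."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sample_id, pop_id = line.split()
        groups[pop_id].append(sample_id)
    return dict(groups)


groups = group_by_subpopulation([
    "33-16 1", "4226 1",
    "4722 2", "A188 2", "A214N 2",
    "A554 3",
])
```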
Each column lists the simulated QTNs for one phenotype.
Phenotype 1 | Phenotype 2 | Phenotype 3 | Phenotype 4 | Phenotype 5 |
---|---|---|---|---|
66 | 67 | 80 | 83 | 90 |
9 | 15 | 52 | 59 | 135 |
90 | 96 | 143 | 147 | 174 |
3 | 3 | 15 | 58 | 89 |
89 | 118 | 185 | 203 | 212 |
69 | 72 | 72 | 84 | 110 |
46 | 59 | 125 | 204 | 207 |
14 | 15 | 19 | 29 | 39 |
9 | 23 | 65 | 111 | 131 |
19 | 52 | 74 | 179 | 194 |
Ped: ped file
Map: map file
Path for output Ped/Map: path for output ped and map file
Block: Yes or No; if "Yes", the whole genome is divided into blocks and the blocks are exchanged to generate new samples
Number of SNPs in each block: Number of SNPs in each block
Imputation: if TRUE, major allele will be used to impute missing values
Population size: simulated sample size
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --outgen D:\data\output --rn 100 --block 4 --impute
java -jar kG2P.jar --ped /root/data/AG.ped --map /root/data/AG.map --outgen /root/data/output --rn 100 --block 4 --impute
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --outgen D:\data\output --rn 100 --block 4
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --outgen D:\data\output --rn 100 --impute
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --outgen D:\data\output --rn 100
jar: the executable jar file (kG2P.jar)
ped: ped file
map: map file
outgen: output path
block: number of SNPs in each block
rn: simulated sample size
impute: if 'impute' is added, major allele will be used to impute missing values
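The options above describe two ideas: filling missing alleles with the major allele, and building new samples from blocks of existing ones. The sketch below illustrates one plausible reading of both and is not G2P's actual implementation. It assumes that each new sample takes every block from a randomly chosen existing sample, and that missing alleles are coded "0"; `impute_major_allele` and `simulate_by_blocks` are hypothetical names.

```python
import random
from collections import Counter


def impute_major_allele(genotypes, missing="0"):
    """Replace missing alleles at each marker with that marker's most common allele."""
    n_markers = len(genotypes[0])
    imputed = [list(sample) for sample in genotypes]
    for j in range(n_markers):
        counts = Counter(a for sample in genotypes for a in sample[j] if a != missing)
        if not counts:
            continue  # marker is missing in every sample; leave it unchanged
        major = counts.most_common(1)[0][0]
        for sample in imputed:
            sample[j] = tuple(major if a == missing else a for a in sample[j])
    return imputed


def simulate_by_blocks(genotypes, n_new, block_size, rng):
    """Build each new sample by copying every block from a randomly chosen donor."""
    n_markers = len(genotypes[0])
    new_samples = []
    for _ in range(n_new):
        sample = []
        for start in range(0, n_markers, block_size):
            donor = rng.choice(genotypes)
            sample.extend(donor[start:start + block_size])
        new_samples.append(sample)
    return new_samples


founders = [
    [("A", "A"), ("0", "0"), ("G", "G"), ("A", "G")],
    [("A", "G"), ("C", "C"), ("G", "G"), ("A", "A")],
    [("A", "A"), ("C", "C"), ("A", "G"), ("A", "G")],
]
imputed = impute_major_allele(founders)
new_samples = simulate_by_blocks(imputed, n_new=5, block_size=2, rng=random.Random(1))
```

Copying whole blocks keeps the allele combinations inside each block intact, which is presumably why a block size is exposed as an option.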
Ped: ped file
Map: map file
Pop: pop file
Path for output Ped/Map: path for output ped and map file
Block: Yes or No; if "Yes", the whole genome is divided into blocks and the blocks are exchanged to generate new samples
Number of SNPs in each block: Number of SNPs in each block
Imputation: if TRUE, major allele will be used to impute missing values
Sample size of each population: sample size of each new generated population
Population size: number or vector, simulated sample size
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --block 4 --rn 100
java -jar kG2P.jar --ped /root/data/AG.ped --map /root/data/AG.map --pop /root/data/AG.pop --outgen /root/data/output --impute --block 4 --rn 100
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --block 4 --rn 100
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --rn 100
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --impute --rn 100
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --impute --block 4 --rn 100,200,300,400
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --block 4 --rn 100,200,300,400
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --rn 100,200,300,400
java -jar kG2P.jar --ped D:\data\AG.ped --map D:\data\AG.map --pop D:\data\AG.pop --outgen D:\data\output --impute --rn 100,200,300,400
jar: the executable jar file (kG2P.jar)
ped: ped file
map: map file
pop: pop file
outgen: output path
block: number of SNPs in each block
rn: simulated sample size, given as a single number or a comma-separated vector (e.g., 100,200,300,400)
impute: if 'impute' is added, major allele will be used to impute missing values
Number of Chromosomes: total number of Chromosomes for each new generated sample
Marker size of each Chromosome: vector, marker size of each Chromosome
Population size: Sample size of new generated population
Physical distance of neighbor markers: Physical distance of neighbor markers
Output ped file path: output path of ped file
Output map file path: output path of map file
java -jar kG2P.jar --sample 100 --chr 5 --marker 100,200,300,400,500 --d 500 --outgen D:\data\output
java -jar kG2P.jar --sample 100 --chr 5 --marker 100,200,300,400,500 --d 500 --outgen /root/data/output
jar: the executable jar file (kG2P.jar)
sample: simulated sample size
chr: Number of Chromosomes
marker: SNP markers for each Chromosome
d: physical distance (base pairs) between neighboring markers
outgen: output path
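As an illustration of how `--chr`, `--marker`, and `--d` relate, the sketch below builds map rows for evenly spaced markers. It is not G2P's code: the marker naming scheme, the genetic distance of 0, and the choice to place the first marker at position `d` are all assumptions, and `build_map` is a hypothetical helper.

```python
def build_map(markers_per_chr, distance):
    """Return (chromosome, marker_id, genetic_distance, physical_position) rows."""
    rows = []
    for chrom, n_markers in enumerate(markers_per_chr, start=1):
        for k in range(1, n_markers + 1):
            # Hypothetical marker IDs; positions are multiples of the spacing.
            rows.append((chrom, f"chr{chrom}_snp{k}", 0, k * distance))
    return rows


# Mirrors "--chr 5 --marker 100,200,300,400,500 --d 500" from the examples.
rows = build_map([100, 200, 300, 400, 500], distance=500)
```

The number of chromosomes is implied by the length of the marker vector, which is why the two options should agree.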
Ped file path: path of the input ped file
QTN area: the genomic region from which QTNs are selected
Range: if 'QTN area' is 'Yes', 'range' can be used to set the 'QTN area'
Distribution of QTN effects: two options, 'Normal' and 'Geometry'
Mean: mean value of the normal distribution; 'Mean' must have the same length as 'Variance'
Variance: variance of the normal distribution
Formats: phenotype formats for 'GEMMA', 'Plink', and 'FaST-LMM'
Number of simulated phenotypes: number of simulated phenotypes
Number of QTNs: number of QTNs; if it is a vector, the effects of each QTN group follow a different distribution, and nqtn, m, and v must have the same length
Heritability: heritability
Output file path: output file path
java -jar kG2P.jar --ped D:\data\AG.ped --outgen D:\data\output --rep 100 --dis geo 0.99 --h2 0.5 --nqtn 100 --QTNarea 1-500,1000-1500
java -jar kG2P.jar --ped /root/data/AG.ped --outgen /root/data/output --rep 100 --dis geo 0.99 --h2 0.5 --nqtn 100 --QTNarea 1-500,1000-1500
java -jar kG2P.jar --ped D:\data\AG.ped --outgen D:\data\Part2out --rep 100 --dis geo 0.99 --h2 0.5 --nqtn 100
java -jar .\kG2P.jar --ped D:\data\AG.ped --outgen D:\data\output --rep 10 --dis geo 0.99,0.88 --h2 0.5 --nqtn 10,20 --QTNarea 1-500,1000-1500
java -jar kG2P.jar --ped D:\data\AG.ped --outgen D:\data\Part2out --rep 100 --dis nor --m 0,0.1 --v 0.99,0.98 --h2 0.5 --nqtn 10,20 --QTNarea 1-500,1000-1500
jar: the executable jar file (kG2P.jar)
rep: number of simulated phenotypes
dis: distribution of QTN effects, two options, 'nor' and 'geo'
m: mean value of the normal distribution
v: variance of the normal distribution
QTNarea: the genomic region from which QTNs are selected
h2: heritability
nqtn: number of QTNs; if it is a vector, the effects of each QTN group follow a different distribution, and nqtn, m, and v must have the same length
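The parameters above can be tied together with a small worked sketch of phenotype simulation. It is a common textbook formulation and not necessarily what G2P does internally. The assumptions are: for `geo`, the i-th QTN has effect a^i (with a = 0.99 in the examples); for `nor`, effects are drawn from a normal distribution with the given mean and variance; and the residual variance is chosen so that Var(g) / (Var(g) + Var(e)) equals h2. All function names are hypothetical.

```python
import random
import statistics


def draw_effects(n_qtn, dis, rng, a=0.99, mean=0.0, variance=1.0):
    """Draw QTN effects from a geometric series ('geo') or a normal distribution ('nor')."""
    if dis == "geo":
        return [a ** (i + 1) for i in range(n_qtn)]
    if dis == "nor":
        return [rng.gauss(mean, variance ** 0.5) for _ in range(n_qtn)]
    raise ValueError("dis must be 'geo' or 'nor'")


def simulate_phenotype(dosages, qtn_index, effects, h2, rng):
    """dosages[i][j] is the allele count (0, 1, or 2) of sample i at marker j."""
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    # Genetic value: sum of QTN dosages weighted by their effects.
    genetic = [sum(row[j] * e for j, e in zip(qtn_index, effects)) for row in dosages]
    var_g = statistics.pvariance(genetic)
    # Residual variance that yields the requested heritability.
    var_e = var_g * (1 - h2) / h2
    sd_e = var_e ** 0.5
    return [g + rng.gauss(0, sd_e) for g in genetic]


rng = random.Random(42)
geo_effects = draw_effects(3, "geo", rng, a=0.5)
dosages = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
phenotype = simulate_phenotype(dosages, [0, 2], [1.0, 0.5], h2=0.5, rng=rng)
```

With h2 set to 1 the residual term vanishes and the phenotype equals the genetic value, which is a convenient sanity check.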
Genotype (bed/bim/fam, ped/map): select genotype file
PCA: select if you want to calculate principal components
Number of PCs: number of PCs to calculate and output
Kinship: select if you want to calculate Kinship matrix
java -jar kG2P.jar --pre "plink --bfile D:\data\AG --pca 3 --out D:\data\AG"
java -jar kG2P.jar --pre "./plink --bfile /root/data/AG --pca 3 --out /root/data/AG"
java -jar kG2P.jar --pre "./gemma -bfile /root/data/AG -gk -o testgemma"
jar: the executable jar file (kG2P.jar)
pre: command line of the software you want to run; note that the software must be placed in the same directory as kG2P.jar
Genotype (bed/bim/fam, ped/map): select genotype file
Phenotype file path: select the first phenotype file; all phenotype files in the same directory are analyzed one by one. Each phenotype file name must contain a consecutive order number, e.g., 'phenotype1.txt', 'phenotype2.txt', 'phenotype3.txt'
Results output file path: select output file path
Command: command for running GWAS on the first phenotype; user-specified covariate files and a kinship file can also be included in the command line
java -jar kG2P.jar --GWAS "plink --bfile D:\data\AG --fam D:\data\out\171104010413\Plink\Plink_snps1.fam --assoc --out D:\data\g2ptemp" --sp Plink_snps1.fam
java -jar kG2P.jar --GWAS "./plink --bfile /root/data/AG --fam /root/data/output/171104010413/Plink/Plink_snps1.fam --assoc --out /root/data/g2ptemp" --sp Plink_snps1.fam
java -jar kG2P.jar --GWAS "./gemma -bfile /root/data/AG -p /root/data/out/171104030401/GEMMA/GEMMA_phenotype1.txt -k /root/output/testgemma.cXX.txt -lmm 4 -o g2ptemp" --sp GEMMA_phenotype1.txt
jar: the executable jar file (kG2P.jar)
GWAS: command line used for running gwas of the first phenotype
sp: name of the first phenotype file; the file path is not needed
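Since the GWAS step starts from the first phenotype file and then works through the numbered files one by one, the naming convention matters. The sketch below shows one way such a series of names could be derived from the first file name. It assumes the order number sits directly before the file extension, as in the examples; `phenotype_file_series` is a hypothetical helper, not part of G2P.

```python
import re


def phenotype_file_series(first_name, count):
    """Derive `count` consecutive file names starting from the first phenotype file."""
    match = re.match(r"^(.*?)(\d+)(\.[^.]+)$", first_name)
    if match is None:
        raise ValueError("file name must end with a number followed by an extension")
    prefix, number, extension = match.groups()
    start = int(number)
    return [f"{prefix}{start + k}{extension}" for k in range(count)]


names = phenotype_file_series("GEMMA_phenotype1.txt", 3)
```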
Map file: map file
QTN file: qtn file
GWAS result files path: path of gwas results
Column number of P values: column number of P values
Output file path: output path of power/fdr evaluation results
java -jar kG2P.jar --map D:\data\AG.map --qtn D:\data\output\170106093742\qtn\test.qtn --gwas D:\data\output\Plink_snps1.qassoc --pv 9 --out D:\data\output
java -jar kG2P.jar --map /root/data/AG.map --qtn /root/data/output/171104030401/qtn/test.qtn --gwas /root/data/output/Plink_snps1.qassoc --pv 9 --out /root/data/output
java -jar kG2P.jar --map /root/data/AG.map --qtn /root/data/output/171104030401/qtn/test.qtn --gwas /root/output/GEMMA_phenotype1.assoc.txt --pv 9 --out /root/data/output
jar: the executable jar file (kG2P.jar)
map: map file
qtn: qtn file
gwas: the first GWAS result file; GWAS result files map one-to-one to the columns of the qtn file
pv: column number of P values
out: output file path
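To show what a power/FDR evaluation computes from these inputs, here is a minimal sketch and not G2P's own procedure. It assumes that QTNs are identified by 0-based marker index, that a marker counts as detected when its P value is at or below a fixed threshold, and that only exact hits on a QTN count as true positives (real evaluations often also accept markers within a window around each QTN). `power_fdr` is a hypothetical name.

```python
def power_fdr(p_values, qtn_index, threshold):
    """Return (power, FDR) for one GWAS result at a fixed P-value threshold."""
    qtn = set(qtn_index)
    significant = {i for i, p in enumerate(p_values) if p <= threshold}
    true_pos = len(significant & qtn)
    # Power: fraction of simulated QTNs that were detected.
    power = true_pos / len(qtn) if qtn else 0.0
    # FDR: fraction of detected markers that are not simulated QTNs.
    fdr = (len(significant) - true_pos) / len(significant) if significant else 0.0
    return power, fdr


p_values = [0.5, 1e-8, 0.2, 1e-6, 0.9, 1e-7]
power, fdr = power_fdr(p_values, qtn_index=[1, 4], threshold=1e-5)
```

Repeating this over a range of thresholds and over all simulated phenotypes gives the power-versus-FDR curve typically used to compare GWAS methods.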
For G2P: a citation will hopefully be available soon
For principal component analysis:
if you use plink, please cite: Purcell S, et al. "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses." American Journal of Human Genetics, 81.3(2007):559-575.
For calculating kinship matrix:
if you use gemma, please cite: Zhou X, et al. "Genome-wide Efficient Mixed Model Analysis for Association Studies." Nature Genetics, 44.7(2012):821.
Please cite all software you used for GWAS and evaluation!