Skip to content

Commit

Permalink
add gene-gene interaction scripts
Browse files Browse the repository at this point in the history
  • Loading branch information
wangx677 committed Aug 2, 2021
1 parent ec3beed commit 1a4801c
Show file tree
Hide file tree
Showing 283 changed files with 21,384 additions and 46 deletions.
63 changes: 63 additions & 0 deletions BridGE_genes/README
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
This pipeline is used to evaluate gene-gene interaction using GWAS.

STEP 1: Use "process_data_genes.sh" to extract SNPs that can be mapped to given list of gene from GWAS data.

process_data_genes.sh PROJECTDIR GENELIST GWASFILE MAPPINGDISTANCE

PROJECTDIR is the project directory where you will save intermediate and final results for your analysis.
GENELIST is a text file with each line is a gene.
GWASFILE is the gwas plink file without file extension.
MAPPINGDISTANCE defines the mapping distance between a SNP and a gene. For example, if assign each SNP to any upstream and downstream genegenes within 50kb, use input 50000.


THE FOLLOWING STEPS NEED TO BE RUN UNDER YOUR PROJECTDIR.


STEP 2: Use "run_ssM.sh" to compute mhygeSSI SNP-SNP interaction for real and randomized data.

run_ssM.sh PROJECTDIR MARGINAL N1 N2

PROJECTDIR is the project directory where you will save intermediate and final results for your analysis.
MARGINAL is used to decide which genetic interaction measurement to use. For hygeSSI: MARGINAL=1, for mhygeSSI: MARGINAL=0
N1 is the sample permutation starting number. N1=0 is the real data
N2 is the sample permutation ending number.


STEP 3: Use "prepare_snpsets.m" to generate SNP to gene mapping file.

matlab -nodisplay -nodesktop -nosplash -r "prepare_snpsets('gene_annotation',MAPPINGDISTANCE);exit" </dev/null> /dev/null


STEP 4: use "update_ssM_ld.m" to remove interactions that might caused by linkage disequilibrium (LD).

matlab -nodisplay -nodesktop -nosplash -r "update_ssM_ld('SSMFILE');exit" </dev/null> /dev/null

SSMFILE is the SNP-SNP interaction file generated from STEP 3.

Repeat this for all real and random SSMFILEs


STEP 5: use "getdensity.m" to evaluate gene-gene interactions based on interaction density and ranksum test.

matlab -nodisplay -nodesktop -nosplash -r "getdensity('SSMFILE',R);exit" </dev/null> /dev/null

SSMFILE is the SNP-SNP interaction file generated from STEP 3 without "R*.mat", for example 'ssM_mhygeSSI_alpha10.05_alpha20.05_DD'.
R is the indicator for real or randomized data. R=0 means using SSMFILE from real data, R=n means using SSMFILE from nth randomized data.

Repeat this for all real and random SSMFILEs.


STEP 6: use "density2pvalue.m" to evaluate sample permutation p-value.

matlab -nodisplay -nodesktop -nosplash -r "density2pvalue('SSMFILE',N);exit" </dev/null> /dev/null

SSMFILE is the SNP-SNP interaction file generated from STEP 3 without "R*.mat", for example 'ssM_mhygeSSI_alpha10.05_alpha20.05_DD'.
N is the total number of sample permutation. For example, N=1000

STEP 7: use "computeFDR.m" to evaluate False Discovery Rate.

matlab -nodisplay -nodesktop -nosplash -r "computeFDR('SSMFILE',N);exit" </dev/null> /dev/null &&

SSMFILE is the SNP-SNP interaction file generated from STEP 3 without "R*.mat", for example 'ssM_mhygeSSI_alpha10.05_alpha20.05_DD'.
N is the total number of sample permutation. For example, N=1000

4 changes: 4 additions & 0 deletions BridGE_genes/binarize_data.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
function binarize_data(dataFile)

bindataar(dataFile)
bindataad(dataFile)
34 changes: 34 additions & 0 deletions BridGE_genes/computeFDR.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
function computeFDR(ssmFile,samplePerms)

% Inputs
% ssmFile: SNP-SNP interaction file
% samplePerms: number of sample permutation
%
% Outputs:

for i=0:samplePerms
load(sprintf('genstats_%s_R%s.mat',ssmFile,num2str(i)))
% compute FDR
% sample permutation p-value based on average interaction scores with filtering out gene-gene interactions with ranksum p>0.05
% one tailled test
fdr_density_protective = mafdr([bpm_pv_density{1} wpm_pv_density{1}],'BHFDR',true);
fdr_density_risk = mafdr([bpm_pv_density{2} wpm_pv_density{2}],'BHFDR',true);

% two tailed test
fdr_tmp = mafdr([bpm_pv_density{1} wpm_pv_density{1} bpm_pv_density{2} wpm_pv_density{2}],'BHFDR',true);
fdr_density_protective_both = fdr_tmp(1:length(fdr_tmp)/2);
fdr_density_risk_both = fdr_tmp(length(fdr_tmp)/2+1:end);

% compute FDR
% sample permutation p-value based on average interaction scores and ranksum score
% one tailled test
fdr_density_and_ranksum_protective = mafdr([bpm_pv_density_and_ranksum{1} wpm_pv_density_and_ranksum{1}],'BHFDR',true);
fdr_density_and_ranksum_risk = mafdr([bpm_pv_density_and_ranksum{2} wpm_pv_density_and_ranksum{2}],'BHFDR',true);

% two tailed test
fdr_tmp = mafdr([bpm_pv_density_and_ranksum{1} wpm_pv_density_and_ranksum{1} bpm_pv_density_and_ranksum{2} wpm_pv_density_and_ranksum{2}],'BHFDR',true);
fdr_density_and_ranksum_protective_both = fdr_tmp(1:length(fdr_tmp)/2);
fdr_density_and_ranksum_risk_both = fdr_tmp(length(fdr_tmp)/2+1:end);

save(sprintf('genstats_%s_R%s.mat',ssmFile,num2str(i)),'fdr*','-append')
end
50 changes: 50 additions & 0 deletions BridGE_genes/density2pvalue.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
function density2pvalue(ssmFile,samplePerms)

% Inputs
% ssmFile: SNP-SNP interaction file
% samplePerms: number of sample permutation
%
% Outputs:
% bpm_pv, wpm_pv: permutation p-values

for i=0:samplePerms
load(sprintf('genstats_%s_R%s.mat',ssmFile,num2str(i)))
for tt=1:2
bpm_density_all{tt}(i+1,:) = BPM_density{tt};
wpm_density_all{tt}(i+1,:) = WPM_density{tt};

bpm_local_all{tt}(i+1,:) = bpm_local{tt};
wpm_local_all{tt}(i+1,:) = wpm_local{tt};
end
end

load BPMind.mat

for i=1:samplePerms+1
for tt=1:2
ind = setdiff(1:samplePerms+1,i);
bpm_pv_density{tt} = (sum(bpm_density_all{tt}(ind,:)>=bpm_density_all{tt}(i,:))+1)/(samplePerms+1);
wpm_pv_density{tt} = (sum(wpm_density_all{tt}(ind,:)>=wpm_density_all{tt}(i,:))+1)/(samplePerms+1);

bpm_pv_density_and_ranksum{tt} = (sum(bpm_density_all{tt}(ind,:)>=bpm_density_all{tt}(i,:) & bpm_local_all{tt}(ind,:)>=bpm_local_all{tt}(i,:))+1)/(samplePerms+1);
wpm_pv_density_and_ranksum{tt} = (sum(wpm_density_all{tt}(ind,:)>=wpm_density_all{tt}(i,:) & wpm_local_all{tt}(ind,:)>=wpm_local_all{tt}(i,:))+1)/(samplePerms+1);


% assign 1 to pv when the density is nan
bpm_pv_density{tt}(isnan(bpm_density_all{tt}(i,:))) = 1;
wpm_pv_density{tt}(isnan(wpm_density_all{tt}(i,:))) = 1;

bpm_pv_density_and_ranksum{tt}(isnan(bpm_density_all{tt}(i,:))) = 1;
wpm_pv_density_and_ranksum{tt}(isnan(wpm_density_all{tt}(i,:))) = 1;

% if BPM size is less than 5x5, set pv to be 1
bpm_pv_density{tt}(find(BPM.ind1size<5 | BPM.ind2size<5)) = 1;
bpm_pv_density_and_ranksum{tt}(find(BPM.ind1size<5 | BPM.ind2size<5)) = 1;

% if ranksum test p-value is >0.05, set bpm/wpm_pv_density to be 1
bpm_pv_density{tt}(find(bpm_local_all{tt}(i,:)<-log10(0.05))) = 1;
wpm_pv_density{tt}(find(wpm_local_all{tt}(i,:)<-log10(0.05))) = 1;
end

save(sprintf('genstats_%s_R%s.mat',ssmFile,num2str(i-1)),'bpm_pv*','wpm_pv*','-append')
end
77 changes: 77 additions & 0 deletions BridGE_genes/evaluate_results.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
projectdirs={'project_ALS_NINDS_phs000101_OmniExpress_GOIset1','project_ALS_NINDS_phs000101_HumanHap_GOIset1','project_ALS_Finland_phs000344_GOIset1','project_ALS_Irish_phs000127_GOIset1'};
fdrcutoff=0.25;
outputdir='/project/csbio/wwang/BridGE/results_collection/ALS';

% validcutoff=1:(length(projectdirs)-1);
% tail={'one','two'};
% GI={'mhygeSSI','hygeSSI'};
% fdrcutoff = 0.05:0.05:0.25;
m=1;

for fdrcutoff = 0.05:0.05:0.25
for GI = {'mhygeSSI','hygeSSI'};
for tail = {'one','two'}
for validcutoff=1:(length(projectdirs)-1);
for R=0:1000
[number_GI_risk(R+1), number_GI_validated_risk(R+1), number_GI_protective(R+1), number_GI_validated_protective(R+1)] = summarize_result_gene(projectdirs,GI{1},R,fdrcutoff,outputdir,validcutoff,tail{1});
if R==0 & number_GI_validated_risk(1) ==0 & number_GI_validated_protective(1)==0
break
end
end
GI_all(m) = GI;
tail_all(m) = tail;
validcutoff_all(m) = validcutoff;
fdrcutoff_all(m) = fdrcutoff;

number_GI_discovered_risk_all(m) = number_GI_risk(1);
number_GI_validated_risk_all(m) = number_GI_validated_risk(1);
if number_GI_discovered_risk_all(m)>0
pv_risk(m) = nnz(number_GI_validated_risk(2:end)>=number_GI_validated_risk(1))/(length(number_GI_validated_risk)-1);
else
pv_risk(m)=1;
end

number_GI_discovered_protective_all(m) = number_GI_protective(1);
number_GI_validated_protective_all(m) = number_GI_validated_protective(1);
if number_GI_validated_protective_all(m)>0
pv_protective(m) = nnz(number_GI_validated_protective(2:end)>=number_GI_validated_protective(1))/(length(number_GI_validated_protective)-1);
else
pv_protective(m) = 1;
end

number_GI_discovered_combined(m) = number_GI_risk(1) + number_GI_protective(1);
number_GI_validated_combined_all(m) = number_GI_validated_risk(1)+number_GI_validated_protective(1);
if number_GI_validated_combined_all(m)>0
pv_combined(m) = nnz((number_GI_validated_risk(2:end)+number_GI_validated_protective(2:end))>=(number_GI_validated_risk(1)+number_GI_validated_protective(1)))/(length(number_GI_validated_risk)-1);
else
pv_combined(m)=1;
end
clear number_GI_risk number_GI_validated_risk number_GI_protective number_GI_validated_protective
m = m+1
end
end
end
end


GI = reshape(GI_all,length(GI_all),1);
fdrcutoff = reshape(fdrcutoff_all,length(fdrcutoff_all),1);
tail = reshape(tail_all,length(tail_all),1);
validcutoff = reshape(validcutoff_all,length(validcutoff_all),1);

NO_GI_risk_discover = reshape(number_GI_discovered_risk_all,length(number_GI_discovered_risk_all),1);
NO_GI_risk_valid = reshape(number_GI_validated_risk_all,length(number_GI_validated_risk_all),1);
pv_risk = reshape(pv_risk,length(pv_risk),1);

NO_GI_protective_discover = reshape(number_GI_discovered_protective_all,length(number_GI_discovered_protective_all),1);
NO_GI_protective_valid = reshape(number_GI_validated_protective_all,length(number_GI_validated_protective_all),1);
pv_protective = reshape(pv_protective,length(pv_protective),1);

NO_GI_combined_disocver = reshape(number_GI_discovered_combined,length(number_GI_discovered_combined),1);
NO_GI_combined_valid = reshape(number_GI_validated_combined_all,length(number_GI_validated_combined_all),1);
pv_combined = reshape(pv_combined,length(pv_combined),1);


output = table(GI,fdrcutoff,validcutoff,tail,NO_GI_protective_discover,NO_GI_protective_valid,pv_protective,NO_GI_risk_discover,NO_GI_risk_valid,pv_risk,NO_GI_combined_disocver,NO_GI_combined_valid,pv_combined);
writetable(output,'/project/csbio/wwang/BridGE/results_collection/ALS/results_GOI_summary.xls')

94 changes: 94 additions & 0 deletions BridGE_genes/generate_results.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
dirs = {'project_PD_NGRC_pbody_0kb','project_PD_NGRC_pbody_50kb','project_PD_NGRC_pbody_merge_0kb','project_PD_NGRC_pbody_merge_50kb','project_PD_Simon_pbody_0kb','project_PD_Simon_pbody_50kb','project_PD_Simon_pbody_merge_0kb','project_PD_Simon_pbody_merge_50kb'};

names = {'NGRC_0kb','NGRC_50kb','NGRC_merge_0kb','NGRC_merge_50kb','NIA2_0kb','NIA2_50kb','NIA2_merge_0kb','NIA2_merge_50kb'};

model = {'RR','DD','RD','combined'};

% summarize the results based on FDR<=0.25
for i=1:length(dirs)
load(sprintf('/project/csbio/wwang/BridGE/%s/BPMind.mat',dirs{i}));
for j=1:length(model)
load(sprintf('/project/csbio/wwang/BridGE/%s/results_pbody_%s.mat',dirs{i},model{j}));
for tt=1:2
fdr_summary{tt}(j,i) = nnz(fdr_combined{tt}<0.255);
end
end
end

fdr_summary_protective = array2table(fdr_summary{1},'RowNames',model,'VariableNames',names);
fdr_summary_risk = array2table(fdr_summary{2},'RowNames',model,'VariableNames',names);
cd('/project/csbio/wwang/BridGE/scripts_PD_pbody');
writetable(fdr_summary_protective,'pbody_fdr_summary.xls','WriteRowNames',true,'Sheet',1)
writetable(fdr_summary_risk,'pbody_fdr_summary.xls','WriteRowNames',true,'Sheet',2)

% pull all interactions with FDR<0.25 (risk only)
load /project/csbio/wwang/BridGE/project_PD_NGRC_pbody_50kb/BPMind.mat
genes1 = [WPM.pathway(BPM.path1idx);WPM.pathway];
genes2 = [WPM.pathway(BPM.path2idx);WPM.pathway];
clear BPM WPM

genes1_genes2 = cellfun(@(x,y)sprintf('%s_%s',x,y),genes1,genes2,'uniform',0);

pbody_GI_table = nan(length(genes1),24);
model_table = repmat({''},length(genes1),length(dirs));

for i=1:length(dirs)
load(sprintf('/project/csbio/wwang/BridGE/%s/BPMind.mat',dirs{i}));
tmp_genes1 = [WPM.pathway(BPM.path1idx);WPM.pathway];
tmp_genes2 = [WPM.pathway(BPM.path2idx);WPM.pathway];
tmp_genes1_genes2 = cellfun(@(x,y)sprintf('%s_%s',x,y),tmp_genes1,tmp_genes2,'uniform',0);

for j=1:length(model)
load(sprintf('/project/csbio/wwang/BridGE/%s/results_pbody_%s.mat',dirs{i},model{j}));
tmp_density(:,j) = [BPM_density{2} WPM_density{2}]';
tmp_pv(:,j) = [BPM_density_pv{2} WPM_density_pv{2}]';
tmp_fdr(:,j) = fdr_combined{2}';
end

[tmpfdr,idx]=min(tmp_fdr,[],2);

for k=1:length(idx)
maxdensity(k) = tmp_density(k,idx(k));
minpv(k) = tmp_pv(k,idx(k));
minfdr(k) = tmp_fdr(k,idx(k));
minmodel{k} = model{idx(k)};
end

[tmp idxa idxb] = intersect(genes1_genes2,tmp_genes1_genes2);
pbody_GI_table(idxa,3*(i-1)+1) = round(maxdensity(idxb),2);
pbody_GI_table(idxa,3*(i-1)+2) = minpv(idxb);
pbody_GI_table(idxa,3*(i-1)+3) = round(minfdr(idxb),2);

model_table(idxa,i) = minmodel(idxb);
clear maxdensity minpv minfdr minmodel tmp*
end

pbody_GI_table = [array2table(genes1) array2table(genes2) array2table(pbody_GI_table,'VariableNames',{'NGRC_0kb_density','NGRC_0kb_pv','NGRC_0kb_FDR','NGRC_50kb_density','NGRC_50kb_pv','NGRC_50kb_FDR','NGRC_merge_0kb_density','NGRC_merge_0kb_pv','NGRC_merge_0kb_FDR','NGRC_merge_50kb_density','NGRC_merge_50kb_pv','NGRC_merge_50kb_FDR','NIA2_0kb_density','NIA2_0kb_pv','NIA2_0kb_FDR','NIA2_50kb_density','NIA2_50kb_pv','NIA2_50kb_FDR','NIA2_merge_0kb_density','NIA2_merge_0kb_pv','NIA2_merge_0kb_FDR','NIA2_merge_50kb_density','NIA2_merge_50kb_pv','NIA2_merge_50kb_FDR'})];


% filtering
NGRC_pv_sum = sum([pbody_GI_table.NGRC_0kb_pv pbody_GI_table.NGRC_50kb_pv pbody_GI_table.NGRC_merge_0kb_pv pbody_GI_table.NGRC_merge_50kb_pv]<=0.05,2);
NIA2_pv_sum = sum([pbody_GI_table.NIA2_0kb_pv pbody_GI_table.NIA2_50kb_pv pbody_GI_table.NIA2_merge_0kb_pv pbody_GI_table.NIA2_merge_50kb_pv]<=0.05,2);

NGRC_FDR_sum = sum([pbody_GI_table.NGRC_0kb_FDR pbody_GI_table.NGRC_50kb_FDR pbody_GI_table.NGRC_merge_0kb_FDR pbody_GI_table.NGRC_merge_50kb_FDR]<0.255,2);
NIA2_FDR_sum = sum([pbody_GI_table.NIA2_0kb_FDR pbody_GI_table.NIA2_50kb_FDR pbody_GI_table.NIA2_merge_0kb_FDR pbody_GI_table.NIA2_merge_50kb_FDR]<0.255,2);

validation = ((NGRC_FDR_sum>=1 & NIA2_pv_sum>=1) | (NIA2_FDR_sum>=1 & NGRC_pv_sum>=1));

ind = find(NGRC_pv_sum>=1 | NIA2_pv_sum>=1);

pbody_GI_table = [pbody_GI_table table(NGRC_pv_sum) table(NIA2_pv_sum) table(NGRC_FDR_sum) table(NIA2_FDR_sum) table(validation)];

pbody_GI_table_filter_pv = pbody_GI_table(ind,:);
model_table = array2table(model_table,'VariableNames',names);
model_table_filter_pv = model_table(ind,:);

ind = find(NGRC_FDR_sum>=1 | NIA2_FDR_sum>=1);

pbody_GI_table_filter_FDR = pbody_GI_table(ind,:);
model_table_filter_FDR = model_table(ind,:);

save generate_results pbody_GI_table pbody_GI_table_filter* model_table model_table_filter*

writetable(pbody_GI_table_filter_pv,'pbody_GI_table_filter_pv.xls')
writetable(pbody_GI_table_filter_FDR,'pbody_GI_table_filter_FDR.xls')
65 changes: 65 additions & 0 deletions BridGE_genes/genstats_GOI.m
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
function [bpm_local bpm_local_pv density_bpm density_bpm_expected wpm_local wpm_local_pv density_wpm density_wpm_expected denseidx path_degree path_degree_pv] = genstats(ssmFile,bpmindFile,binaryNetwork,snpPerms,minPath,netDensity)

% FUNCTION [bpm_local density_bpm density_bpm_expected wpm_local density_wpm density_wpm_expected denseidx path_degree] = genstats(ssmFile,bpmindFile,minPath,binaryNetwork,netDensity)
%
% GENSTATS calculates general statistics for BPM (chi-square global,chi-square local, density), WPM (chi-square global,chi-square local, density) and
% PATH (ranksum)
%
% INPUTS:
% ssmFile - SNP-SNP interaction file name
% bpmindFile - a mat-file that includes SNP indexes of pathways for all possbile BPMs, WPMs
% binaryNetwork - 1: binary network; 0: weighted network
% snpPerms - number of SNP permutation
% netDensity - network binarization density based on considering both protecitve/risk interactions
%
% OUTPUTS:
% An output file named genstats_<ssmFile>.mat or genstats_<ssmFile>_density<netDensity>.mat
%

if (nargin > 6)
error('Incorrect number of input arguments!');
end

load(bpmindFile)

load(sprintf('%s.mat',ssmFile))
[p q] = size(ssM{1});
if min(p,q) == 1;
for tt=1:2
ssM{tt} = squareform(ssM{tt});
end
end

[p q] = size(ssM{1});
if binaryNetwork==1 & exist('netDensity')==1
tmp = max(ssM{1},ssM{2});
tmpcut = quantile(tmp(:),1-netDensity);
for tt=1:2
ssM{tt} = ssM{tt}>=tmpcut;
end

clear tmp tmpcut
elseif binaryNetwork==1 & exist('netDensity')~=1
for tt=1:2
ssM{tt} = ssM{tt}>0;
end
netDensity = nnz(max(ssM{1},ssM{2}))/(p*q)
end

% analyze protective and risk effect individually
bpmsize = BPM.size;
wpmsize = WPM.size;
pathsize = WPM.indsize;
for tt=1:2
[bpm_local{tt} bpm_local_pv{tt} density_bpm{tt} density_bpm_expected{tt} denseidx{tt} wpm_local{tt} wpm_local_pv{tt} density_wpm{tt} density_wpm_expected{tt} path_degree{tt} path_degree_pv{tt}] = rungenstats(full(ssM{tt}),BPM.ind1,BPM.ind2,BPM.ind1size,BPM.ind2size,bpmsize,binaryNetwork,snpPerms,minPath,WPM.ind,wpmsize,pathsize);
end

outputFile = sprintf('genstats_%s.mat',ssmFile);

if binaryNetwork==1
save(outputFile,'bpm_local*','wpm_local*','path_degree*','denseidx','density_*','netDensity','-v7.3')
else
save(outputFile,'bpm_local*','wpm_local*','path_degree*','denseidx','density_*','-v7.3')
end

end
Loading

0 comments on commit 1a4801c

Please sign in to comment.