Functions to visualise linkage disequilibrium analysis with scatter plots
Creates a scatter plot of results from linkage disequilibrium analysis. Specifically, these functions use results from plink analysis using version 1.07 and the command line argument:
$ ./plink --indep <window size>['kb'] <step size (variant ct)> <VIF threshold>
which outputs these files:
filename.prune.in
filename.prune.out
All are plink files.
filename.prune.in
filename.prune.out
filename.map
I used a dataset for alzheimers disease (adgwas), the plink files for which can be found here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
Generates x # of png images using the holoviews library with the 'bokeh' backend to generate scatter plots and used 'matplotlib' without holoviews to generate boxplots.
Use filename prefix only in file paths. Plink automatically includes .prune.in, .prune.out, and .map extensions
ldPath = '~/path/to/ldfiles/filePrefixOnly'
mapPath = '~/path/to/.mapfile/filePrefixOnly'
df_snps_in, df_snps_out = create_df_from_plink_ld(ldPath,mapPath)
The division argument will create the number of base pairs each image will cover. The adgwas data covers a lot of the genome so the default is set 400. Meaning 400 images will be generated with totalSNPs//400 bases pairs along the x axis in each image.
Use output from previous function for next step:
imagesPath = '~/path/to/images/folder'
create_scatter_plot_for_plink_ld(df_snps_in,df_snps_out,imagesPath)
Visualize the two distribution of distances for SNP's that will be kept and pruned
Input: dataframes from step 1
imagesPath = '~/path/to/images/folder'
create_scatter_plot_for_plink_ld(df_snps_in,df_snps_out,imagesPath)