This repository contains the code to reproduce the analysis for the manuscript
by Zane Kliesmete, Peter Orchard, Victor Yan Kin Lee, Johanna Geuder, Simon M. Krauß, Mari Ohnuki, Jessica Jocher, Beate Vieth, Wolfgang Enard, Ines Hellmann
The data necessary to reproduce this analysis can be found on ArrayExpress:
Accession | Dataset |
---|---|
E-MTAB-13494 | RNA-seq data from human and cynomolgus macaque |
E-MTAB-13373 | ATAC-seq data from human and cynomolgus macaque |
As part of the study, we use published data to quantify the Pleiotropic Degree (PD) for nearly 0.5 million CREs accessible in at least one of the following nine human fetal tissues: adrenal gland, brain, heart, kidney, large intestine, lung, muscle, stomach and thymus. We furthermore associate these CREs with expressed genes in the respective tissue and model the importance of different CRE properties for gene expression levels using a mixed-effects linear model. The relevant analysis scripts for this part, which underlie Figure 1 and Supplemental Figures S1 and S2, are the following:
- DHS peak calling
- DHS peak filtering
- DHS peak analyses
- Expression data preparation
- CRE to gene association
- Mixed-effects model fitting and permutation
- Generate Figure 1

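
To make the PD concept concrete, here is a minimal sketch. It assumes that PD is simply the number of the nine tissues in which a CRE is accessible. The CRE identifiers, the toy accessibility sets and the function name `pleiotropic_degree` are hypothetical and do not come from the repository scripts.

```python
from collections import Counter

# The nine human fetal tissues named in this README.
TISSUES = [
    "adrenal gland", "brain", "heart", "kidney", "large intestine",
    "lung", "muscle", "stomach", "thymus",
]


def pleiotropic_degree(accessible_in):
    """Count the distinct tissues in which a CRE is accessible.

    Assumption: PD equals the number of tissues with an accessible peak.
    """
    tissues = set(accessible_in)
    unknown = tissues - set(TISSUES)
    if unknown:
        raise ValueError(f"unknown tissues: {sorted(unknown)}")
    return len(tissues)


# Hypothetical toy input: CRE id -> tissues in which the CRE is accessible.
cre_accessibility = {
    "CRE_1": {"brain"},
    "CRE_2": {"brain", "heart", "lung"},
    "CRE_3": set(TISSUES),
}

pd_by_cre = {cre: pleiotropic_degree(t) for cre, t in cre_accessibility.items()}

# Number of CREs observed at each PD value.
pd_distribution = Counter(pd_by_cre.values())
```

In this toy example, `pd_by_cre` maps each CRE to a value between 1 and 9, and `pd_distribution` gives the number of CREs per PD class.
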
In this study, we generate data on human and cynomolgus macaque gene expression and accessibility. To have comparable annotation between species, we use liftOff to generate a GTF file for the macFas6 genome. We then process the RNA-seq data using our tool zUMIs and the ATAC-seq data using Genrich, and investigate differential expression and accessibility associated with CREs of different PDs (Figure 3, Supplemental Figure S3).
- Run liftOff
- Process liftOff output
- Analyse cross-species gene expression
- Identify orthologous peaks
- Re-analyse CRE PD and activity conservation across mammals from Roller et al. 2021
- Do the integrated analyses

We use INSIGHT to quantify the selection acting on the CREs, both between the human MRCA and the outgroup MRCA and within humans, for each PD class and its subsets. We also use phyloP and phastCons to investigate CRE conservation across 10 primate species (Figures 4, 5 and Supplemental Figure S4).
- Run sequence conservation methods
- Summarize sequence conservation

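
As an illustration of what summarizing conservation per PD class can look like, the sketch below averages one conservation score per CRE within each PD class. The records, the scores and the function name `summarize_by_pd` are hypothetical. The repository scripts may use different statistics and inputs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (CRE id, PD class, mean per-base conservation score).
records = [
    ("CRE_1", 1, 0.10),
    ("CRE_2", 1, 0.30),
    ("CRE_3", 9, 0.80),
    ("CRE_4", 9, 0.60),
]


def summarize_by_pd(rows):
    """Return the mean conservation score for each PD class."""
    scores = defaultdict(list)
    for _cre_id, pd_class, score in rows:
        scores[pd_class].append(score)
    # Sort by PD class so the summary is ordered from PD 1 upwards.
    return {pd_class: mean(values) for pd_class, values in sorted(scores.items())}


summary = summarize_by_pd(records)
```
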
We quantified the TFBS repertoire and its conservation between human and cynomolgus macaque across >90% of all CREs in this study. First, sequences were extracted and provided to Cluster-Buster along with position weight matrices of expressed TFs to identify their binding positions. For each CRE, its repertoire similarity between species was measured. Furthermore, all binding sites in orthologous sequences were aligned between species and their positional conservation was quantified. The most important scripts to generate Figures 2, 6 and Supplemental Figures S5, S7 are listed below; more intermediate processing scripts can be found here.
- Extract orthologous sequences for TFBS analyses
- Quantify TFBS repertoire across PDs
- Combine different conservation measures
- Re-analyse TFBS conservation across mammals from Ballester et al. 2014
- Generate main figure

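
One simple way to express repertoire similarity between the orthologous sequences of a CRE is the Jaccard index over the sets of TFs with predicted binding sites. This is an assumption made for illustration, since this README does not state which similarity measure the scripts use. The TF sets and the function name `jaccard_similarity` are likewise hypothetical.

```python
def jaccard_similarity(tfs_a, tfs_b):
    """Jaccard index between two TF repertoires: |A & B| / |A | B|."""
    a, b = set(tfs_a), set(tfs_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty repertoires are identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical TFs with predicted binding sites in one orthologous CRE pair.
human_tfs = {"CTCF", "SP1", "YY1"}
macaque_tfs = {"CTCF", "SP1", "KLF4"}

similarity = jaccard_similarity(human_tfs, macaque_tfs)
```

Here two of the four distinct TFs are shared, so `similarity` is 0.5. A set-based measure like this ignores binding site positions, which is why positional conservation is quantified separately.
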
Finally, we visualized the pleiotropic ATAXIN-3 gene promoter as a representative example of low conservation of sequence and TFBS position but high functional conservation in terms of TFBS repertoire, CRE accessibility and downstream gene expression (Figure 7).
Throughout the workflow, we use the job scheduling system slurm (v0.4.3).