Within species conservation

I hypothesize that promoter shapes are conserved within species. In other words, I want to show that, at a given position, the values of a given shape parameter are consistent across different promoters and do not differ significantly.

I want to show that shapes are important.

Visual examination

When looking at the shapes of randomly sampled regions from the genome, we do not see any pattern in the mean estimates. However, when we look at the shapes of sampled promoters, we see specific patterns that predominate, such as peaks in the -30 bp region or at the Transcription Start Site. Please look at the collection of species-wise promoter shapes here: shape plots.

Shape exaggeration

We propose that in promoters the shape is conserved more than the nucleotide composition. In some regions we see no sequence motifs and no conservation of nucleotide composition, yet we still see conservation in shape (and therefore hypothesize that there is a selection pressure on the shape specifically). Even though DNA shape does depend on the underlying nucleotide composition, we believe that the same shape parameters can be reached by varying sequences. Therefore, while the selection pressure on the nucleotide composition is not that strong, the pressure on the DNA shape is higher.

We test this by comparing the shapes of real promoters with the shapes of promoters generated using hmm emit. The generated promoters have the same average nucleotide composition and residual cross-nucleotide dependence. When there is a significant difference between the raw and the hmm shape, we say that the real shape is exaggerated (and therefore important). The significance of the shape difference is tested using a bootstrap (resampling 10% of the population with replacement, 90% confidence intervals).
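The bootstrap comparison at a single position can be sketched as follows. This is a minimal illustration, not the code used for the results: it assumes the test is run on the difference in mean shape value between real and generated promoters, and that a position counts as exaggerated when the 90% interval of that difference excludes zero. The function names (`percentile`, `bootstrap_diff_ci`, `is_exaggerated`) and the default of 1000 bootstrap rounds are hypothetical.

```python
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a pre-sorted list, with q in [0, 1]."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_diff_ci(real, generated, fraction=0.10, n_boot=1000,
                      level=0.90, seed=0):
    """Confidence interval for mean(real) - mean(generated) at one position.

    real, generated: shape values at one position, one value per promoter.
    fraction: share of each population drawn (with replacement) per round.
    """
    rng = random.Random(seed)
    k_real = max(1, round(fraction * len(real)))
    k_gen = max(1, round(fraction * len(generated)))
    diffs = []
    for _ in range(n_boot):
        r = rng.choices(real, k=k_real)        # sampling with replacement
        g = rng.choices(generated, k=k_gen)
        diffs.append(sum(r) / k_real - sum(g) / k_gen)
    diffs.sort()
    alpha = (1 - level) / 2
    return percentile(diffs, alpha), percentile(diffs, 1 - alpha)


def is_exaggerated(real, generated, **kwargs):
    """Assumed criterion: the interval of the difference excludes zero."""
    low, high = bootstrap_diff_ci(real, generated, **kwargs)
    return low > 0 or high < 0
```

Running `is_exaggerated` once per position, property, and species would give the per-position significance calls described above. Other criteria, such as non-overlapping intervals computed separately for the raw and hmm shapes, are equally plausible readings of the test.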

Choosing bootstrap parameters

I ran the bootstrap tests at a 90% confidence level across different sampling percentages: 1%, 10%, 50%, and 100% of the total population of promoters. Then I counted the number of significant differences detected along the 401 positions for each species-property combination. I visualised how the cumulative significance count (significance summed across all properties) changes as the sampling percentage grows. As we discussed, the significance count grows dramatically when the full (100%) population is used for bootstrapping. Overall, a 10% sample size (full square of a population) should be a good balance for detecting significant differences.
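The sweep over sampling percentages can be sketched like this. It is an illustrative outline under the same assumptions as before (bootstrap on the difference of means, significance when the 90% interval excludes zero). The function names, the nearest-rank interval bounds, and the 200 bootstrap rounds are hypothetical choices made to keep the example short.

```python
import random


def significant_positions(real, generated, fraction, n_boot=200,
                          level=0.90, seed=0):
    """Count positions where real and generated mean shapes differ.

    real, generated: lists of promoters, each a list of per-position
    values for one shape property (e.g. 401 values per promoter).
    """
    rng = random.Random(seed)
    n_pos = len(real[0])
    k_real = max(1, round(fraction * len(real)))
    k_gen = max(1, round(fraction * len(generated)))
    # Nearest-rank approximation of the interval bounds.
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = (n_boot - 1) - lo_idx
    count = 0
    for pos in range(n_pos):
        real_col = [p[pos] for p in real]
        gen_col = [p[pos] for p in generated]
        diffs = sorted(
            sum(rng.choices(real_col, k=k_real)) / k_real
            - sum(rng.choices(gen_col, k=k_gen)) / k_gen
            for _ in range(n_boot)
        )
        if diffs[lo_idx] > 0 or diffs[hi_idx] < 0:
            count += 1
    return count


def sweep(real, generated, fractions=(0.01, 0.10, 0.50, 1.00)):
    """Significance count for each sampling fraction."""
    return {f: significant_positions(real, generated, f) for f in fractions}
```

Larger resamples give narrower intervals for the mean difference, so more positions are flagged as the fraction grows, which is consistent with the jump seen at 100%. Summing the returned counts over all properties for one species would give the cumulative significance count.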

image

The breakdown of significance areas is in the csv file.

It was interesting to look at which properties showed the highest significance counts across species. Here are the graphs for 10% and 50% resampling. For all species, minor groove width and slide consistently dominate the significance count. This is particularly exciting because there is a strong biological explanation, and it could serve as a reference point for showing shape conservation across species.

image

Is shape exaggeration conserved among species?

To test whether shape is conserved within species, we can correlate the extent of shape exaggeration