-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathRandy_R_Assignment.qmd
365 lines (244 loc) · 14 KB
/
Randy_R_Assignment.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
---
title: "R_ASSIGNMENT"
author: "Randy"
format: html
editor: visual
---
# R_codes_for_data_inspection_and_processing.
## Setting suitable R environment.
### Instructing R to display the code with the output after running the code.
```{r setup, include=FALSE}
options(knitr.echo = TRUE)
```
For easy understanding and backtracking.
### Loading the needed libraries for the project at hand
```{r}
library(tidyverse)
setwd('.')
```
Tidyverse was package loaded and the current directory was set as the working directory.
### Loading the files for the project
```{r}
Fang_genotypes <- read.table("https://raw.githubusercontent.com/EEOB-BioData/BCB546-Spring2021/main/assignments/UNIX_Assignment/fang_et_al_genotypes.txt", header = TRUE, sep = "\t")
SNPs <- read.table("https://raw.githubusercontent.com/EEOB-BioData/BCB546-Spring2021/main/assignments/UNIX_Assignment/snp_position.txt", header = TRUE, sep = "\t")
```
**Expected output:** these codes helped assign names to the files to be used: Fang_genotypes and SNPs accordingly
## Part 1: Data inspection of files
### Exploring the size, structure, and class of both files
```{r}
# For size as in the dimension of the files
dim(Fang_genotypes)
dim(SNPs)
# For visualizing the columns from the output of the code above;
view(Fang_genotypes[1:10, 1:10])
view(SNPs[1:10, 1:10])
# For the 'typeof' of both files
typeof(Fang_genotypes)
typeof(SNPs)
# For the class of both files
class(Fang_genotypes)
class(SNPs)
# For the structure of both files
str(Fang_genotypes)
str(SNPs)
```
**Expected output:** The above codes read 2782 rows with 986 columns from the Fang_genotype file, which is a list embedded as a dataframe. Additionally, 983 rows with 15 columns were read from the SNPs file, which is also a list embedded as a dataframe.
## Part 2: Processing of files.
### Reloading required libraries for file processing "dplyr" package
```{r}
library(dplyr)
```
### Sorting out the files based on the common column; preparation for joining both Fang_genotypes and SNPs desired contents.
```{r}
# For the Fang_genotype file, column names we viewed
colnames(Fang_genotypes)
```
```{r}
# For the Fang_genotype file, the sorting was done on the column "Group"
# Filter out the maize data and assign variable names as Maize_Group and Teosinte_Group grom the Fang_genotypes file
Maize_Group <- filter(Fang_genotypes, Group == 'ZMMIL' | Group == 'ZMMLR' | Group == 'ZMMMR')
Teosinte_Group <- filter(Fang_genotypes, Group == 'ZMPBA' | Group == 'ZMPIL' | Group == 'ZMPJA')
```
New files were created by filtering out our desired maize and teosinte group names using the filter command.
### Structure of filtered Maize_Group and Teosinte_Group files.
```{r}
# discovering the structure of the new Maize and Teosinte group files.
str(Maize_Group)
str(Teosinte_Group)
```
### Transposing files: Maize_Group and Teosinte_Group
```{r}
# package needed for transposing "tibble or data.table" packages, don't run run this code if you have any of these packages installed.
install.packages(tibble)
```
This installs the tibble package used for transposing the filtered files.
```{r}
library(tibble)
```
Load the tibble package that has been installed. You can either load data.table if that was your choice.
```{r}
view(Maize_Group <- column_to_rownames(Maize_Group, var = "Sample_ID"))
```
Here we reassign "Sample_ID" as a variable name for the column generated using the names of columns from the Maize_Group.
```{r}
# Generate a new file from Maize_Group
view(Transposed_Maize <- t(Maize_Group) %>% as.data.frame() %>% rownames_to_column(., var = "SNP_ID"))
# here we remove the first three rows as to get it ready to merge the file
view(Transposed_Maize1 <- Transposed_Maize[3:nrow(Transposed_Maize),])
# Merge the files based on matching SNP_ID values
view(Merged_SNPs_TranMaize <- merge(SNPs, Transposed_Maize1, by = "SNP_ID"))
# Now reorder the columns starting with SNP_ID,chromosome and position
view(Merged_SNPs_TranMaize1 <- select(Merged_SNPs_TranMaize, SNP_ID, Chromosome, Position, everything()))
# Double check the result here
View(Merged_SNPs_TranMaize1)
# Replicate the same for teosinte using the same code but generating different files and using the following codes:
Teosinte_Group1 <- column_to_rownames(Teosinte_Group, var = "Sample_ID")
Transposed_Teosinte <- t(Teosinte_Group1) %>% as.data.frame() %>% rownames_to_column(., var = "SNP_ID")
Transposed_Teosinte1 <- Transposed_Teosinte[3:nrow(Transposed_Teosinte),]
Teosinte_Final <- merge(SNPs, Transposed_Teosinte1, by = "SNP_ID")
Merged_SNPs_Teosinte_Final <- select(Teosinte_Final, SNP_ID, Chromosome, Position, everything())
```
I assigned a new variable Transposed_Maize, for the new transposed maize group then converted the transposed Maize_Group and used the row names as column names after making it a dataframe with a column name as SNP_ID. Here, we can view the final Transposed_Maize. Now, Transposed_Maize1 was created from Tranposed_Maize with the first 3 rows removed. Finally, the Transposed_Maize was merged with the SNPs file via using the merge command. The same code was used for the Teosinte (see code chunk). Note: use of the view command serves as means to review results after the execution of code.
### The final maize output
```{r}
view(Chromosome_Maize<- filter(Merged_SNPs_TranMaize1, Chromosome != "multiple" & Chromosome != "unknown"))
str(Chromosome_Maize)
```
### Maize folder for maize data.
```{r}
# here the code filters `Maize_data` to exclude rows where the Chromosome is "multiple" or "unknown." then using the str function we can get a brief summary of what Chromosome_Maize looks like
# we create a new directory called Maize
dir.create("./Maize", recursive = TRUE, showWarnings = FALSE)
# Then here using for loop the sequence is generated for each unique chromosome,then it filters `chr_maize` for rows matching that chromosome, sort them by "Position" column in increasing order then writes each chromosome's data to a file named "Maize_inc_A_[chromosome number]" in the "Maize" directory.
for (i in 1:length(unique(Chromosome_Maize$Chromosome))){
chrm <- Chromosome_Maize %>% filter(Chromosome == i) %>% arrange(Position)
write.table(chrm, file = paste("./Maize/Maize_inc_A",i, sep="_"))
}
```
```{r}
# create a new variable name :
view(Substituted_Maize <- as_tibble(lapply(Merged_SNPs_TranMaize1, gsub, pattern ="?", replacement ="-", fixed = TRUE)))
# create a new variable name that has no multiple or unknown in the chromosome section.
view(Substituted_Chromosome_M_U_Maize <- filter(Substituted_Maize, Chromosome != "multiple" & Chromosome != "unknown"))
str(Substituted_Chromosome_M_U_Maize)
```
'Substituted_Maize' is a new data where we are substituting the '?' with '-' converting the result into a tibble. 'Substituted_Chromosome_M_U_Maize' will contain only those rows from Substituted_Maize where the Chromosome column does not have the values "multiple" or "unknown".
```{r}
# Filter out the chromosome with multiple or unknown
# here we do the same using the for loop as the previous part but with decreasing position
for (i in 1:length(unique(Substituted_Maize$Chromosome))) {
chrm_subbed <- Substituted_Chromosome_M_U_Maize %>% filter(Chromosome == i) %>% arrange(desc(Position))
write.table(chrm_subbed, file = paste("./Maize/Maize_dec_B", i, sep = "_"))
}
```
### Teosinte Output
```{r}
# I filtered `Merged_SNPs_Teosinte_Final` to exclude rows where the Chromosome is "multiple" or "unknown." then using the str function we can get a brief summary of what chr_maize looks like:
Chromosome_Teosinte <- filter(Merged_SNPs_Teosinte_Final, Chromosome != "multiple" & Chromosome != "unknown")
str(Chromosome_Teosinte)
```
```{r}
# we create a new directory called Teosinte
dir.create("./Teosinte", recursive = TRUE, showWarnings = FALSE)
# Then here using for loop the sequence is generated for each unique chromosome,then it filters `Chromosome_Teosinte` for rows matching that chromosome, sort them by "Position" column in increasing order then writes each chromosome's data to a file named "Teosinte_inc_A_[chromosome number]" in the "Teosinte" directory:
for (i in 1:length(unique(Chromosome_Teosinte$Chromosome))) {
CHRMT <- Chromosome_Teosinte %>% filter(Chromosome == i) %>% arrange(Position)
write.table (CHRMT, file = paste("./Teosinte/Teosinte_inc_A", i, sep = "_"))
}
```
```{r}
# the new 'Subbed_teosinte' is a new data same as the maize we are substituting the '?' with '-' converting the result into a tibble
Subbed_teosinte <- as_tibble(lapply(Merged_SNPs_Teosinte_Final, gsub, pattern = "?", replacement = "-", fixed = TRUE))
# here we filter out the chromosome with multiple or unknown
Subbed_Chr_Teosinte <- filter(Subbed_teosinte, Chromosome != "multiple" & Chromosome != "unknown")
# here we do the same using the for loop as the previous part but with decreasing position for teosinte
for (i in 1:length(unique(Subbed_Chr_Teosinte$Chromosome))) {
chrt_subbed <- Subbed_Chr_Teosinte %>% filter(Chromosome == i) %>% arrange(desc(Position))
write.table(chrt_subbed, file = paste("./Teosinte/Teosinte_dec_B", i, sep = "_"))
}
```
# Part Three (3)
## Data Visualization
### Clean up and formatting SNPs Fang genotype data frames
```{r}
# Create a new clean data frame that includes only the SNP_ID, chromosome and position
SNP_clean <- SNPs %>% select(SNP_ID, Chromosome, Position)
# here to clean up, exclude JG_OTU and the Group from fang_genotypes data frame then we transpose the new data 'transposed_fang_genotypes' so now it will only have the SNP_ID and other except for JG_oTU and Group
transposed_fang_genotypes <- Fang_genotypes %>% select(-JG_OTU, -Group) %>% column_to_rownames(., var = "Sample_ID") %>% t() %>% as.data.frame() %>% rownames_to_column(., var = "SNP_ID")
view(transposed_fang_genotypes)
# Merge the cleaned and transposed SNPs and fang_genotypes and filter out chromosomes that are unknown and multiple
Merged_fang_genotypes <- merge(SNP_clean, transposed_fang_genotypes) %>% filter(., Chromosome != "unknown" & Chromosome != "multiple")
```
### SNPs per Chromosome
```{r}
library(ggplot2)
# Create a bar plot to visualize the total SNPs for each chromosome in 'Merged_fang_genotypes' where x will be the chromosome and y is the total number of SNPs
total_SNPs <- ggplot(Merged_fang_genotypes, aes(x= Chromosome)) + geom_bar(aes(fill = Chromosome)) + theme_bw() + labs(x = "Chromosome", y = "Total number of SNPs")
view(Merged_fang_genotypes)
# finally, print the bar plot for the total SNPs and save it to the directory
print(total_SNPs)
ggsave("total_SNPs.png",total_SNPs)
```
```{r}
# Create a new dataframe with only numeric Position values from the Merged_fang_genotypes
Numeric_fang_genotypes <- Merged_fang_genotypes %>% mutate(Position_numeric = as.numeric(as.character(Position))) %>% filter(!is.na(Position_numeric)) %>% select(SNP_ID, Chromosome, Position_numeric) %>% rename(Position = Position_numeric)
view(Numeric_fang_genotypes)
```
```{r}
# Now plot using the new dataframe
Diversity_SNPs <- ggplot(Numeric_fang_genotypes, aes(x = Position)) +
geom_density(aes(fill = Chromosome)) +
facet_wrap(~ Chromosome, nrow = 5, ncol = 2) +
theme_bw() +
labs(x = "Chr_Position", y = "Density")
# Create a density plot to visualize the distribution of SNPs for each chromosome in Merged_fang_genotypes data
Diversity_SNPs
```
```{r}
pdf("SNPs_per chromosome.pdf")
print(total_SNPs)
print(Diversity_SNPs)
ggsave("Diversity_SNPs.png",Diversity_SNPs)
```
```{r}
dev.off()
```
# **Visualization of missing data and amount of heterozygosity**
```{r}
# Install tidyr
install.packages("tidyr")
```
```{r}
# Load tidyr
library(tidyr)
# we create a new data frame 'tidy_fang_genotype' from the fang_genotypes by excluding JG_OTU column and pivot it to a long format where each row represents a unique combination of sample_ID,Group and SNP_ID with the corresponding sequence
tidy_fang_genotypes <- Fang_genotypes %>% select(-JG_OTU) %>% pivot_longer(cols=c(-Sample_ID, -Group), names_to = "SNP_ID", values_to = "Sequence")
# here we create new columns for sequences with same alleles 'Homozygous', 'Missing' for those with '?/?', and 'Heterozygous' for those with different alleles by using 'ifelse' and stores it a new column called new_sequence
tidy_fang_genotypes <- tidy_fang_genotypes %>% mutate(new_sequence = ifelse(Sequence %in% c("A/A", "T/T", "C/C", "G/G"), "Homozygous", ifelse(Sequence == "?/?", "Missing", "Heterozygous")))
# overview of the created column classifying the genotypes
View(tidy_fang_genotypes)
#here we create a bar plot to show proportions of sequence for each Sample_ID in the tidy_fang_genotypes
Samples_Plot <- ggplot(tidy_fang_genotypes, aes(x = Sample_ID, fill = new_sequence)) + geom_bar(position = "fill") + theme_bw() + labs(x = "Sample ID", y = "Proportion")
#Stacked Bar-graph for all groups.
Groups_Plot <- ggplot(tidy_fang_genotypes, aes(x = Group , fill = new_sequence)) + geom_bar(position = "fill") +
theme_bw() + theme(axis.text.x = element_text(angle = 90))+ labs(y = "Proportion")
#here we generate a bar plot to show proportions of the new_sequence column
pdf("Missing_DATA_Heterozygosity_Visualisation.pdf")
print(Samples_Plot)
print(Groups_Plot)
ggsave("Sample_Plot.png", Samples_Plot)
ggsave("Group_Plot.png",Groups_Plot)
dev.off()
```
# **My visualization for the proportion of Alleles in different species**
```{r}
#here i compare the proportion of each alleles that exist in each chromosome except for the missing sequences
My_visualization <- ggplot(filter(tidy_fang_genotypes, Sequence != "?/?") , aes(x = Sample_ID, fill = Sequence)) +
geom_bar(position = "fill") + theme_bw() + labs(x = "Sample ID", y = "Proportion")
My_visualization
```
```{r}
pdf("My_visualization.pdf")
ggsave("My_visualization.png", My_visualization)
dev.off()
```