title: README.md
author: Lauren Fitch
date: Sunday, March 22, 2015
output: html_document

I started by getting the list of files in the test and train directories.

# pattern is a regular expression, not a glob: match names ending in ".txt"
test_files <- list.files("tidy_data/test",
                         pattern = "\\.txt$", full.names = TRUE)
training_files <- list.files("tidy_data/train",
                             pattern = "\\.txt$", full.names = TRUE)

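Then I read those files into R. The string handling in the following steps uses the stringr package.

# Install stringr only if it is missing, then load it
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
library(stringr)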

I made names for the data frames that match the original filenames.

test_file_names <- str_sub(test_files, 16, -5)
train_file_names <- str_sub(training_files, 17, -5)

ldf_test <- lapply(test_files, read.table)
names(ldf_test) <- test_file_names

ldf_train <- lapply(training_files, read.table)
names(ldf_train) <- train_file_names
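
The fixed offsets passed to str_sub depend on the exact length of the directory prefix. As a hedged alternative, base R can strip the directory and the extension whatever the path length. The paths below are hypothetical stand-ins for what list.files might return; they are not taken from the original script.

```r
# Hypothetical paths standing in for the output of list.files().
paths <- c("tidy_data/test/subject_test.txt",
           "tidy_data/test/X_test.txt",
           "tidy_data/test/y_test.txt")

# basename() drops the directory; file_path_sans_ext() drops the ".txt" suffix.
file_names <- tools::file_path_sans_ext(basename(paths))
print(file_names)
```

The same call would work unchanged for the train directory, so no second set of offsets is needed.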

Read in features

features <- read.table("tidy_data/features.txt")

Keep only the second column, which holds the feature names

features <- features[ ,2]

Remove special characters

features <- str_replace_all(features, "[-(),]", "")
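
To see what the character class removes, here is a small sketch on a few sample names. The names are illustrative only, and gsub is used as the base R counterpart of str_replace_all so the snippet runs without extra packages.

```r
# Sample names in a raw "signal-stat()-axis" style (illustrative, not the real list).
raw <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "fBodyAcc-bandsEnergy()-1,8")

# Delete every hyphen, parenthesis, and comma; all other characters are kept.
clean <- gsub("[-(),]", "", raw)
print(clean)
# [1] "tBodyAccmeanX"         "tBodyAccstdY"          "fBodyAccbandsEnergy18"
```

The cleaned strings contain only letters and digits, which makes them safe to use as column names.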

Combine the subject, feature, and activity-code tables into one data frame per set, then stack the test and train sets

merge_data_test <- data.frame(ldf_test[1], ldf_test[2], ldf_test[3])
names(merge_data_test) <- c("subject", features, "y")

merge_data_train <- data.frame(ldf_train[1], ldf_train[2], ldf_train[3])
names(merge_data_train) <- c("subject", features, "y")

merge_data <- rbind(merge_data_test, merge_data_train)
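
The positional indexing above assumes that list.files returned the files in the order subject, X, y, which may not hold under every sort order. A minimal sketch of a name-based variant follows; the toy tables and the feature names featA and featB are made up for illustration.

```r
# Toy stand-ins for the three tables of one directory, deliberately out of order.
ldf <- list(
  X_test       = data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4)),
  subject_test = data.frame(V1 = c(2L, 4L)),
  y_test       = data.frame(V1 = c(5L, 1L))
)
features <- c("featA", "featB")  # hypothetical feature names

# Pick each table by name so the order of the list does not matter.
merged <- data.frame(ldf[["subject_test"]], ldf[["X_test"]], ldf[["y_test"]])
names(merged) <- c("subject", features, "y")
print(merged)
```

Selecting by name costs nothing extra here and keeps the column labels attached to the right data even if the file listing changes.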

I extracted only the measurements on the mean and standard deviation for each measurement by looking for either "mean" or "std" in the variable name. The match is case-sensitive, and I did not remove any variables from the resulting list.

col_names <- str_detect(names(merge_data), "mean|std")
col_names <- names(merge_data)[col_names]
col_names <- c("subject", col_names, "y")

extract_data <- merge_data[ , col_names]
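
The sketch below shows which names a pattern like "mean|std" keeps. The names are illustrative, not the real feature list, and grepl stands in for str_detect so the snippet needs only base R.

```r
# Illustrative cleaned column names, including the id and label columns.
nm <- c("subject", "tBodyAccmeanX", "tBodyAccstdX",
        "fBodyAccmeanFreqX", "angletBodyAccMeangravity", "tBodyAccmaxX", "y")

# TRUE wherever the name contains lowercase "mean" or "std".
keep <- grepl("mean|std", nm)
print(nm[keep])
# [1] "tBodyAccmeanX"     "tBodyAccstdX"      "fBodyAccmeanFreqX"
```

A name containing "meanFreq" matches, while one containing only the capitalized "Mean" does not, so the choice of pattern decides how many columns survive.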

I used descriptive activity names to name the activities in the data set, by reading in the activity labels and matching each row's code to its label.

activity <- read.table("tidy_data/activity_labels.txt")
names(activity) <- c("id", "name")
extract_data$activity <- activity$name[match(extract_data$y, activity$id)]
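
The lookup relies on match returning, for each code, the row position of that code in the label table. A toy sketch of the same idiom, with made-up labels and codes:

```r
# Toy label table; the ids and names are placeholders.
activity <- data.frame(id = 1:3,
                       name = c("WALKING", "SITTING", "STANDING"),
                       stringsAsFactors = FALSE)

# One activity code per observation.
y <- c(3L, 1L, 1L, 2L)

# match() gives the row of each code in activity$id; index the names with it.
labels <- activity$name[match(y, activity$id)]
print(labels)
# [1] "STANDING" "WALKING"  "WALKING"  "SITTING"
```

A code that is absent from the table would produce NA rather than an error, which is worth checking for on real data.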

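I labeled the data set with descriptive variable names. The first 80 names stay as they are, the y column becomes activity_code, and the last column is activity.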

names(extract_data) <- c(names(extract_data)[1:80], 
                         "activity_code", "activity")

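From the labeled data set above, I created a second, independent tidy data set with the average of each variable for each activity and each subject.

This step requires the dplyr package.

# Install dplyr only if it is missing, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)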

Group the data by subject and activity, then average every remaining column within each group

who <- group_by(extract_data, subject, activity)
tidy <- summarise_each(who, funs(mean))
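
Newer dplyr releases deprecate summarise_each() and funs() in favor of across(). As a dependency-free cross-check, the same per-group averages can be sketched with base R's aggregate. The toy data and the column name featA are assumptions for illustration only.

```r
# Toy data: subject 1 has two WALKING rows, so that group's mean is (1 + 3) / 2.
toy <- data.frame(
  subject  = c(1, 1, 1, 2),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  featA    = c(1, 3, 5, 7)
)

# Average every non-grouping column within each subject/activity pair.
avg <- aggregate(. ~ subject + activity, data = toy, FUN = mean)
print(avg)

# Assumed dplyr >= 1.0 counterpart (not run here):
# summarise(group_by(toy, subject, activity), across(everything(), mean))
```

The result has one row per subject and activity combination that occurs in the data, which is the shape the tidy output is expected to have.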

Remove the activity code column

# Drop the column by name rather than by position
tidy <- tidy[ , names(tidy) != "activity_code"]

Output the tidy data set

write.table(tidy, "tidy.txt", row.names = FALSE)
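
A file written this way can be read back with read.table and header = TRUE. The round trip below is a small sketch on a toy data frame written to a temporary file, so it does not touch tidy.txt; the column names are placeholders.

```r
# Toy table with the same kind of layout: ids, a label, and a numeric average.
toy <- data.frame(subject  = c(1L, 2L),
                  activity = c("WALKING", "SITTING"),
                  featA    = c(0.5, 0.25))

# Write without row names, then read the file back with its header row.
out <- tempfile(fileext = ".txt")
write.table(toy, out, row.names = FALSE)
check <- read.table(out, header = TRUE)
print(check)
```

Comparing the column names and row count of the re-read table against the original is a quick sanity check before sharing the output.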