
Work done for University of Pittsburgh course "Principles of Data Science" (STAT 1261) with Dr. Junshu Bao in Fall semester of 2018.


YogiOnBioinformatics/Principles-of-Data-Science


Principles of Data Science

Note

Some PDF files come with their R Markdown (Rmd) source files while others do not. The Rmd files are included as supplementary material. For those unfamiliar with R Markdown, the Rmd source is rendered into a PDF that contains the final, formatted work with all of its visualizations.

If you are unsure what data science is, please check the link associated with this repository.

Introduction

Work done for the University of Pittsburgh course "Principles of Data Science" (STAT 1261) with Dr. Junshu Bao in the Fall semester of 2018.

Units for the course were divided into:

  1. Data Visualization

  2. Data Tidying and Wrangling

  3. Multiple Statistical Models and Bootstrapping

  4. Machine Learning

💥 Units build on one another, so most units are not mutually exclusive and draw on knowledge from earlier units.

Folders

📁 Data Visualization/

Contains multiple files showing various data visualizations made with packages such as ggplot2 as well as the generic built-in R function plot(). There are many examples of linear models as well as colorful graphics showing how data can be visualized well. The purpose of these is to pique an audience's interest while reinforcing the message behind the data.
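
As a rough illustration of the kind of plot described above, here is a minimal sketch that draws a scatter plot with a fitted linear model in both ggplot2 and base plot(). It uses the built-in mtcars data set as a stand-in; the data set, variables and labels are assumptions chosen for the example and are not taken from the course files.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for the course data sets (assumption).
fit <- lm(mpg ~ wt, data = mtcars)

# ggplot2 version: points coloured by cylinder count, plus a fitted line.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders",
    title = "Fuel economy versus weight"
  )
print(p)

# Base-graphics version of the same relationship using plot().
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")
```

Keeping the colour mapping inside geom_point() means geom_smooth() fits one line to all points rather than one line per cylinder group.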

📁 Data Tidying and Wrangling/

Progressing from relatively easy exercises to hard ones, this folder uses many different data sets to turn very messy raw data into meaningful, usable data through tidying and wrangling. The packages used were dplyr, tidyr, tidyverse and mdsr.
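
To make "tidying and wrangling" concrete, the sketch below reshapes a small wide table into one row per observation with tidyr, then summarises it with dplyr. The `messy` table and its column names are hypothetical and invented for this example; they do not come from the repository.

```r
library(dplyr)
library(tidyr)

# Hypothetical messy table: one row per student, one column per quiz.
messy <- data.frame(
  student = c("A", "B", "C"),
  quiz_1 = c(8, 6, NA),
  quiz_2 = c(9, 7, 10)
)

# Tidy: one row per (student, quiz) pair, dropping missing scores.
tidy_scores <- messy %>%
  pivot_longer(
    cols = starts_with("quiz_"),
    names_to = "quiz",
    names_prefix = "quiz_",
    values_to = "score"
  ) %>%
  mutate(quiz = as.integer(quiz)) %>%
  filter(!is.na(score))

# Wrangle: per-student mean score and number of quizzes taken.
summary_by_student <- tidy_scores %>%
  group_by(student) %>%
  summarise(mean_score = mean(score), n_quizzes = n(), .groups = "drop")

print(summary_by_student)
```

Once the data are in long form, grouping and summarising work the same way no matter how many quiz columns the original table had.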

📁 Multiple Statistical Models and Bootstrapping/

Files inside this folder relate to the creation of many types of statistical models to extract and validate meaningful results from messy data sets. Bootstrapping was also employed on several occasions. Packages used include mdsr, tidyverse, tidyr and broom.
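
For readers new to bootstrapping, the following is a minimal sketch in base R: it resamples rows with replacement, refits a simple linear model each time, and reports a percentile confidence interval for the slope. The mtcars data set, the model formula, the seed and the number of resamples are all assumptions made for illustration; the course files may use different data and helper packages.

```r
set.seed(1261)  # arbitrary seed so the resampling is reproducible

n_boot <- 1000  # number of bootstrap resamples (assumed value)

# Each replicate resamples the rows with replacement and refits the model.
boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))[["wt"]]
})

# Percentile 95% confidence interval for the slope of mpg on wt.
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci)
```

The spread of `boot_slopes` estimates how much the slope would vary across samples, without relying on the usual normal-theory standard error.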

📁 Machine Learning/

Files inside relate to both machine learning and in-depth results of multiple linear regression. The packages used for this section were glmnet, rpart, rpart.plot and mdsr.
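
As one example of the techniques these packages support, here is a minimal sketch of a classification tree fitted with rpart and evaluated on held-out rows. The iris data set, the 100/50 split and the seed are assumptions chosen for the example, not details from the course work.

```r
library(rpart)

set.seed(1261)  # arbitrary seed so the split is reproducible

# Hold out 50 of the 150 iris rows for testing (assumed split).
train_idx <- sample(nrow(iris), size = 100)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit a classification tree predicting species from all other columns.
tree <- rpart(Species ~ ., data = train, method = "class")

# Predict classes for the held-out rows and compare with the truth.
pred <- predict(tree, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)

print(table(predicted = pred, actual = test$Species))
print(accuracy)
```

If the rpart.plot package is installed, `rpart.plot::rpart.plot(tree)` draws the fitted tree.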

Contact Information


Yogindra Raghav (YogiOnBioinformatics)

[email protected]