🌴 Welcome to Level 1 of the 2023 EM/Dev Assessment! 🌴

This is the first level of the Vanderbilt Data Science dev assignment! In this level, you will choose and clean a dataset so that it can be used for later analysis.

Some datasets and their focus questions:

  • YouTube Channel Niche: using this dataset, create a model to predict the category or niche of YouTube channels. This is a classification problem.
  • Pet's Facial Expression Image Dataset: using this dataset, create a model to classify images of animals into different categories. This is a classification problem.
  • Customer Personality Analysis: using this dataset, create a model to predict whether a customer has made at least one complaint in the last 2 years. This is a classification problem.
  • ECG Heartbeat Categorization Dataset: using this dataset, create a model to cluster ECG heartbeats into different categories. This is a clustering problem.
  • Pokemon with stats: using this dataset, create a model to cluster Pokemon into different categories. This is a clustering problem.

Some of these datasets are silly and some are serious; some make sense in the context of their focus question and some don't. The point of this level is to get you thinking about the features of the dataset you choose and how they relate to the focus question you want to answer. You are free to choose a different dataset and/or focus question than the ones listed above.

Objective:

Your task is to get this data loaded and cleaned in your notebook. This means that you should choose which libraries (if any) you will use and how you are going to clean the data. You should also write a README file that answers these questions:

  • What is the dataset you chose? (include a link to the dataset)
  • What is the focus question you are trying to answer with this dataset? Include whether this is a classification, clustering, or other type of problem.
  • What is your plan for cleaning this dataset? (you don't have to go into too much detail, just a general idea of what you are going to do)
  • Any extra information you'd like to include

Submission:

In this folder, put your answers to the questions from the Objective section in a file called '[name].md' (with the name replaced by yours, e.g. janeDoe.md). The code you use to load and clean your chosen dataset should be located in the Notebook(s) folder; you may title this notebook as you wish.

Bonus Points!! 💃💃

+1 point for using Python
+1 point for using a dataset that is not listed above
+2 points for using a dataset that is not from Kaggle
+1 point for coming up with your own focus question
+2 points for choosing a clustering problem
+3 points for choosing a Neural Network problem

Tips and Tricks

What libraries should I use? We recommend NumPy, pandas, and scikit-learn, but ultimately the libraries you use are up to you.

This section is super short?? Yup! The first step is always the hardest part, but in terms of the actual code/submission, this is the shortest portion of the entire coding challenge. Don't worry if you feel like you're missing something; this is just the beginning!

How do I clean a dataset? Unless you want to agonize through hundreds of hours scrolling down an Excel spreadsheet, data is cleaned with code that filters, corrects, and reformats the entries of a structured dataset. This article explains what cleaning data is and steps that can be taken, and this article covers many specific cleaning techniques that can be done in Python. Google is your friend! How data is cleaned depends on the format and type of the dataset, but Python has extensive and intuitive support for data management, making it well suited to data science.
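As a rough illustration of what cleaning can look like in practice, here is a minimal pandas sketch. The inline CSV, its column names (`name`, `height_cm`, `weight_kg`, `category`), and the `clean` helper are all made up for this example; your own dataset will need its own steps, and you would normally load it with `pd.read_csv("your_file.csv")` instead of from a string.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file (not a real dataset).
RAW_CSV = """name,height_cm,weight_kg,category
Alice,170,65,A
Bob,,80,B
Alice,170,65,A
Cara,160,not recorded,b
Dan,180,90, b
"""


def clean(df):
    """Return a cleaned copy of df (illustrative steps only)."""
    df = df.copy()
    # Normalize inconsistent labels: strip whitespace, unify letter case.
    df["category"] = df["category"].str.strip().str.upper()
    # Turn non-numeric entries such as "not recorded" into NaN.
    df["weight_kg"] = pd.to_numeric(df["weight_kg"], errors="coerce")
    # Drop exact duplicate rows, then rows missing a required value.
    df = df.drop_duplicates()
    df = df.dropna(subset=["height_cm", "weight_kg"])
    return df.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
cleaned = clean(raw)
print(cleaned)
```

Dropping rows with missing values is only one option; depending on your focus question, filling them in (for example with a column median) may lose less data.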

More cleaning tips: Not all features are created equal! Some features are important in answering a focus question (e.g. the number of eye shapes in an image when trying to classify humans and aliens), and some aren't (e.g. the favourite color of the subject in an image when trying to classify humans and aliens). It is up to you to decide what is and isn't important! Additionally, you might find that features not present in the dataset, e.g. BMI, are useful for your focus question.
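To make the feature tips concrete, the sketch below drops a column assumed to be irrelevant and derives BMI (weight in kilograms divided by the square of height in meters) from two existing columns. The `people` table, its column names, and the `prepare_features` helper are hypothetical, chosen only to mirror the humans-and-aliens example above.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
people = pd.DataFrame(
    {
        "height_cm": [170.0, 180.0, 160.0],
        "weight_kg": [65.0, 90.0, 50.0],
        "favourite_color": ["red", "blue", "green"],
        "label": ["human", "human", "alien"],
    }
)


def prepare_features(df):
    """Drop a feature judged irrelevant and add a derived BMI column."""
    # Assumption for this example: favourite color says nothing about the label.
    out = df.drop(columns=["favourite_color"])
    height_m = out["height_cm"] / 100
    # BMI = weight (kg) / height (m) squared.
    return out.assign(bmi=out["weight_kg"] / height_m**2)


features = prepare_features(people)
print(features)
```

Whether a column is truly irrelevant is a judgment call about your own dataset, so it is worth noting the reasoning behind each dropped or derived feature in your README.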