
# DIYA

**Project: The Effect of Controllable Features on Acceptance of DonorsChoose Applications**

Adi Srinivasan, 7-14-20

## Describe the Question

*Describe the question that you chose from last week. Link it to your project plan here.*

The question I have chosen to pursue is whether we can predict if an application will be accepted based on the aspects of the application that are within the teacher's control.

## Dataset Exploration

*Identify the variable to be predicted (target variable) and identify the features that will be used for prediction. Provide some descriptive statistics of the variables you intend to use for the prediction (classification or regression). How would you justify the inclusion of the set of features to predict the target variable?*

The target variable is the one indicating whether or not the project was approved. This makes it a classification problem, sorting instances into approved and not-approved groups.

The features used for prediction will be the month of submission, the length of the project subcategory, the length of the project title, the frequency of keywords in the essays, the length of the essays, and the length of the resource summary.

I am not including the project category because teachers have less discretion over choosing it: there are only a limited number of options to pick from. While teachers can't choose what subject they teach, they can choose the wording of their subcategory, which is why I am including subcategory as a controllable feature.

I am including all of these features because teachers have some control over each of them, and each is a given data point that could show some correlation with a higher acceptance rate.
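As a rough sketch of how these features could be derived, assuming the column names from the public DonorsChoose dataset (`project_submitted_datetime`, `project_title`, `project_subject_subcategories`, `project_resource_summary`; these names are assumptions, not taken from this write-up):

```python
import pandas as pd

# Column names are assumed from the public DonorsChoose dataset;
# adjust to match the actual CSV headers.
df = pd.read_csv("train.csv")

# Month of submission (teachers control when they submit).
df["submit_month"] = pd.to_datetime(df["project_submitted_datetime"]).dt.month

# Character lengths of the free-text fields the teacher writes.
df["title_len"] = df["project_title"].str.len()
df["subcategory_len"] = df["project_subject_subcategories"].str.len()
df["resource_summary_len"] = df["project_resource_summary"].str.len()
```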

## Data Cleaning/Manipulation

*Figure out if the data can be used as is. Are all the records complete with the values you would use for training? Do you need to eliminate certain records? Do you need to consolidate or convert any fields?*

All applications after 2016-05-17 have missing entries for essays 3 and 4. I don't think I'll need to eliminate those records; I could either separate them out or leave them as is and only predict based on the essays that are available.
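One simple way to keep those records, in line with the essay consolidation described later, is to merge all four essay fields into a single text column and treat missing essays as empty strings. A minimal sketch, assuming essay columns named `project_essay_1` through `project_essay_4`:

```python
# Consolidate the four essay fields into one text column so that records
# missing essays 3 and 4 can still be used as-is.
essay_cols = ["project_essay_1", "project_essay_2",
              "project_essay_3", "project_essay_4"]
df["all_essays"] = (
    df[essay_cols]
    .fillna("")                  # missing essays become empty strings
    .agg(" ".join, axis=1)
    .str.strip()
)
df["essay_len"] = df["all_essays"].str.len()
```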

## Training Set/Test Set

*Determine how you will divide the data into a training set, validation set and test set.*

I will split my data into training, validation, and test sets using an 80-10-10 ratio.
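A minimal sketch of one way to get that 80-10-10 split with scikit-learn's `train_test_split`, applied twice (the feature and target names carry over from the sketches above and are assumptions):

```python
from sklearn.model_selection import train_test_split

features = ["submit_month", "title_len", "subcategory_len",
            "resource_summary_len", "essay_len"]
X, y = df[features], df["project_is_approved"]

# First carve off 20%, then split that evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)
```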

## Machine Learning Algorithm

*Explore the use of the decision tree algorithm for solving the problem. Describe the decision tree used for solving the problem. Add code snippets to showcase the approach that you took to solve the problem.*

*Add any assumptions that you are making on the dataset while applying the machine learning algorithm.*

I used a decision tree classifier as my first algorithm. My control features ended up being the length of the essays, the month of submission, the length of the project title, the length of the resource summary, and the frequency of the top 50 keywords of approved essays based on TF-IDF scores.

![Code snippet: TF-IDF keyword extraction](images/image1.png)

The above code snippet shows the use of the TfidfTransformer from sklearn, which I used to find the top 50 keywords used in all of the approved essays. I decided to consolidate all essays into a single column.
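A minimal sketch of that keyword-extraction step, assuming the consolidated `all_essays` column from the earlier sketch (the exact code in the screenshot likely differs):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Fit TF-IDF on the consolidated essays of approved projects only.
approved = df.loc[df["project_is_approved"] == 1, "all_essays"]
counts_vec = CountVectorizer(stop_words="english")
tfidf = TfidfTransformer().fit_transform(counts_vec.fit_transform(approved))

# Rank terms by summed TF-IDF score and keep the top 50 as keywords.
scores = np.asarray(tfidf.sum(axis=0)).ravel()
vocab = np.array(counts_vec.get_feature_names_out())
top50 = vocab[np.argsort(scores)[::-1][:50]]

# Per-application frequency of those keywords, for use as model features.
keyword_counts = CountVectorizer(vocabulary=list(top50)).fit_transform(df["all_essays"])
```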

![alt_text](images/image2.png)
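Fitting the tree itself is then only a few lines; a sketch using the split from earlier (keyword-frequency columns could be appended to the feature matrix in the same way):

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("val accuracy:  ", accuracy_score(y_val, tree.predict(X_val)))
```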

## Performance Evaluation and Analyses

*Evaluate the performance of the "decision tree" algorithm to answer your question by calculating the accuracy of the algorithm on the training data and validation data. Think of setting up a realistic experimental condition, where you carefully observe the changes in performance by varying different factors in a controlled manner. To do that, observe how the performance of the algorithm changes with respect to:*

1. *Size of the training data. To see this, plot the training accuracy and validation accuracy for classification (mean squared error for regression) for different values of training data size. What are your observations?*

![Accuracy vs. training data size](images/image3.png)

![Accuracy vs. training data size](images/image4.png)

![Accuracy vs. training data size](images/image5.png)
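A sketch of the kind of loop that produces curves like these, retraining on progressively larger slices of the training set (the sizes are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

sizes = [1_000, 5_000, 10_000, 50_000, 100_000]  # illustrative sizes
train_acc, val_acc = [], []
for n in sizes:
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train.iloc[:n], y_train.iloc[:n])
    train_acc.append(accuracy_score(y_train.iloc[:n], clf.predict(X_train.iloc[:n])))
    val_acc.append(accuracy_score(y_val, clf.predict(X_val)))

plt.plot(sizes, train_acc, label="train")
plt.plot(sizes, val_acc, label="validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```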

2. *Parameters of the decision tree algorithm. To see this, plot the training accuracy and validation accuracy for (i) different values of the minimum number of samples required at a leaf node and (ii) different values of the maximum depth of the tree.*

*Remember to avail yourself of various visualization tools to aid your analyses.*

### Min Sample Leaf

![Accuracy vs. min_samples_leaf](images/image6.png)

![Accuracy vs. min_samples_leaf](images/image7.png)

I'm not sure if there was an error on my part, but `min_samples_leaf` doesn't seem to affect accuracy. One possible explanation is that the tree's leaves already held more samples than the values I tested, in which case the constraint would never bind.

### Max Depth

![Accuracy vs. max_depth](images/image8.png)

![Accuracy vs. max_depth](images/image9.png)
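A sketch of the kind of sweep behind both sets of plots, varying one parameter at a time while leaving the rest at their defaults (the value ranges are illustrative assumptions, not the ones actually plotted):

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def sweep(param_name, values):
    """Train one tree per value; return (train, validation) accuracies."""
    train_acc, val_acc = [], []
    for v in values:
        clf = DecisionTreeClassifier(random_state=42, **{param_name: v})
        clf.fit(X_train, y_train)
        train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
        val_acc.append(accuracy_score(y_val, clf.predict(X_val)))
    return train_acc, val_acc

leaf_train, leaf_val = sweep("min_samples_leaf", [1, 5, 10, 50, 100])
depth_train, depth_val = sweep("max_depth", [2, 4, 8, 16, 32])
```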

## Analyses of Results

*Share your analyses of the results. To help you with this, evaluate the performance by checking for overfitting or underfitting. How would you address the overfitting/underfitting conditions?*

*What is the best setting of the decision tree? Plot your tree.*

![Plot of the best decision tree](images/image10.png)
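A plot like this can be produced with scikit-learn's `plot_tree`; a minimal sketch, assuming the fitted `tree` and the `features` list from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=features, class_names=["rejected", "approved"],
          filled=True, max_depth=3)  # truncate depth for readability
plt.show()
```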

## Interpret your Model

*Explain the features that were useful in the prediction. Are some variables more important than others in your prediction? Are you surprised by what you found?*
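One standard way to answer this with a fitted scikit-learn tree is its `feature_importances_` attribute; a minimal sketch, assuming the `tree` and `features` names from the earlier sketches:

```python
import pandas as pd

# Importance of each input feature in the fitted tree, largest first.
importances = pd.Series(tree.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```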