Water-Pump-Status-Prediction

This project implements machine learning models to predict the status of water pumps in Tanzania using data from DrivenData's competition. The project includes preprocessing steps, model evaluation using cross-validation, and hyperparameter optimization with Optuna.

Libraries/Imports Used

optuna
scikit-learn
pandas
NumPy
subprocess
argparse
os

Project Structure

part1.py: This script trains and evaluates various machine learning models with different preprocessing techniques using cross-validation.
part2.py: This script performs hyperparameter optimization using Optuna.
TestAll.py: This script automates the process of running part1.py with different combinations of preprocessing and model types.

Usage

Running Part 1

To evaluate the models with different preprocessing techniques and model types, you can manually run part1.py with the desired arguments. However, it's more efficient to use TestAll.py to automate this process.

Example Command: python part1.py <train-input-file> <train-labels-file> <test-input-file> <numerical-preprocessing> <categorical-preprocessing> <model-type> <test-prediction-output-file>
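
A minimal sketch of how the seven positional arguments in the example command could be parsed with argparse. The argument names and the accepted option strings below are assumptions for illustration; the actual part1.py may name or spell them differently.

```python
import argparse

# Hypothetical option strings; the real script may accept different values.
NUMERICAL_CHOICES = ["None", "StandardScaler"]
CATEGORICAL_CHOICES = ["OneHotEncoder", "OrdinalEncoder", "TargetEncoder"]
MODEL_CHOICES = [
    "LogisticRegression",
    "RandomForestClassifier",
    "GradientBoostingClassifier",
    "HistGradientBoostingClassifier",
    "MLPClassifier",
]


def build_parser():
    """Build a parser that mirrors the positional order of the example command."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a water pump status model."
    )
    parser.add_argument("train_input_file")
    parser.add_argument("train_labels_file")
    parser.add_argument("test_input_file")
    parser.add_argument("numerical_preprocessing", choices=NUMERICAL_CHOICES)
    parser.add_argument("categorical_preprocessing", choices=CATEGORICAL_CHOICES)
    parser.add_argument("model_type", choices=MODEL_CHOICES)
    parser.add_argument("test_prediction_output_file")
    return parser


# Parsing an explicit list instead of sys.argv, with placeholder file names.
example_args = build_parser().parse_args(
    ["train.csv", "labels.csv", "test.csv",
     "StandardScaler", "OneHotEncoder", "RandomForestClassifier", "predictions.csv"]
)
```

Restricting the preprocessing and model arguments with choices makes a mistyped option fail immediately instead of partway through training.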

Automate Part 1 with TestAll.py: The TestAll.py script loops through every combination of the command-line options, so you do not have to type the command by hand for each preprocessing and model variation. Run this file in your IDE.
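
The loop described above can be sketched with itertools and subprocess. This is an illustration under assumptions, not the contents of TestAll.py: the option strings, the output file naming scheme, and the helper names build_commands and run_all are all hypothetical.

```python
import itertools
import subprocess
import sys

# Hypothetical option strings; adjust them to whatever part1.py accepts.
NUMERICAL = ["None", "StandardScaler"]
CATEGORICAL = ["OneHotEncoder", "OrdinalEncoder", "TargetEncoder"]
MODELS = [
    "LogisticRegression",
    "RandomForestClassifier",
    "GradientBoostingClassifier",
    "HistGradientBoostingClassifier",
    "MLPClassifier",
]


def build_commands(train_input, train_labels, test_input, script="part1.py"):
    """Return one command (as an argument list) per option combination."""
    commands = []
    for num, cat, model in itertools.product(NUMERICAL, CATEGORICAL, MODELS):
        # Assumed naming scheme for the prediction file of each run.
        output = f"predictions_{num}_{cat}_{model}.csv"
        commands.append(
            [sys.executable, script, train_input, train_labels, test_input,
             num, cat, model, output]
        )
    return commands


def run_all(commands):
    """Run each command in turn and report the ones that fail."""
    for cmd in commands:
        # check=False so a single failing combination does not stop the sweep.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            print("Failed:", " ".join(cmd))
```

With two numerical options, three encoders, and five model families, build_commands produces 30 commands. Passing them to run_all executes them one after another.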

Running Part 2

For hyperparameter optimization, run the part2.py script. It uses Optuna to find the best hyperparameters for different classifiers.

Data

The dataset for this project is sourced from the DrivenData competition Pump it Up: Data Mining the Water Table. It consists of training data (input features and labels) and test data (input features only).

Preprocessing

Various preprocessing steps are implemented, including handling categorical features, dealing with missing values, scaling numerical values, and dealing with datetime features. For categorical features, I use three types of encoding:

OneHotEncoder
OrdinalEncoder
TargetEncoder

For numerical features, I consider two options: no scaling and StandardScaler.
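
One way these options could be combined is a scikit-learn ColumnTransformer. The sketch below is an assumption about structure, not the code in part1.py: the function names, the option strings, the imputation strategies, and the choice of year and month as datetime features are all illustrative. TargetEncoder requires scikit-learn 1.3 or later, so it is imported only when requested.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler


def add_date_parts(df, column):
    """Replace a date column with numeric year and month columns (illustrative)."""
    out = df.copy()
    dates = pd.to_datetime(out[column])
    out[column + "_year"] = dates.dt.year
    out[column + "_month"] = dates.dt.month
    return out.drop(columns=[column])


def make_preprocessor(numerical_cols, categorical_cols,
                      numerical="StandardScaler", categorical="OneHotEncoder"):
    """Build a ColumnTransformer for the chosen scaling and encoding options."""
    if categorical == "OneHotEncoder":
        encoder = OneHotEncoder(handle_unknown="ignore")
    elif categorical == "OrdinalEncoder":
        # Categories unseen during fit are mapped to -1 at transform time.
        encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    elif categorical == "TargetEncoder":
        from sklearn.preprocessing import TargetEncoder  # scikit-learn >= 1.3
        encoder = TargetEncoder()
    else:
        raise ValueError(f"Unknown categorical option: {categorical}")

    # "passthrough" leaves the imputed numerical values unscaled.
    scaler = StandardScaler() if numerical == "StandardScaler" else "passthrough"

    numerical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", scaler),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", encoder),
    ])
    return ColumnTransformer([
        ("num", numerical_pipe, numerical_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])
```

Placing the preprocessor and the classifier in one Pipeline means that, during cross-validation, the encoders and scaler are fitted on the training folds only.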

Machine Learning Models

I evaluate the performance of five families of machine learning models:

Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier
Histogram-based Gradient Boosting Classifier
Multi-layer Perceptron Classifier

Hyperparameter Optimization

The project uses Optuna for hyperparameter optimization. The optimization process includes defining the configuration space and evaluating the performance using cross-validation.
