This project predicts customer responses to automobile insurance offers using machine learning. Using Kaggle's synthetic dataset, we explore data visualization, preprocessing, and modeling strategies to maximize ROC-AUC on this binary classification problem.
- Competition: Kaggle Playground Series 2024
- Objective: Predict the probability of a customer responding positively to an automobile insurance offer.
- Evaluation Metric: Area Under the ROC Curve (ROC-AUC).
- Synthetic Data: Designed to mimic real-world data while preserving privacy.
- Library Setup: Installing and importing the essential Python libraries.
- Data Exploration: Inspecting data structure, identifying patterns, and analyzing distributions.
- Feature Engineering: Encoding categorical variables, scaling numerical features, and handling missing values.
- Model Development: Implementing and fine-tuning an XGBoost model.
- Visualization: Creating insightful plots for understanding feature relationships and model performance.
- Submission: Preparing and validating the final submission file.
- Programming Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost, Plotly
- Clone the repository and navigate to the project directory.
- Install the required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn xgboost plotly
- Download the dataset from the Kaggle competition page and place the files in the project directory, next to the notebook.
- Open the notebook and execute cells sequentially to reproduce the results.
The XGBoost model achieved an ROC-AUC score of 0.886 on the test set (a score of 1.0 is a perfect ranking and 0.5 is equivalent to random guessing).
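
To make the evaluation metric concrete, here is a small dependency-free sketch of ROC-AUC using the rank-sum (Mann-Whitney U) formulation. The function name `roc_auc` is a hypothetical helper written for this illustration; in practice `sklearn.metrics.roc_auc_score` does the same job.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum formulation, giving tied scores their average rank."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks; each group of tied scores shares the group's mean rank.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    # AUC is the probability that a random positive outranks a random negative.
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the metric depends only on how the predictions are ordered, any monotonic rescaling of the predicted probabilities leaves the score unchanged.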
Anna Balatska - Kaggle Grandmaster | Data Scientist | Machine Learning Enthusiast