This project predicts New York City taxi trip durations.
The goal is to use the available data to train a simple machine learning model
that predicts trip duration from inputs that would be available in a production environment.
Ultimately, this use case could serve real-time trip duration predictions (like the itineraries in Google Maps or Waze), but for simplicity this module assumes batch prediction: the data to score is stored in a file that is fed to the trained model.
The machine learning phase consists mainly of the following steps:
- data processing
- model training
- model evaluation
- prediction
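The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course notebook's implementation: it assumes the TLC yellow taxi column names (tpep_pickup_datetime, tpep_dropoff_datetime, PULocationID, DOLocationID, trip_distance), uses a tiny synthetic data set so that it runs without the downloaded files, and picks a DictVectorizer with a LinearRegression as one possible simple model. The helper names (make_trips, process, to_features, run_pipeline) and the 1 to 60 minute duration filter are hypothetical choices made for the example.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

CATEGORICAL = ["PULocationID", "DOLocationID"]
NUMERICAL = ["trip_distance"]


def make_trips(start, minutes):
    """Build a synthetic frame of four trips (stand-in for one monthly file)."""
    pickup = pd.Series(pd.to_datetime([start] * len(minutes)))
    return pd.DataFrame({
        "tpep_pickup_datetime": pickup,
        "tpep_dropoff_datetime": pickup + pd.to_timedelta(minutes, unit="min"),
        "PULocationID": [1, 2, 1, 2],
        "DOLocationID": [3, 3, 4, 4],
        "trip_distance": [m / 4 for m in minutes],
    })


def process(df):
    """Data processing: compute the target (minutes) and drop outliers."""
    df = df.copy()
    delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60
    # Assumed filter: keep trips lasting between 1 and 60 minutes.
    return df[(df["duration"] >= 1) & (df["duration"] <= 60)].copy()


def to_features(df):
    """Turn rows into dicts; location IDs become strings so they are one-hot encoded."""
    features = df[CATEGORICAL + NUMERICAL].copy()
    for col in CATEGORICAL:
        features[col] = features[col].astype(str)
    return features.to_dict(orient="records")


def run_pipeline():
    train = process(make_trips("2021-01-01 08:00", [10, 20, 30, 200]))
    val = process(make_trips("2021-02-01 08:00", [12, 18, 33, 45]))
    new = make_trips("2021-03-01 08:00", [15, 25, 35, 40])

    # Model training
    vectorizer = DictVectorizer()
    x_train = vectorizer.fit_transform(to_features(train))
    model = LinearRegression().fit(x_train, train["duration"])

    # Model evaluation: root mean squared error, in minutes
    val_pred = model.predict(vectorizer.transform(to_features(val)))
    rmse = mean_squared_error(val["duration"], val_pred) ** 0.5

    # Batch prediction on unseen data
    predictions = model.predict(vectorizer.transform(to_features(new)))
    return rmse, predictions
```

With the real files, each make_trips call would be replaced by loading the corresponding month, for example with pandas.read_parquet.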
The data for this module can be downloaded from the TLC Trip Record Data page. To complete this module, you will need three samples of data:
- sample 1: yellow trip 2021-01 data (to train the model)
- sample 2: yellow trip 2021-02 data (to evaluate the model)
- sample 3: yellow trip 2021-03 data (for prediction)
Disclaimer: The data volumes used in this module are too small to yield well-performing models or meaningful performance figures. We use volumes that fit on a local machine and let pipelines be built and run quickly; model performance and interpretability are not the focus of this course.
Data location: Please create a "00-data" folder in the course root directory and put the downloaded files inside.
If the names differ, rename your files to "yellow_tripdata_2021-01.parquet" (and likewise for 2021-02 and 2021-03).
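Before running anything, it can help to check that the three files are where the module expects them. The snippet below is a small sketch using only the Python standard library; the function names (expected_files, missing_files) and the role labels are hypothetical, and it assumes the script is run from the course root directory.

```python
from pathlib import Path

# Month of each sample mapped to its role in this module.
MONTHS = {"2021-01": "train", "2021-02": "evaluate", "2021-03": "predict"}


def expected_files(data_dir="00-data"):
    """Return the expected parquet path for each role."""
    return {
        role: Path(data_dir) / f"yellow_tripdata_{month}.parquet"
        for month, role in MONTHS.items()
    }


def missing_files(data_dir="00-data"):
    """List the expected files that are not present in data_dir."""
    return [path for path in expected_files(data_dir).values() if not path.exists()]


if __name__ == "__main__":
    for path in missing_files():
        print(f"Missing: {path}")
```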
A notebook implementing the machine learning steps to predict taxi trip duration can be found in the course's GitHub repository, in the introduction lesson.
Since machine learning itself is not the main focus of the course, let's just run the notebook in your local JupyterLab container.
- First, create the JupyterLab image and network by running
make prepare-mlops-crashcourse
- Then, create the local JupyterLab container by running
make launch-mlops-crashcourse
When the JupyterLab UI asks for a token, enter 'MLOPS'.
- Finally, open
lessons/01-intro/practice-intro-supinfo.ipynb
and try running the notebook and understanding the modeling implementation.