selimamrouni/best-data-scientist-france-2018

best-data-scientist-france-2018

This is a small project on a dataset released during a data science contest in France. The dataset comes from Label Emmaüs, a French non-profit organization.

Unfortunately, I was not in France during the contest and could not take part in it. However, the dataset was released and is available on the Meilleur Data Scientist platform, where it is still possible to submit predictions.

Label Emmaüs sells objects renovated or created by the Emmaüs movement. The aim is to estimate the time range in which each object will be sold.

This is a multi-class classification problem with 3 classes:

  • 0 : between 0 and 10 days
  • 1 : between 10 and 60 days
  • 2 : more than 60 days

The evaluation metric is the multi-class log loss.
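As an illustration of the metric, here is a minimal sketch of a multi-class log loss in NumPy. It assumes the standard definition (mean negative log-probability of the true class); the exact clipping used by the platform is not documented here, so `eps` and the function name `multiclass_log_loss` are assumptions.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalise each row so it still sums to 1.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, np.asarray(y_true)])))

# A uniform guess over the 3 classes, used as a reference point.
uniform = np.full((4, 3), 1 / 3)
baseline = multiclass_log_loss([0, 1, 2, 1], uniform)
```

A uniform prediction over three classes scores ln(3) ≈ 1.0986, which is a useful baseline when reading log loss values for this task.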

For this project, I started with some feature engineering, then used machine learning and deep learning techniques.

The files are:

  • X_train.csv, X_test.csv, y_train.csv: Data files.
  • description.pdf: An overview of the challenge proposed by the platform.
  • meilleur_DS_france.ipynb: The notebook file.

Feature Engineering

This is a real-world data problem, so some data cleaning was carried out and some features were removed: those with too many NaNs and those that are not relevant (like listing URLs).
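This kind of cleaning step could be sketched with pandas as follows. The 50% threshold, the helper name `drop_sparse_columns`, and the toy column names are assumptions for illustration; they are not taken from the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nan_ratio=0.5, extra_drop=()):
    """Drop columns whose NaN share exceeds max_nan_ratio, plus named columns."""
    nan_ratio = df.isna().mean()  # share of missing values per column
    to_drop = set(nan_ratio[nan_ratio > max_nan_ratio].index)
    to_drop |= set(extra_drop) & set(df.columns)  # e.g. URL-like columns
    return df.drop(columns=list(to_drop))

# Hypothetical toy data: "weight" is mostly missing, "url" is not informative.
toy = pd.DataFrame({
    "price": [10.0, 25.0, np.nan, 8.0],
    "weight": [np.nan, np.nan, np.nan, 1.2],
    "url": ["a", "b", "c", "d"],
})
cleaned = drop_sparse_columns(toy, max_nan_ratio=0.5, extra_drop=["url"])
```

The remaining NaNs (here in `price`) would still need to be imputed or handled before modelling.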

The numerical features were normalized and the categorical features were one-hot encoded.
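A minimal sketch of both transformations with pandas is shown below. It assumes z-score normalization, since the exact scaling used in the notebook is not stated here, and the column names are hypothetical.

```python
import pandas as pd

def normalize_and_encode(df, numeric_cols, categorical_cols):
    """Z-score the numeric columns and one-hot encode the categorical ones."""
    out = df.copy()
    num = out[numeric_cols]
    # Assumption: z-score scaling; a constant column would give a zero std.
    out[numeric_cols] = (num - num.mean()) / num.std(ddof=0)
    return pd.get_dummies(out, columns=categorical_cols)

# Hypothetical toy data.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "category": ["book", "lamp", "book"],
})
encoded = normalize_and_encode(toy, ["price"], ["category"])
```

In practice the mean and standard deviation would be computed on the training set and reused on the test set, and the one-hot columns would need to be aligned between the two.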

Finally, I used text mining techniques (length, word count, sentiment analysis) to extract features from titles and descriptions.
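The length and word-count features could look like the sketch below. The function name `text_features` and the sample title are hypothetical, and sentiment analysis is left out because the tool used for it is not stated here.

```python
def text_features(text):
    """Return simple length-based features for a title or description."""
    if not isinstance(text, str):  # treat missing values (None, NaN) as empty
        text = ""
    return {
        "n_chars": len(text),          # total number of characters
        "n_words": len(text.split()),  # whitespace-separated tokens
    }

# Hypothetical listing title.
features = text_features("Lampe vintage en laiton")
```

Applied to a pandas column, the same function could be used through `Series.apply` to build one numeric column per feature.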

Machine Learning

I tested several classification pipelines, combining PCA at different levels of dimensionality reduction with different classification algorithms. The best algorithm was logistic regression with no PCA. It scored a log loss of 1.01409 on the platform.
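A comparison of this kind could be sketched with scikit-learn as below. The synthetic data, the candidate numbers of components, and the helper `build_pipeline` are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic 3-class data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

def build_pipeline(n_components=None):
    """Logistic regression, optionally preceded by PCA."""
    steps = []
    if n_components is not None:
        steps.append(("pca", PCA(n_components=n_components)))
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)

# Cross-validated log loss for each setting (None means no PCA).
scores = {}
for n_components in (None, 5, 10):
    cv = cross_val_score(build_pipeline(n_components), X, y,
                         cv=5, scoring="neg_log_loss")
    scores[n_components] = -cv.mean()  # flip sign: lower is better
```

Scoring with `neg_log_loss` keeps the model selection aligned with the platform metric; which setting comes out best depends on the data.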

Deep Learning

I prototyped a neural network model using Keras. I faced overfitting and used both early stopping and dropout to limit it. I think the dataset is too small; with more data, we might reach better results. It scored a log loss of 0.99662 on the platform.
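A minimal sketch of a Keras classifier with dropout and early stopping follows. The layer sizes, dropout rate, patience, and random data are assumptions for illustration; the architecture used in the notebook is not described here.

```python
import numpy as np
from tensorflow import keras

# Random stand-in data: 200 samples, 20 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)).astype("float32")
y = rng.integers(0, 3, size=200)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.5),  # randomly zero half the activations in training
    keras.layers.Dense(3, activation="softmax"),  # one probability per class
])
# Sparse categorical cross-entropy is the multi-class log loss for integer labels.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Stop when validation loss stops improving and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(X, y, validation_split=0.2, epochs=20, batch_size=32,
                    callbacks=[early_stop], verbose=0)
proba = model.predict(X, verbose=0)
```

Monitoring `val_loss` ties the stopping criterion to the same quantity the platform evaluates, computed on held-out data.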

Thanks to

Special thanks to Artem Golubin (rushter). The one-hot encoding is inspired by this amazing repository: https://github.com/rushter/heamy/blob/master/heamy/feature.py#L7

Author
