We chose movie plot genre classification because it opens many avenues for exploration with data science methods and has direct applications in recommendation engines. The project explores several classification algorithms, examines how they behave, and works to improve the accuracy of genre prediction.
Our goals for the project are to:
- Conduct EDA on the data
- Learn and apply NLP to the plot content
- Test various classical and deep learning models
- Build a pipeline around the best-performing model and pickle the models
- Develop a simple-to-use web application that serves as an API for classifying new plots
The data comes from Kaggle:
https://www.kaggle.com/jrobischon/wikipedia-movie-plots
The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:
- Release Year - Year in which the movie was released
- Title - Movie title
- Origin/Ethnicity - Origin of movie (e.g. American, Bollywood, Tamil)
- Director - Director(s)
- Genre - Movie Genre(s)
- Wiki Page - URL of the Wikipedia page from which the plot description was scraped
- Plot - Long form description of movie plot
This data was scraped from Wikipedia.
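As a minimal sketch of how the columns above might be read and sanity-checked, the example below parses a CSV with Python's standard `csv` module. The sample row and the helper name `load_movies` are made up for illustration and are not taken from the dataset or the project code.

```python
import csv
import io

# Column names as listed in the description above.
EXPECTED_COLUMNS = [
    "Release Year", "Title", "Origin/Ethnicity", "Director",
    "Genre", "Wiki Page", "Plot",
]

# Hypothetical one-row sample standing in for the real Kaggle file.
SAMPLE_CSV = '''Release Year,Title,Origin/Ethnicity,Director,Genre,Wiki Page,Plot
1999,Example Movie,American,Jane Doe,comedy,https://en.wikipedia.org/wiki/Example,"A clerk wins the lottery, then loses the ticket."
'''


def load_movies(handle):
    """Read movie rows from an open CSV handle and verify the expected columns."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


movies = load_movies(io.StringIO(SAMPLE_CSV))
print(len(movies), movies[0]["Genre"])
```

For the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the CSV was saved.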
Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques (mostly graphical) to:
- maximize insight into a data set
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
- develop parsimonious models
- determine optimal factor settings
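Two of the aims above, uncovering structure and detecting outliers, can be illustrated without plotting. The sketch below counts genre labels and flags unusually short plots. The sample rows are invented, and the "half the median length" threshold is an arbitrary assumption rather than the project's actual rule.

```python
from collections import Counter
from statistics import median

# Hypothetical rows; the real dataset has the columns described earlier.
movies = [
    {"Genre": "comedy", "Plot": "A clerk wins the lottery and loses the ticket."},
    {"Genre": "drama", "Plot": "Two brothers feud over the family farm."},
    {"Genre": "comedy", "Plot": "A wedding goes wrong."},
    {"Genre": "unknown", "Plot": "Lost."},
]

# Structure: how many movies carry each genre label?
genre_counts = Counter(movie["Genre"] for movie in movies)

# Outliers: plots far shorter than the typical plot, measured in words.
plot_lengths = [len(movie["Plot"].split()) for movie in movies]
typical_length = median(plot_lengths)
short_plots = [
    movie for movie, length in zip(movies, plot_lengths)
    if length < typical_length / 2  # arbitrary illustrative threshold
]

print(genre_counts.most_common())
print([movie["Plot"] for movie in short_plots])
```

On the full dataset, the same counts would show how imbalanced the genre labels are, which matters when judging classifier accuracy.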
Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Library used: Natural Language Toolkit (NLTK). The preprocessing steps are:
- Convert the movie plot to lowercase
- Remove stop words
- Stemming
- Lemmatization
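The preprocessing steps listed above can be sketched with NLTK roughly as follows. This is an illustration under assumptions, not the project's actual code: the helper names (`load_stop_words`, `load_lemmatizer`, `preprocess`) are hypothetical, tokenization uses a simple regular expression, and the caller chooses between stemming and lemmatization, since the source does not say whether both were applied in sequence.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer


def load_stop_words():
    """Return NLTK's English stop word list (fetches the corpus if needed)."""
    nltk.download("stopwords", quiet=True)
    return set(stopwords.words("english"))


def load_lemmatizer():
    """Return a WordNet lemmatizer (fetches the WordNet data if needed)."""
    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)
    return WordNetLemmatizer()


def preprocess(plot, stop_words, normalize):
    """Lowercase the plot, split it into words, drop stop words,
    then apply `normalize` (a stemmer or lemmatizer function) to each word."""
    tokens = re.findall(r"[a-z]+", plot.lower())
    return [normalize(token) for token in tokens if token not in stop_words]


# Offline demo: Porter stemming with a tiny hand-picked stop word set
# (a stand-in for NLTK's list, so no corpus download is needed here).
stemmer = PorterStemmer()
demo_tokens = preprocess(
    "The detectives were chasing the thieves",
    {"the", "were"},
    stemmer.stem,
)
print(demo_tokens)
```

To use NLTK's own resources instead of the stand-ins, one could call `preprocess(plot, load_stop_words(), load_lemmatizer().lemmatize)`. Stemming is faster but can produce non-words, while lemmatization maps words to dictionary forms, so which one suits the classifier is worth comparing empirically.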