Introduction & Background

We chose to classify movie plots into genres because the task opens up many exploratory paths with data science methods and because genre prediction is used in recommendation engines. The project explores several classification algorithms, studies how they behave, and works to improve the accuracy of genre prediction.

Our goals for the project are to:

Conduct EDA on the data
Learn and apply NLP to the plot content
Test various classical and deep learning models
Build a pipeline for the best-performing model and pickle the models
Develop a simple-to-use web application that serves as an API for classifying new plots

Data Source and Description

The data is taken from Kaggle:

https://www.kaggle.com/jrobischon/wikipedia-movie-plots

Content

The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

Release Year - Year in which the movie was released
Title - Movie title
Origin/Ethnicity - Origin of movie (e.g., American, Bollywood, Tamil)
Director - Director(s)
Genre - Movie Genre(s)
Wiki Page - URL of the Wikipedia page from which the plot description was scraped
Plot - Long form description of movie plot
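
As a minimal sketch of how these columns might be read and checked, the example below uses Python's standard `csv` module. The helper name `read_movies`, the `EXPECTED_COLUMNS` list, and the one-row sample are illustrative assumptions, not code from this repository; the sample row is made up and stands in for the real Kaggle file.

```python
import csv
import io

# Column names as listed in the description above.
EXPECTED_COLUMNS = [
    "Release Year", "Title", "Origin/Ethnicity",
    "Director", "Genre", "Wiki Page", "Plot",
]

def read_movies(handle):
    """Read movie rows from an open CSV file and check the header.

    Hypothetical helper: raises ValueError if a listed column is missing.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# Made-up sample row standing in for the real dataset file.
SAMPLE_CSV = (
    "Release Year,Title,Origin/Ethnicity,Director,Genre,Wiki Page,Plot\n"
    "1999,Example Film,American,Jane Doe,drama,"
    'https://example.org/wiki/Example_Film,"A made-up plot, for illustration."\n'
)

movies = read_movies(io.StringIO(SAMPLE_CSV))
print(movies[0]["Title"], movies[0]["Genre"])
```

To read the downloaded file instead, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the Kaggle CSV was saved.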

Acknowledgements

This data was scraped from Wikipedia.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques (mostly graphical) to:

maximize insight into a data set
uncover underlying structure
extract important variables
detect outliers and anomalies
test underlying assumptions
develop parsimonious models
determine optimal factor settings
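
Two of the aims above, gaining insight into the data and spotting anomalies, can be illustrated with a small non-graphical sketch. It counts how often each genre occurs and measures plot length in words, using only the standard library. The rows are made up for illustration, and `genre_counts` and `plot_lengths` are hypothetical helper names rather than functions from the project notebooks.

```python
from collections import Counter
import statistics

# Made-up rows that mimic the Genre and Plot columns of the dataset.
rows = [
    {"Genre": "drama", "Plot": "A family moves to a new town."},
    {"Genre": "comedy", "Plot": "Two friends swap jobs for a week."},
    {"Genre": "drama", "Plot": "A lawyer takes on one last case."},
    {"Genre": "unknown", "Plot": "Short."},
]

def genre_counts(rows):
    """Count how many movies carry each (normalized) genre label."""
    return Counter(row["Genre"].strip().lower() for row in rows)

def plot_lengths(rows):
    """Return the number of whitespace-separated words in each plot."""
    return [len(row["Plot"].split()) for row in rows]

counts = genre_counts(rows)
lengths = plot_lengths(rows)

# Frequent genres hint at class imbalance; very short plots are
# candidates for a closer look as anomalies.
print(counts.most_common(3))
print("median plot length:", statistics.median(lengths))
print("shortest plot length:", min(lengths))
```

The same counts and lengths could be passed to a plotting library to produce the bar charts and histograms that EDA usually relies on.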

Natural Language Processing (NLP)

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Library used: Natural Language Toolkit (NLTK)

Steps Followed

Convert the movie plot to lowercase
Remove stop words
Stemming
Lemmatization
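
The steps above can be sketched as a single function. This is an illustration under assumptions, not the project's actual code: `preprocess` and `nltk_components` are hypothetical names, the steps are chained in the order listed, tokens are split with a simple regular expression, and the Porter stemmer is assumed because the README does not say which NLTK stemmer was used. The demo call passes stand-in components so it runs without NLTK data.

```python
import re

def preprocess(plot, stop_words, stem, lemmatize):
    """Lowercase a plot, drop stop words, then stem and lemmatize each token."""
    tokens = re.findall(r"[a-z]+", plot.lower())          # 1. lowercase, split into words
    tokens = [t for t in tokens if t not in stop_words]   # 2. remove stop words
    tokens = [stem(t) for t in tokens]                    # 3. stemming
    return [lemmatize(t) for t in tokens]                 # 4. lemmatization

def nltk_components():
    """Build a stop-word set, stemmer, and lemmatizer from NLTK.

    Requires NLTK plus its 'stopwords' and 'wordnet' data, which can be
    fetched with nltk.download("stopwords") and nltk.download("wordnet").
    """
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    return (
        set(stopwords.words("english")),
        PorterStemmer().stem,           # assumed stemmer choice
        WordNetLemmatizer().lemmatize,
    )

# Stand-in components so the example runs without NLTK installed.
demo = preprocess(
    "The Detective chases THE thief",
    stop_words={"the"},
    stem=lambda token: token,
    lemmatize=lambda token: token,
)
print(demo)
```

With NLTK and its data available, the real components can be plugged in with `stop_words, stem, lemmatize = nltk_components()` followed by `preprocess(plot, stop_words, stem, lemmatize)`.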
