Contents |
---|
Dataset Description |
Columns Description |
Questions for Analysis |
Data Wrangling |
Data Cleaning |
Exploratory Data Analysis |
Built with |
This dataset contains information about 10,000 movies extracted from TMDB, released between 1960 and 2015, including user ratings and revenue. The original data comes from Kaggle.
- `id`, `imdb_id`: unique id or IMDb id for each movie on TMDB.
- `popularity`: a metric used to measure the popularity of the movie.
- `budget`: the total budget of the movie in USD.
- `revenue`: the total revenue of the movie in USD.
- `original_title`: the original title of the movie.
- `cast`: the names of the cast of the movie, separated by "|".
- `homepage`: the website of the movie (if it existed).
- `director`: name(s) of the director(s) of the movie (separated by "|" if there is more than one director).
- `tagline`: a catchphrase describing the movie.
- `keywords`: keywords related to the movie.
- `overview`: summary of the plot of the movie.
- `runtime`: total runtime of the movie in minutes.
- `genres`: genres of the movie, separated by "|".
- `production_companies`: production company or companies of the movie.
- `release_date`: release date of the movie.
- `vote_count`: number of voters for the movie.
- `vote_average`: the average user rating of the movie.
- `release_year`: release year of the movie (from 1960 to 2015).
- `budget_adj`: the total budget of the movie in USD in terms of 2010 dollars, accounting for inflation over time.
- `revenue_adj`: the total revenue of the movie in USD in terms of 2010 dollars, accounting for inflation over time.
- Do movies with high popularity achieve high revenue?
- What are the most filmed genres in this whole dataset?
- Is there a correlation between a movie's budget and its revenue?
Our data can be found in the `tmdb-movies.csv` file provided in this repository. It is an edited version of the original Kaggle TMDB 5000 Movie Dataset, provided by Udacity in the Become a Data Analyst Nanodegree Program.
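
As a minimal sketch of a first look at the file, the snippet below reads a CSV with pandas and reports its shape, duplicate count and missing values. The helper name `load_movies` is hypothetical, and the inline rows are made-up stand-ins so the example runs without the real file; in practice you would pass the path `tmdb-movies.csv` instead.

```python
import io

import pandas as pd


def load_movies(source):
    """Read the movies CSV from a path or a file-like object (hypothetical helper)."""
    return pd.read_csv(source)


# Made-up stand-in for tmdb-movies.csv with only a few of its columns.
sample_csv = io.StringIO(
    "id,original_title,budget,revenue\n"
    "1,Movie A,100,250\n"
    "2,Movie B,50,80\n"
    "2,Movie B,50,80\n"
    "3,Movie C,,40\n"
)
df = load_movies(sample_csv)

n_rows, n_cols = df.shape                 # number of rows and columns
n_duplicates = int(df.duplicated().sum()) # fully identical rows after the first
missing_per_column = df.isnull().sum()    # null count for each column

print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
```

With the real file, the same three checks would be one way to arrive at the row, column, duplicate and missing-value figures reported for this dataset.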
Main Observations:
- Our dataset consisted of a total of 10866 rows and 21 columns.
- We had only 1 duplicated row, which was dropped.
- Some columns would not be useful in answering our questions, so they were dropped.
- A few columns had many missing values that needed to be handled.
- The `cast`, `director` and `genres` columns had values separated by a '|'.
- The data type of `release_date` needed to be cast.
- We could append a `profit` column for each movie, computed with a formula.
- `vote_average` would be better presented as a categorical variable that groups multiple rating values.
- We might also categorize the `profit` column for better EDA.
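
The observations above could translate into a cleaning function along these lines. This is a sketch under several assumptions, not the exact steps used in the notebook: missing values are handled by dropping rows, `profit` is taken to be `revenue - budget`, the `rating_level` column name and its bins and labels are hypothetical, and the list of unused columns is left to the caller. The sample rows are made up and use ISO dates for simplicity.

```python
import pandas as pd


def clean_movies(df, unused_columns):
    """Return a cleaned copy of df (hypothetical helper, see assumptions above)."""
    out = df.drop_duplicates()
    out = out.drop(columns=unused_columns, errors="ignore")
    # Assumption: rows with any remaining missing value are dropped.
    out = out.dropna().reset_index(drop=True)
    # Cast release_date from string to datetime.
    out["release_date"] = pd.to_datetime(out["release_date"])
    # Assumption: profit is revenue minus budget.
    out["profit"] = out["revenue"] - out["budget"]
    # Assumption: three rating groups with hypothetical bin edges and labels.
    out["rating_level"] = pd.cut(
        out["vote_average"],
        bins=[0, 5, 7, 10],
        labels=["low", "medium", "high"],
        include_lowest=True,
    )
    return out


# Made-up rows: one exact duplicate and one row with a missing genre.
raw = pd.DataFrame({
    "original_title": ["A", "A", "B", "C"],
    "homepage": ["http://a.example", "http://a.example", "http://b.example", None],
    "genres": ["Drama", "Drama", "Comedy|Action", None],
    "release_date": ["2015-06-09", "2015-06-09", "1999-03-31", "1980-01-01"],
    "budget": [100, 100, 50, 10],
    "revenue": [250, 250, 30, 5],
    "vote_average": [7.5, 7.5, 4.0, 6.0],
})
clean = clean_movies(raw, unused_columns=["homepage"])
print(clean)
```

Dropping rows is only one way to handle missing values; filling them is another option, depending on which columns the analysis needs.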
After cleaning the dataset, we ended up with a total of 10840 records and 10 columns. The dataset now has no duplicates or null values, and the data types are consistent, with suitable categorical variables to address our questions. We then performed some analysis and created visualizations to answer our target questions.
More popular movies receive far more revenue than less popular movies.
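
One way to check a claim like this is to split movies at the median popularity and compare mean revenue in each half. The sketch below assumes that split; the notebook may have grouped popularity differently. The helper name and the sample values are hypothetical.

```python
import pandas as pd


def mean_revenue_by_popularity(df):
    """Mean revenue for movies above vs. at or below the median popularity."""
    median_popularity = df["popularity"].median()
    # Label each movie "high" or "low" relative to the median.
    level = df["popularity"].gt(median_popularity).map({True: "high", False: "low"})
    return df.groupby(level)["revenue"].mean()


# Made-up values for illustration only.
movies = pd.DataFrame({
    "popularity": [0.5, 1.0, 2.0, 8.0],
    "revenue": [10, 20, 100, 400],
})
revenue_by_level = mean_revenue_by_popularity(movies)
print(revenue_by_level)
```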
`Drama`, `Comedy` and `Action` are the three most filmed genres among the 10839 movies in our dataset. The `Drama` genre alone is filmed 22.6% of the time in our dataset.
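
Because `genres` holds several values separated by "|", counting genres requires splitting each entry first. The following is a minimal sketch of that step using made-up rows; the helper name `genre_counts` is hypothetical, and the share shown is the share of all genre labels, which is only one possible way to define a percentage.

```python
import pandas as pd


def genre_counts(df):
    """Count how often each genre appears across all movies."""
    # A single-character separator is treated literally by str.split.
    return df["genres"].str.split("|").explode().value_counts()


# Made-up rows for illustration only.
movies = pd.DataFrame({
    "genres": ["Drama", "Drama|Comedy", "Action|Comedy|Drama", "Action"],
})
counts = genre_counts(movies)
label_share = counts / counts.sum()  # share of all genre labels, not of movies
print(counts)
print(label_share)
```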
There is a positive correlation between `budget` and `revenue`, indicating a relationship between them, with few outliers found.
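
A correlation like this can be quantified with the Pearson coefficient, which pandas computes by default in `Series.corr`. The sketch below uses made-up numbers and a hypothetical helper name; it does not reproduce the coefficient from the actual analysis.

```python
import pandas as pd


def budget_revenue_correlation(df):
    """Pearson correlation between the budget and revenue columns."""
    return df["budget"].corr(df["revenue"])


# Made-up values for illustration only.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "revenue": [15, 45, 50, 90],
})
corr = budget_revenue_correlation(movies)
print(round(corr, 3))
```

A value close to 1 indicates a strong positive linear relationship; a scatter plot of the two columns is a useful companion for spotting outliers.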
- JupyterLab
- Python 3
- Pandas
- NumPy