Contents |
---|
Dataset Description |
Columns Description |
EDA Questions |
Data Wrangling |
Data Cleaning |
Data Visualization |
Conclusion |
Built with |
This dataset contains information about 9,000+ movies extracted from the TMDB API.
- Release_Date: Date when the movie was released.
- Title: Name of the movie.
- Overview: Brief summary of the movie.
- Popularity: An important metric computed by TMDB developers based on the number of views per day, votes per day, the number of users who marked the movie as "favorite" or added it to their "watchlist", the release date, and other metrics.
- Vote_Count: Total votes received from viewers.
- Vote_Average: Average rating out of 10, based on the vote count and the number of viewers.
- Original_Language: Original language of the movie. A dubbed version is not considered the original language.
- Genre: Categories the movie can be classified under.
- Poster_Url: URL of the movie poster.
- Q1: What is the most frequent genre in the dataset?
- Q2: Which genres have the highest votes?
- Q3: Which movie has the highest popularity, and what is its genre?
- Q4: Which year has the most filmed movies?
Our data can be found in the mymoviedb.csv file provided in this repository, downloaded from Kaggle.
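
A first look at the file might be sketched as follows. The rows below are made up and only stand in for mymoviedb.csv so the snippet runs on its own; the column names are the ones listed in the dataset description.

```python
import io

import pandas as pd

# Made-up rows standing in for mymoviedb.csv (assumption: the real file uses
# the nine column names from the dataset description as its header).
sample_csv = io.StringIO(
    "Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,"
    "Original_Language,Genre,Poster_Url\n"
    '2021-12-15,Movie A,Some summary,900.0,8940,8.3,en,"Action, Adventure",'
    "https://example.com/a.jpg\n"
    "2020-03-01,Movie B,Another summary,120.5,310,6.1,fr,Drama,"
    "https://example.com/b.jpg\n"
)

# With the real file this would be: df = pd.read_csv("mymoviedb.csv")
df = pd.read_csv(sample_csv)

n_rows, n_cols = df.shape                    # size of the dataframe
n_missing = int(df.isna().sum().sum())       # total number of NaN cells
n_duplicates = int(df.duplicated().sum())    # number of fully duplicated rows

print(n_rows, n_cols, n_missing, n_duplicates)
```

The same three checks (shape, NaNs, duplicates) are the ones the wrangling notes rely on.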
- We have a dataframe consisting of 9827 rows and 9 columns.
- Our dataset looks fairly tidy, with no NaNs or duplicated values.
- The Release_Date column needs to be cast to datetime, keeping only the year value.
- Overview, Original_Language and Poster_Url won't be very useful during analysis, so we'll drop them.
- There are noticeable outliers in the Popularity column.
- Vote_Average is better categorized for proper analysis.
- The Genre column has comma-separated values and white spaces that need to be handled, and it should be cast to category.
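
The first three cleaning steps could look roughly like this. It is a minimal sketch on made-up rows: the quartile-based binning with `pd.qcut` and the four label names are assumptions for illustration, not necessarily the binning used in the notebook.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "Release_Date": ["2021-12-15", "2020-03-01", "2019-07-04", "2018-11-23",
                     "2017-05-05", "2016-02-12", "2015-09-18", "2014-01-10"],
    "Title": [f"Movie {c}" for c in "ABCDEFGH"],
    "Overview": ["summary"] * 8,
    "Popularity": [900.0, 120.5, 75.0, 60.2, 44.1, 30.0, 21.7, 13.6],
    "Vote_Count": [8940, 310, 150, 95, 80, 60, 45, 20],
    "Vote_Average": [8.5, 6.0, 7.0, 8.0, 5.5, 6.5, 7.5, 5.0],
    "Original_Language": ["en"] * 8,
    "Genre": ["Drama"] * 8,
    "Poster_Url": ["https://example.com/poster.jpg"] * 8,
})

# Cast Release_Date to datetime and keep only the year.
df["Release_Date"] = pd.to_datetime(df["Release_Date"]).dt.year

# Drop the columns that are not used in the analysis.
df = df.drop(columns=["Overview", "Original_Language", "Poster_Url"])

# Bin Vote_Average into four quartile-based categories.
# The label names are hypothetical.
labels = ["not_popular", "below_avg", "average", "popular"]
df["Vote_Average"] = pd.qcut(df["Vote_Average"], q=4, labels=labels)

print(df.dtypes)
```

Quartile bins give each category roughly a quarter of the rows; fixed rating thresholds would be an equally reasonable choice if specific cut-offs matter.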
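We ended up with a dataframe of 6 columns and 25551 rows to dig into during our analysis after completing our cleaning. The row count is higher than the original 9827, most likely because splitting the comma-separated Genre values leaves one row per movie and genre.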
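
Splitting the Genre column is the step that changes the row count. A minimal sketch on made-up rows, assuming the split is done with pandas' `str.split` and `explode`:

```python
import pandas as pd

# Made-up rows; Genre holds comma-separated values with stray white space.
df = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Action, Adventure, Science Fiction", "Drama"],
})

# Turn each comma-separated string into a list of genres.
df["Genre"] = df["Genre"].str.split(",")

# Give every genre its own row; the other columns are repeated.
df = df.explode("Genre").reset_index(drop=True)

# Remove the leftover white space and cast to category.
df["Genre"] = df["Genre"].str.strip().astype("category")

print(df)
```

Here two movies become four rows, which is the same effect that takes a dataset from thousands of movies to tens of thousands of movie and genre pairs.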
Using Matplotlib and Seaborn, we made several visuals and charts to gain insight into any correlation between the attributes in our dataset. They are discussed in the next section.
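
As an illustration of the kind of chart involved, a genre frequency bar chart could be drawn as below. This sketch uses made-up rows and plain Matplotlib only; the actual notebook charts and their styling may differ.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up Genre column, one row per movie and genre.
genres = pd.Series(
    ["Drama", "Drama", "Action", "Comedy", "Drama", "Action"], name="Genre"
)

# Count rows per genre, most frequent first.
counts = genres.value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(counts.index.tolist(), counts.values)
ax.invert_yaxis()  # put the most frequent genre at the top
ax.set_xlabel("Number of rows")
ax.set_title("Genre frequency")
fig.tight_layout()
```

In a notebook the figure is displayed inline; in a script, `fig.savefig(...)` would write it to disk.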
These are the conclusions we derived after completing our data visualization phase.
- Drama is the most frequent genre in our dataset, appearing more than 14% of the time among 19 other genres.
- 25.5% of our dataset has a popular vote (6520 rows). Drama again ranks highest among fans, with a share of more than 18.5% of movie popularity.
- Spider-Man: No Way Home has the highest popularity in our dataset, and its genres are Action, Adventure and Science Fiction.
- The year 2020 has the highest number of filmed movies in our dataset.
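
Numbers like these can be read off the cleaned dataframe with a few pandas calls. The sketch below uses made-up rows, so its values are not the ones reported above; counting each title once per year is a design choice made here, not necessarily the notebook's.

```python
import pandas as pd

# Made-up cleaned rows: one row per movie and genre, Release_Date already a year.
df = pd.DataFrame({
    "Release_Date": [2021, 2021, 2020, 2020, 2020],
    "Title": ["Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "Popularity": [900.0, 900.0, 120.5, 75.0, 75.0],
    "Genre": ["Action", "Adventure", "Drama", "Drama", "Comedy"],
})

# Q1: share of rows per genre, and the most frequent genre.
genre_share = df["Genre"].value_counts(normalize=True)
top_genre = genre_share.idxmax()

# Q3: rows of the movie with the highest popularity, with its genres.
top_movie = df.loc[df["Popularity"] == df["Popularity"].max(), ["Title", "Genre"]]

# Q4: year with the most movies, counting each title once.
busiest_year = df.drop_duplicates("Title")["Release_Date"].value_counts().idxmax()

print(top_genre, round(genre_share[top_genre], 2))
print(top_movie)
print(busiest_year)
```

Counting years on the exploded rows instead would weight movies with many genres more heavily, which is why the sketch drops duplicate titles first.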
- JupyterLab
- Python3
- Pandas
- NumPy
- Matplotlib
- Seaborn