IMDb Data Exploration and Analysis¶
Creating a clean dataset for exploration and insight-building in Tableau¶
Dataset: IMDb Non-Commercial Datasets
Publisher: IMDb
URL: https://datasets.imdbws.com/
Analysis and Visualizations by: Jacob Krol
Project Scope:¶
Step 1: Plan¶
During this stage of the process, I am mostly concerned with forming a generalized approach to the dataset. The pure planning stage generally consists of collaboration between cross-functional teams and is great for establishing the scope and KPIs of a project. This project's scope has been defined above.
The primary questions which I am concerned with at this stage are:
- What is the scope of the project?
- What are the key deliverables expected for this project?
- Who are the key stakeholders in this project?
- What is the timeline for the project?
- How was the data collected?
- Is this data reliable?
- Are there ethical concerns associated with the use of or contents of the data set?
- What Python packages will be needed for this analysis?
- What models may be useful for this project?
Again, as this is an analysis for the sake of demonstration with a public dataset, many of the above questions are not relevant. The data will be assumed accurate for this practice. However, ethical considerations should still always be taken; in this case, there are no data or variables present which I believe to be unethical. With regard to which model may be useful, I am immediately drawn to the Poisson regression model, since this is what would generally be used for forecasting count data such as sales. That being said, I will not rule out the Random Forest approach, which could offer a less transparent but higher-scoring model overall.
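To make the Poisson option concrete, here is a minimal sketch of fitting a Poisson regression with a log link via Newton-Raphson on synthetic counts. Everything here (the `fit_poisson` helper, the simulated data) is illustrative only, not drawn from the IMDb set:

```python
import numpy as np

def fit_poisson(x, y, iters=25):
    # Newton-Raphson for a Poisson GLM with log link: mu = exp(b0 + b1 * x)
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)            # score vector
        hess = X.T @ (X * mu[:, None])   # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.poisson(np.exp(0.5 + 1.2 * x))  # true coefficients: 0.5, 1.2
beta = fit_poisson(x, y)
print(beta)  # should land near [0.5, 1.2]
```

In practice a library implementation (e.g. a GLM from statsmodels or scikit-learn's `PoissonRegressor`) would replace the hand-rolled solver; the point is only that count targets suggest this family of models.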
Step 1.5: Plan/Analyze¶
During this stage in the process, it is important to become familiar with the dataset. This is largely where the technical planning begins. Once I have a better idea of the contents of the dataset, I can begin forming basic visualizations which will then inform which variables are focused on, which data engineering methods will be used to generate additional useful derived data, and which modeling methods will be the most useful.
The primary questions which I am concerned with at this stage are:
- What variables are included in the dataset?
- How many variables are in the dataset?
- How many entries are in the dataset?
- Are there any null values or outliers present in the dataset?
- Is there any apparent trend contributing to the outliers or null values that will affect how they are handled?
- Does an initial analysis uncover any multicollinearity which needs to be considered?
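The column-level questions above can be answered quickly with a small helper; a sketch (the `profile` name and demo frame are my own, not part of this project's code):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # One row per column: dtype, null count, and distinct-value count
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'nulls': df.isna().sum(),
        'unique': df.nunique(),
    })

demo = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'x', 'y']})
print(profile(demo))
```

Applied to each IMDb table as it is loaded, this surfaces the variable count, null counts, and cardinality in one call.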
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Silence all warnings (mostly noise from seaborn)
import warnings
warnings.filterwarnings("ignore")
cwd = os.getcwd()
# Show all rows
pd.set_option('display.max_rows', None)
# Import the datasets into pandas dataframes one at a time and clean them.
# We don't plan to work with any adult films, so we'll filter those out. This is in title.basics.tsv
title_basics = pd.read_csv(cwd + '/data/title.basics.tsv', sep='\t')
# Describing the dataset
title_basics.describe()
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | |
---|---|---|---|---|---|---|---|---|---|
count | 10673965 | 10673965 | 10673947 | 10673947 | 10673965 | 10673965 | 10673965 | 10673965 | 10673745 |
unique | 10673965 | 11 | 4806019 | 4829549 | 30 | 153 | 97 | 899 | 2371 |
top | tt0000001 | tvEpisode | Episode #1.1 | Episode #1.1 | 0 | \N | \N | \N | Drama |
freq | 1 | 8165289 | 51208 | 51208 | 10269704 | 1400226 | 10552418 | 7360018 | 1218510 |
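Since the describe output shows every column arriving as a string anyway, one option (not applied here) is to make that explicit at read time and declare IMDb's `\N` placeholder as missing up front; a sketch on an inline TSV rather than the real file:

```python
import io
import pandas as pd

tsv = "tconst\tisAdult\ntt0000001\t0\ntt0000002\t\\N\n"
# Read every column as str and treat the \N placeholder as NaN immediately
df = pd.read_csv(io.StringIO(tsv), sep='\t', dtype=str, na_values=['\\N'])
print(df['isAdult'].tolist())  # → ['0', nan]
```

This avoids pandas' chunked dtype inference on a 10M-row file and removes the need for a separate `\N`-replacement pass later.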
title_basics.head()
tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | |
---|---|---|---|---|---|---|---|---|---|
0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |
This isn't great: it looks like all of the data has been read in as strings...
Let's start with the isAdult column, since filtering on it will trim things down quite a bit. It is being read as a string with 30 unique values, so let's see how many rows fall into each category first. If the incorrect categories each hold only a few rows, we'll simply drop them. If not, we'll figure it out from there.
print(len(title_basics['isAdult']))
title_basics[['tconst', 'isAdult']].groupby('isAdult').count()
10673965
tconst | |
---|---|
isAdult | |
0 | 10269704 |
1 | 338506 |
1977 | 1 |
1978 | 26 |
1979 | 10 |
1980 | 2 |
1981 | 1 |
1982 | 2 |
1983 | 6 |
1984 | 14 |
1985 | 29 |
1986 | 2 |
1987 | 10 |
2005 | 1 |
2011 | 1 |
2012 | 1 |
2013 | 8 |
2014 | 6 |
2015 | 28 |
2016 | 20 |
2017 | 17 |
2018 | 9 |
2019 | 7 |
2020 | 9 |
2021 | 1 |
2022 | 2 |
2023 | 6 |
0 | 63396 |
1 | 2139 |
\N | 1 |
It looks like some of the data was mishandled while the database was created. Let's drop anything that was incorrectly loaded as a year or NA, along with all of the adult content, in one pass.
# astype(int) would fail on the stray '\N' value, so coerce instead
title_basics['isAdult'] = pd.to_numeric(title_basics['isAdult'], errors='coerce')
title_basics = title_basics[title_basics['isAdult'] == 0]
title_basics['isAdult'] = title_basics['isAdult'].astype(bool)
title_basics[['isAdult', 'tconst']].groupby('isAdult').count()
tconst | |
---|---|
isAdult | |
False | 10333100 |
Great! Now that this column isn't useful to us, we can just drop it, knowing that nothing in here should be an adult film. Next, I want to look into which types of titles are in the dataset. I mostly want to focus on movies and TV, so let's see what's in there.
title_basics.drop('isAdult', axis=1, inplace=True)
title_basics[['titleType','tconst']].groupby(['titleType']).count()
tconst | |
---|---|
titleType | |
movie | 666922 |
short | 985375 |
tvEpisode | 7943467 |
tvMiniSeries | 53242 |
tvMovie | 144858 |
tvPilot | 1 |
tvSeries | 257707 |
tvShort | 10279 |
tvSpecial | 46917 |
video | 186783 |
videoGame | 37549 |
Interesting. For the sake of keeping this data around, I am going to split each title type that I'm interested in into its own dataframe, then export it for use in future projects.
title_basics_movies = title_basics[title_basics['titleType'] == 'movie']
title_basics_tv_episodes = title_basics[title_basics['titleType'] == 'tvEpisode']
title_basics_tv_series = title_basics[title_basics['titleType'] == 'tvSeries']
title_basics_tv_movies = title_basics[title_basics['titleType'] == 'tvMovie']
title_basics_video_games = title_basics[title_basics['titleType'] == 'videoGame']
# Make sure the output directory exists before exporting
os.makedirs(cwd + '/data/title_basics_by_titleType', exist_ok=True)
title_basics_movies.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.movies.tsv', sep='\t', index=False)
title_basics_tv_episodes.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvEpisodes.tsv', sep='\t', index=False)
title_basics_tv_series.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvSeries.tsv', sep='\t', index=False)
title_basics_tv_movies.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvMovies.tsv', sep='\t', index=False)
title_basics_video_games.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.videoGames.tsv', sep='\t', index=False)
This project will focus only on movies, so going forward we can look exclusively at the movie data and drop everything else from memory.
del title_basics
del title_basics_movies
del title_basics_tv_episodes
del title_basics_tv_series
del title_basics_tv_movies
del title_basics_video_games
IMDb Movie Data Exploration and Analysis¶
The first thing that I want to do here is check the dataset for null values and get a feel for what the whole dataset of movies looks like. We'll use the title.basics.movies file to make a DF, from which we can use tconst to refine the other IMDb tables as we load them in.
title_basics = pd.read_csv(cwd + '/data/title_basics_by_titleType/title.basics.movies.tsv', sep='\t')
title_basics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 666922 entries, 0 to 666921
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   tconst          666922 non-null  object
 1   titleType       666922 non-null  object
 2   primaryTitle    666920 non-null  object
 3   originalTitle   666920 non-null  object
 4   startYear       666922 non-null  object
 5   endYear         666922 non-null  object
 6   runtimeMinutes  666922 non-null  object
 7   genres          666922 non-null  object
dtypes: object(8)
memory usage: 40.7+ MB
# Take care of the dates by parsing the year columns numerically; running them
# through pd.to_datetime treats bare years as epoch offsets (everything lands
# on 1970), so coerce to numbers instead, with '\N' placeholders becoming NaN
title_basics['startYear'] = pd.to_numeric(title_basics['startYear'], errors='coerce')
title_basics['endYear'] = pd.to_numeric(title_basics['endYear'], errors='coerce')
title_basics.head(10)
tconst | titleType | primaryTitle | originalTitle | startYear | endYear | runtimeMinutes | genres | |
---|---|---|---|---|---|---|---|---|
0 | tt0000009 | movie | Miss Jerry | Miss Jerry | 1970.0 | NaN | 45 | Romance |
1 | tt0000147 | movie | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | NaN | 100 | Documentary,News,Sport |
2 | tt0000502 | movie | Bohemios | Bohemios | 1970.0 | NaN | 100 | \N |
3 | tt0000574 | movie | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | NaN | 70 | Action,Adventure,Biography |
4 | tt0000591 | movie | The Prodigal Son | L'enfant prodigue | 1970.0 | NaN | 90 | Drama |
5 | tt0000615 | movie | Robbery Under Arms | Robbery Under Arms | 1970.0 | NaN | \N | Drama |
6 | tt0000630 | movie | Hamlet | Amleto | 1970.0 | NaN | \N | Drama |
7 | tt0000675 | movie | Don Quijote | Don Quijote | 1970.0 | NaN | \N | Drama |
8 | tt0000679 | movie | The Fairylogue and Radio-Plays | The Fairylogue and Radio-Plays | 1970.0 | NaN | 120 | Adventure,Fantasy |
9 | tt0000838 | movie | A Cultura do Cacau | A Cultura do Cacau | 1970.0 | NaN | \N | \N |
Interesting. So, even though the .info() that we called earlier indicated that there were no null values, we clearly see placeholders for null values. Let's convert those so that they are being read properly, then reassess the set. Also, we notice that the end years are all null. Since these are movies, they really just have release years, so let's drop the end year for this analysis. We also know that these are all movies, since that is the dataset that we isolated earlier.
# drop endYear column
title_basics.drop(['endYear', 'titleType'], axis=1, inplace=True)
# Replace the \N values with NaN
title_basics.replace(r'\N', np.nan, inplace=True)
title_basics.head(10)
tconst | primaryTitle | originalTitle | startYear | runtimeMinutes | genres | |
---|---|---|---|---|---|---|
0 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance |
1 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100 | Documentary,News,Sport |
2 | tt0000502 | Bohemios | Bohemios | 1970.0 | 100 | NaN |
3 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography |
4 | tt0000591 | The Prodigal Son | L'enfant prodigue | 1970.0 | 90 | Drama |
5 | tt0000615 | Robbery Under Arms | Robbery Under Arms | 1970.0 | NaN | Drama |
6 | tt0000630 | Hamlet | Amleto | 1970.0 | NaN | Drama |
7 | tt0000675 | Don Quijote | Don Quijote | 1970.0 | NaN | Drama |
8 | tt0000679 | The Fairylogue and Radio-Plays | The Fairylogue and Radio-Plays | 1970.0 | 120 | Adventure,Fantasy |
9 | tt0000838 | A Cultura do Cacau | A Cultura do Cacau | 1970.0 | NaN | NaN |
This table looks properly cleaned! Now let's see what else there is to explore by pulling up the IMDb database information to look over what our other files have to share.
IMDb Data Description¶
title.akas.tsv.gz¶
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- title (string) - the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) - 0: not original title; 1: original title
title.basics.tsv.gz¶
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) - TV Series end year. '\N' for all other title types
- runtimeMinutes - primary runtime of the title, in minutes
- genres (string array) - includes up to three genres associated with the title
title.crew.tsv.gz¶
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) - writer(s) of the given title
title.episode.tsv.gz¶
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) - season number the episode belongs to
- episodeNumber (integer) - episode number of the tconst in the TV series
title.principals.tsv.gz¶
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'
title.ratings.tsv.gz¶
- tconst (string) - alphanumeric unique identifier of the title
- averageRating - weighted average of all the individual user ratings
- numVotes - number of votes the title has received
name.basics.tsv.gz¶
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) - name by which the person is most often credited
- birthYear - in YYYY format
- deathYear - in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) - the top-3 professions of the person
- knownForTitles (array of tconsts) - titles the person is known for
I think that region and language from title.akas.tsv will be useful, as well as all of the data from title.ratings.tsv.
# import the next dataset
title_ratings = pd.read_csv(cwd + '/data/title.ratings.tsv', sep='\t')
title_akas = pd.read_csv(cwd + '/data/title.akas.tsv', sep='\t')
title_akas.rename(columns={'titleId': 'tconst'}, inplace=True)
# Merge the regional titles and the ratings onto the movies; the default
# inner join keeps only movies that appear in all three tables
title_basics = pd.merge(title_basics, title_akas, on='tconst')
title_basics = pd.merge(title_basics, title_ratings, on='tconst')
title_basics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2359924 entries, 0 to 2359923
Data columns (total 15 columns):
 #   Column           Dtype
---  ------           -----
 0   tconst           object
 1   primaryTitle     object
 2   originalTitle    object
 3   startYear        float64
 4   runtimeMinutes   object
 5   genres           object
 6   ordering         int64
 7   title            object
 8   region           object
 9   language         object
 10  types            object
 11  attributes       object
 12  isOriginalTitle  int64
 13  averageRating    float64
 14  numVotes         int64
dtypes: float64(2), int64(3), object(10)
memory usage: 270.1+ MB
title_basics.drop(['ordering', 'types', 'attributes', 'isOriginalTitle'], axis=1, inplace=True)
title_basics.replace(r'\N', np.nan, inplace=True)
title_basics.dropna(subset=['region'], inplace=True)
title_basics.head(10)
tconst | primaryTitle | originalTitle | startYear | runtimeMinutes | genres | title | region | language | averageRating | numVotes | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | AU | NaN | 5.3 | 210 |
2 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | HU | NaN | 5.3 | 210 |
3 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | US | NaN | 5.3 | 210 |
4 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Fräulein Jerry | DE | NaN | 5.3 | 210 |
6 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100 | Documentary,News,Sport | The Corbett-Fitzsimmons Fight | US | NaN | 5.2 | 509 |
7 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100 | Documentary,News,Sport | Бой Корбетта и Фитцсиммонса | RU | NaN | 5.2 | 509 |
9 | tt0000502 | Bohemios | Bohemios | 1970.0 | 100 | NaN | Bohemios | ES | NaN | 4.2 | 16 |
11 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | AU | NaN | 6.0 | 882 |
12 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | GB | NaN | 6.0 | 882 |
13 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | SG | en | 6.0 | 882 |
len(title_basics) - len(title_basics.dropna(subset=['genres']))
20109
I would like to look into genre popularity by region as well, but I don't think I want to drop the 20k null values, so I will leave them be and address them during visualization if need be.
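If those nulls do end up getting in the way in Tableau, one lightweight option (sketched here on a toy frame, not applied to the real data) is to flag them as their own category instead of dropping them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'genres': ['Drama', np.nan, 'Comedy,Romance']})
# Label missing genres explicitly so they form their own bucket in Tableau
df['genres'] = df['genres'].fillna('Unknown')
print(df['genres'].tolist())  # → ['Drama', 'Unknown', 'Comedy,Romance']
```

This keeps every row in the visualization while making the missingness visible rather than silent.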
I came across an issue while visualizing the data within Tableau: the country codes were not recognized in the ISO format, so I will go in and selectively change those country codes over. It turns out that some of the countries within the set no longer exist as they did when the data was collected, so for those regions, one representative region code will be used from among the nations that such a country has since become.
It looks as though XAS, XAU, and similar are actually continent codes. These will be dropped, as we are trying to look at the data by country. XWW, or worldwide, will be removed for the same reason.
# Replace the region codes with the country codes (assigning back to the
# column rather than calling replace(inplace=True) on a selection, which
# newer pandas versions warn about)
title_basics['region'] = title_basics['region'].replace({'BH' : 'BHA',
                                                         'CR' : 'CRI',
                                                         'AN' : 'CQ',
                                                         'BUMM' : 'MM',
                                                         'CSHH' : 'CZE',
                                                         'CSXX' : 'CS',
                                                         'DDDE' : 'DEU',
                                                         'SUHH' : 'SU',
                                                         'VDVN' : 'VD',
                                                         'XAS' : 'AS',
                                                         'XAU' : np.nan,
                                                         'XEU' : np.nan,
                                                         'XKO' : 'XK',
                                                         'XKV' : 'XK',
                                                         'XNA' : 'KNA',
                                                         'XPI' : 'ISR',
                                                         'XSA' : np.nan,
                                                         'XSI' : 'THA',
                                                         'XWG' : 'DEU',
                                                         'XWW' : np.nan,
                                                         'XYU' : 'YU',
                                                         'YUCS' : 'YU',
                                                         'ZRCD' : 'CD',
                                                         })
# Drop the rows with NaN values in the region column
title_basics.dropna(subset=['region'], inplace=True)
title_basics.head(10)
tconst | primaryTitle | originalTitle | startYear | runtimeMinutes | genres | title | region | language | averageRating | numVotes | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | AU | NaN | 5.3 | 210 |
2 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | HU | NaN | 5.3 | 210 |
3 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Miss Jerry | US | NaN | 5.3 | 210 |
4 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45 | Romance | Fräulein Jerry | DE | NaN | 5.3 | 210 |
6 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100 | Documentary,News,Sport | The Corbett-Fitzsimmons Fight | US | NaN | 5.2 | 509 |
7 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100 | Documentary,News,Sport | Бой Корбетта и Фитцсиммонса | RU | NaN | 5.2 | 509 |
9 | tt0000502 | Bohemios | Bohemios | 1970.0 | 100 | NaN | Bohemios | ES | NaN | 4.2 | 16 |
11 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | AU | NaN | 6.0 | 882 |
12 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | GB | NaN | 6.0 | 882 |
13 | tt0000574 | The Story of the Kelly Gang | The Story of the Kelly Gang | 1970.0 | 70 | Action,Adventure,Biography | The Story of the Kelly Gang | SG | en | 6.0 | 882 |
title_basics.to_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv', index=False)
Now, interestingly, we have multiple rows for each title, one per region in which it was released, and the ratings and vote counts differ by region. Since I really like the idea of comparing movie popularity across regions, I think it makes the most sense to leave this structure unchanged.
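As a quick sanity check of that one-row-per-region shape, the number of release regions per title can be counted like this (toy data, not the real frame):

```python
import pandas as pd

df = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt2'],
    'region': ['US', 'DE', 'AU', 'US'],
})
# One row per (title, region): count distinct release regions per title
regions_per_title = df.groupby('tconst')['region'].nunique()
print(regions_per_title.to_dict())  # → {'tt1': 3, 'tt2': 1}
```

The same groupby on the full frame would show how widely each movie was distributed, which is exactly the cross-region comparison planned for Tableau.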
I think that this is pretty good! Let's get this into Tableau and start building a cool visualization!
title_basics = pd.read_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv')
votes_per_year = title_basics[['startYear', 'numVotes']].groupby('startYear').sum()
votes_per_year
numVotes | |
---|---|
startYear | |
1896.0 | 19 |
1897.0 | 124 |
1898.0 | 24 |
1899.0 | 214 |
1900.0 | 80 |
... | ... |
2020.0 | 688188266 |
2021.0 | 1207891667 |
2022.0 | 1163778231 |
2023.0 | 698190470 |
2024.0 | 48447321 |
129 rows × 1 columns
votes_per_year.reset_index(inplace=True)
votes_per_year
startYear | numVotes | |
---|---|---|
0 | 1896.0 | 19 |
1 | 1897.0 | 124 |
2 | 1898.0 | 24 |
3 | 1899.0 | 214 |
4 | 1900.0 | 80 |
... | ... | ... |
124 | 2020.0 | 688188266 |
125 | 2021.0 | 1207891667 |
126 | 2022.0 | 1163778231 |
127 | 2023.0 | 698190470 |
128 | 2024.0 | 48447321 |
129 rows × 2 columns
# Convert startYear to str for both dataframes
votes_per_year['startYear'] = votes_per_year['startYear'].astype(str)
title_basics['startYear'] = title_basics['startYear'].astype(str)
# Map each row's release year to that year's total vote count (vectorized
# lookup instead of a per-year loop; same result)
year_votes = votes_per_year.set_index('startYear')['numVotes']
title_basics['votesThisYear'] = title_basics['startYear'].map(year_votes)
title_basics[['startYear', 'votesThisYear']].groupby('startYear').max().reset_index()
startYear | votesThisYear | |
---|---|---|
0 | 1896.0 | 1.900000e+01 |
1 | 1897.0 | 1.240000e+02 |
2 | 1898.0 | 2.400000e+01 |
3 | 1899.0 | 2.140000e+02 |
4 | 1900.0 | 8.000000e+01 |
... | ... | ... |
125 | 2021.0 | 1.207892e+09 |
126 | 2022.0 | 1.163778e+09 |
127 | 2023.0 | 6.981905e+08 |
128 | 2024.0 | 4.844732e+07 |
129 | nan | NaN |
130 rows × 2 columns
title_basics.head()
tconst | primaryTitle | originalTitle | startYear | runtimeMinutes | genres | title | region | language | averageRating | numVotes | votesThisYear | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45.0 | Romance | Miss Jerry | AU | NaN | 5.3 | 210 | 7.098898e+09 |
1 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45.0 | Romance | Miss Jerry | HU | NaN | 5.3 | 210 | 7.098898e+09 |
2 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45.0 | Romance | Miss Jerry | US | NaN | 5.3 | 210 | 7.098898e+09 |
3 | tt0000009 | Miss Jerry | Miss Jerry | 1970.0 | 45.0 | Romance | Fräulein Jerry | DE | NaN | 5.3 | 210 | 7.098898e+09 |
4 | tt0000147 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 1970.0 | 100.0 | Documentary,News,Sport | The Corbett-Fitzsimmons Fight | US | NaN | 5.2 | 509 | 7.098898e+09 |
# Use one-hot encoding to convert the genres column into a format that can be used for analysis
title_basics = title_basics.join(title_basics['genres'].str.get_dummies(sep=','))
title_basics.drop('genres', axis=1, inplace=True)
# Drop any stragglers flagged with the Adult genre, then the now-constant indicator column
title_basics = title_basics[title_basics['Adult'] != 1]
title_basics.drop('Adult', axis=1, inplace=True)
#export to csv
title_basics.to_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv', index=False)
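For reference, on a small series `str.get_dummies` splits the comma-joined genres into one indicator column per genre, sorted alphabetically:

```python
import pandas as pd

s = pd.Series(['Drama', 'Comedy,Romance', 'Comedy'])
dummies = s.str.get_dummies(sep=',')
print(dummies.columns.tolist())  # → ['Comedy', 'Drama', 'Romance']
print(dummies.values.tolist())   # → [[0, 1, 0], [1, 0, 1], [1, 0, 0]]
```

Each row ends up with a 1 in every genre it lists, which is the wide format Tableau and most models expect.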
Much of the process that takes place in Tableau is partly a piece of the Analyze stage and partly the entirety of the Construct phase. Most of that work will not be recorded as it happens, but a final visualization will demonstrate the fruits of this analysis.
The final visualization can be found here: Link