IMDb Data Exploration and Analysis¶

The creation of a clean dataset for use in Tableau for exploration and to build insights¶

Dataset: IMDb Non-Commercial Datasets
Publisher: IMDb URL: https://datasets.imdbws.com/
Analysis and Visualizations by: Jacob Krol


Project Scope:¶


Step 1: Plan¶

During this stage of the process, I am mostly concerned with forming a generalized approach to the dataset. Generally, the pure planning stage consists of collaboration between cross-functional teams and is great for establishing the scope and KPIs of a project. This project's scope has been defined above.

The primary questions which I am concerned with at this stage are:

  1. What is the scope of the project?
  2. What are the key deliverables which are expected to be delivered for this project?
  3. Who are the key stakeholders in this project?
  4. What is the timeline for the project?
  5. How was the data collected?
  6. Is this data reliable?
  7. Are there ethical concerns associated with the use of or contents of the data set?
  8. What python packages will be needed for this analysis?
  9. What models may be useful for this project?

Again, as this is an analysis for the sake of demonstration with a public dataset, many of the above questions are not relevent. It is going to be assumed that the data is accurate for this practice. However, ethical considerations should still always be taken. In this case, there is no data or variables present which I believe to be unethical. With regards to which model may be useful, I am immediately draw to the Poisson regression model, since this is what would generally be used for forcasing sales data. That being said, I will not rule out the Random Forest approach, which could offer a less transparent but higher-scoring model overall.


Step 1.5: Plan/ Analyze¶

During this stage in the process, it is important to become familiar with the dataset. This is largely where the technical planning begins. Once I have a better idea of the contents of the dataset, I can begin forming basic visualizations which will then inform which variables are focused on, which data engineering methods will be used to generate additional useful derived data, and which modeling methods will be the most useful.

The primary questions which I am concerned with at this stage are:

  1. What variables are included in the dataset?
  2. How many variables are in the dataset?
  3. How many entries are in the dataset?
  4. Are there any null values or outliers present in the dataset?
  5. Is there any apparent trend which is contributing to the outliers or null values which will affect how they are handled?
  6. Does an initial analysis uncover any multicollinearity which needs to be considered?
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Silence warnings from seaborn
import warnings
warnings.filterwarnings("ignore")
cwd = os.getcwd()
In [3]:
# Show all rows
pd.set_option('display.max_rows', None)
In [15]:
# Importing the datasets into pandas dataframes one at a time and clean them. 
# We don't plan to work with any adult films, so we'll filter those out. This is in title.basics.tsv

title_basics = pd.read_csv(cwd + '/data/title.basics.tsv', sep='\t')

# Describing the dataset
title_basics.describe()
Out[15]:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
count 10673965 10673965 10673947 10673947 10673965 10673965 10673965 10673965 10673745
unique 10673965 11 4806019 4829549 30 153 97 899 2371
top tt0000001 tvEpisode Episode #1.1 Episode #1.1 0 \N \N \N Drama
freq 1 8165289 51208 51208 10269704 1400226 10552418 7360018 1218510
In [16]:
title_basics.head()
Out[16]:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
0 tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 \N 5 Animation,Short
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 \N 4 Animation,Comedy,Romance
3 tt0000004 short Un bon bock Un bon bock 0 1892 \N 12 Animation,Short
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0 1893 \N 1 Comedy,Short

This isn't great. It looks like all of the data is in strings...
Let's start with the isAdult column since that will trim things down quite a bit. There are 30 unique values, reading as a string, so let's just see how many are in each category first. If there are few in each incorrect category, we'll just drop the other flags. If not, we'll figure it out from there.

In [17]:
print(len(title_basics['isAdult']))
title_basics[['tconst', 'isAdult']].groupby('isAdult').count()
10673965
Out[17]:
tconst
isAdult
0 10269704
1 338506
1977 1
1978 26
1979 10
1980 2
1981 1
1982 2
1983 6
1984 14
1985 29
1986 2
1987 10
2005 1
2011 1
2012 1
2013 8
2014 6
2015 28
2016 20
2017 17
2018 9
2019 7
2020 9
2021 1
2022 2
2023 6
0 63396
1 2139
\N 1

It looks like some of the data has been mishandled while creating the db. Let's just drop anything which has been incorrectly loaded. as a year or NA and all of the adult content at once.

In [31]:
title_basics['isAdult'] = title_basics['isAdult'].astype(int)
title_basics = title_basics[title_basics['isAdult'] == 0]
title_basics['isAdult'] = title_basics['isAdult'].astype(bool)
In [36]:
title_basics[['isAdult', 'tconst']].groupby('isAdult').count()
Out[36]:
tconst
isAdult
False 10333100

Great! Now that this column isn't useful to us, we can just drop it, knowing that nothing in here should be adult films. Now I want to look into what types of titles are in the dataset. I want to mostly focus on movies and tv, so let's see what's in there next.

In [37]:
title_basics.drop('isAdult', axis=1, inplace=True)
In [41]:
title_basics[['titleType','tconst']].groupby(['titleType']).count()
Out[41]:
tconst
titleType
movie 666922
short 985375
tvEpisode 7943467
tvMiniSeries 53242
tvMovie 144858
tvPilot 1
tvSeries 257707
tvShort 10279
tvSpecial 46917
video 186783
videoGame 37549

Interesting.. For the sake of having this data, I am going to split off each title type that I'm interested in into a different dataframe, then export it for myself to use in the future for more projects.

In [44]:
title_basics_movies = title_basics[title_basics['titleType'] == 'movie']
title_basics_tv_episodes = title_basics[title_basics['titleType'] == 'tvEpisode']
title_basics_tv_series = title_basics[title_basics['titleType'] == 'tvSeries']
title_basics_tv_movies = title_basics[title_basics['titleType'] == 'tvMovie']
title_basics_video_games = title_basics[title_basics['titleType'] == 'videoGame']

title_basics_movies.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.movies.tsv', sep='\t', index=False)
title_basics_tv_episodes.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvEpisodes.tsv', sep='\t', index=False)
title_basics_tv_series.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvSeries.tsv', sep='\t', index=False)
title_basics_tv_movies.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.tvMovies.tsv', sep='\t', index=False)
title_basics_video_games.to_csv(cwd + '/data/title_basics_by_titleType/title.basics.videoGames.tsv', sep='\t', index=False)

This project will focus only on movies so we can feel free to only look at movies going forward and drop any of the other data from the memory

In [45]:
del title_basics
del title_basics_movies
del title_basics_tv_episodes
del title_basics_tv_series
del title_basics_tv_movies
del title_basics_video_games

IMDb Movie Data Exploration and Analysis¶

The first thing that I want to do here is just check the dataset for null values, as well as getting a feeling for what the whole dataset of movies looks like. We'll use the title.basics.movies file to make a DF, from which we can use tconst to refine the other IMDb tables as we load them in.

In [10]:
title_basics = pd.read_csv(cwd + '/data/title_basics_by_titleType/title.basics.movies.tsv', sep='\t')
title_basics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 666922 entries, 0 to 666921
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          666922 non-null  object
 1   titleType       666922 non-null  object
 2   primaryTitle    666920 non-null  object
 3   originalTitle   666920 non-null  object
 4   startYear       666922 non-null  object
 5   endYear         666922 non-null  object
 6   runtimeMinutes  666922 non-null  object
 7   genres          666922 non-null  object
dtypes: object(8)
memory usage: 40.7+ MB
In [11]:
# Take care of the dates by converting them to datetime objects of just the year
title_basics['startYear'] = pd.to_datetime(title_basics['startYear'], errors='coerce')
title_basics['startYear'] = title_basics['startYear'].dt.year
title_basics['endYear'] = pd.to_datetime(title_basics['endYear'], errors='coerce')
title_basics['endYear'] = title_basics['endYear'].dt.year
In [12]:
title_basics.head(10)
Out[12]:
tconst titleType primaryTitle originalTitle startYear endYear runtimeMinutes genres
0 tt0000009 movie Miss Jerry Miss Jerry 1970.0 NaN 45 Romance
1 tt0000147 movie The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 NaN 100 Documentary,News,Sport
2 tt0000502 movie Bohemios Bohemios 1970.0 NaN 100 \N
3 tt0000574 movie The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 NaN 70 Action,Adventure,Biography
4 tt0000591 movie The Prodigal Son L'enfant prodigue 1970.0 NaN 90 Drama
5 tt0000615 movie Robbery Under Arms Robbery Under Arms 1970.0 NaN \N Drama
6 tt0000630 movie Hamlet Amleto 1970.0 NaN \N Drama
7 tt0000675 movie Don Quijote Don Quijote 1970.0 NaN \N Drama
8 tt0000679 movie The Fairylogue and Radio-Plays The Fairylogue and Radio-Plays 1970.0 NaN 120 Adventure,Fantasy
9 tt0000838 movie A Cultura do Cacau A Cultura do Cacau 1970.0 NaN \N \N

Interesting. So, even though the .info() that we called earlier indicated that there were no null values, we clearly see placeholders for null values. Let's convert those so that they are being read properly, then reassess the set. Also, we notice that the end years are all null. Since these are movies, they really just have release years, so let's drop the end year for this analysis. We also know that these are all movies, since that is the dataset that we isolated earlier.

In [13]:
# drop endYear column
title_basics.drop(['endYear', 'titleType'], axis=1, inplace=True)

# Replace the \N values with NaN
title_basics.replace(r'\N', np.nan, inplace=True)
In [14]:
title_basics.head(10)
Out[14]:
tconst primaryTitle originalTitle startYear runtimeMinutes genres
0 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance
1 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100 Documentary,News,Sport
2 tt0000502 Bohemios Bohemios 1970.0 100 NaN
3 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography
4 tt0000591 The Prodigal Son L'enfant prodigue 1970.0 90 Drama
5 tt0000615 Robbery Under Arms Robbery Under Arms 1970.0 NaN Drama
6 tt0000630 Hamlet Amleto 1970.0 NaN Drama
7 tt0000675 Don Quijote Don Quijote 1970.0 NaN Drama
8 tt0000679 The Fairylogue and Radio-Plays The Fairylogue and Radio-Plays 1970.0 120 Adventure,Fantasy
9 tt0000838 A Cultura do Cacau A Cultura do Cacau 1970.0 NaN NaN

This table looks properly cleaned! Now let's see what else there is to explore by pulling up the IMDb database information to look over what our other files have to share.


IMDb Data Description¶

title.akas.tsv.gz¶

titleId (string) - a tconst, an alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
title (string) – the localized title
region (string) - the region for this version of the title
language (string) - the language of the title
types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
attributes (array) - Additional terms to describe this alternative title, not enumerated
isOriginalTitle (boolean) – 0: not original title; 1: original title

title.basics.tsv.gz¶

tconst (string) - alphanumeric unique identifier of the title
titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title

### title.crew.tsv.gz tconst (string) - alphanumeric unique identifier of the title directors (array of nconsts) - director(s) of the given title writers (array of nconsts) – writer(s) of the given title

title.episode.tsv.gz¶

tconst (string) - alphanumeric identifier of episode
parentTconst (string) - alphanumeric identifier of the parent TV Series
seasonNumber (integer) – season number the episode belongs to
episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz tconst (string) - alphanumeric unique identifier of the title ordering (integer) – a number to uniquely identify rows for a given titleId nconst (string) - alphanumeric unique identifier of the name/person category (string) - the category of job that person was in job (string) - the specific job title if applicable, else '\N' characters (string) - the name of the character played if applicable, else '\N'

title.ratings.tsv.gz¶

tconst (string) - alphanumeric unique identifier of the title
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received

### name.basics.tsv.gz nconst (string) - alphanumeric unique identifier of the name/person primaryName (string)– name by which the person is most often credited birthYear – in YYYY format deathYear – in YYYY format if applicable, else '\N' primaryProfession (array of strings)– the top-3 professions of the person knownForTitles (array of tconsts) – titles the person is known for


I think that region and language from title.akas.tsv, as well as all of the data from title.ratings.tsv.

In [15]:
# import the next dataset
title_ratings = pd.read_csv(cwd + '/data/title.ratings.tsv', sep='\t')
title_akas = pd.read_csv(cwd + '/data/title.akas.tsv', sep='\t')
In [16]:
title_akas.rename(columns={'titleId': 'tconst'}, inplace=True)
In [17]:
# in order to remove the adult films, we need to merge the title_basics and title_akas dataframes
title_basics = pd.merge(title_basics, title_akas, on='tconst')
title_basics = pd.merge(title_basics, title_ratings, on='tconst')
In [18]:
title_basics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2359924 entries, 0 to 2359923
Data columns (total 15 columns):
 #   Column           Dtype  
---  ------           -----  
 0   tconst           object 
 1   primaryTitle     object 
 2   originalTitle    object 
 3   startYear        float64
 4   runtimeMinutes   object 
 5   genres           object 
 6   ordering         int64  
 7   title            object 
 8   region           object 
 9   language         object 
 10  types            object 
 11  attributes       object 
 12  isOriginalTitle  int64  
 13  averageRating    float64
 14  numVotes         int64  
dtypes: float64(2), int64(3), object(10)
memory usage: 270.1+ MB
In [19]:
title_basics.drop(['ordering', 'types', 'attributes', 'isOriginalTitle'], axis=1, inplace=True)
In [20]:
title_basics.replace(r'\N', np.nan, inplace=True)
title_basics.dropna(subset=['region'], inplace=True)
In [21]:
title_basics.head(10)
Out[21]:
tconst primaryTitle originalTitle startYear runtimeMinutes genres title region language averageRating numVotes
1 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry AU NaN 5.3 210
2 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry HU NaN 5.3 210
3 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry US NaN 5.3 210
4 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Fräulein Jerry DE NaN 5.3 210
6 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100 Documentary,News,Sport The Corbett-Fitzsimmons Fight US NaN 5.2 509
7 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100 Documentary,News,Sport Бой Корбетта и Фитцсиммонса RU NaN 5.2 509
9 tt0000502 Bohemios Bohemios 1970.0 100 NaN Bohemios ES NaN 4.2 16
11 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang AU NaN 6.0 882
12 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang GB NaN 6.0 882
13 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang SG en 6.0 882
In [22]:
len(title_basics) - len(title_basics.dropna(subset=['genres']))
Out[22]:
20109

I would like to look into genres popularity by region as well, but I don't think that I want to drop the 20k null values, so I will leave them be and address them in visualization if need-be.

I came across an issue while visualizing the data within Tableau. The country codes were not recognized in the ISO format, so I will go in and selectively change those country codes over. Turns out that some of the countries within the set no longer exist as they did when the data was collected, so for those regions, one representative region code will be used from the resulting nations which such a contry has converted into.

It looks as though XAS, XAU, and similar are actually continent codes. These will be dropped, as we are trying to look at the data by country. XWW, or worldwide, will be removed for the same reason.

In [33]:
# Replace the region codes with the country codes
title_basics['region'].replace({'BH' : 'BHA',
                                'CR' : 'CRI',
                                'AN' : 'CQ',
                                'BUMM' : 'MM',
                                'CSHH' : 'CZE',
                                'CSXX' : 'CS',
                                'DDDE' : 'DEU',
                                'SUHH' : 'SU',
                                'VDVN' : 'VD',
                                'XAS' : 'AS',
                                'XAU' : np.nan,
                                'XEU' : np.nan,
                                'XKO' : 'XK',
                                'XKV' : 'XK',
                                'XNA' : 'KNA',
                                'XPI' :'ISR',
                                'XSA' : np.nan,
                                'XSI': 'THA',
                                'XWG' : 'DEU',
                                'XWW': np.nan,
                                'XYU' : 'YU',
                                'YUCS' : 'YU',
                                'ZRCD' : 'CD',
                                }, inplace=True)
In [34]:
# Drop the rows with NaN values in the region column
title_basics.dropna(subset=['region'], inplace=True)
In [31]:
title_basics.head(10)
Out[31]:
tconst primaryTitle originalTitle startYear runtimeMinutes genres title region language averageRating numVotes
1 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry AU NaN 5.3 210
2 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry HU NaN 5.3 210
3 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Miss Jerry US NaN 5.3 210
4 tt0000009 Miss Jerry Miss Jerry 1970.0 45 Romance Fräulein Jerry DE NaN 5.3 210
6 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100 Documentary,News,Sport The Corbett-Fitzsimmons Fight US NaN 5.2 509
7 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100 Documentary,News,Sport Бой Корбетта и Фитцсиммонса RU NaN 5.2 509
9 tt0000502 Bohemios Bohemios 1970.0 100 NaN Bohemios ES NaN 4.2 16
11 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang AU NaN 6.0 882
12 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang GB NaN 6.0 882
13 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1970.0 70 Action,Adventure,Biography The Story of the Kelly Gang SG en 6.0 882
In [35]:
title_basics.to_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv', index=False)

Now the interesting thing is that we have multiple for each title depending on where is was released. However, we see that there are different ratings and numbers of votes depending on where the movie was released. Since I really like the idea of comparing movie popularity across regions, I think it makes the most sense to leave this in without change.

I think that this is pretty good! Let's get this into Tableau and start building a cool visualization!

In [37]:
title_basics = pd.read_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv')
In [38]:
votes_per_year = title_basics[['startYear', 'numVotes']].groupby('startYear').sum()
In [39]:
votes_per_year
Out[39]:
numVotes
startYear
1896.0 19
1897.0 124
1898.0 24
1899.0 214
1900.0 80
... ...
2020.0 688188266
2021.0 1207891667
2022.0 1163778231
2023.0 698190470
2024.0 48447321

129 rows × 1 columns

In [40]:
votes_per_year.reset_index(inplace=True)
votes_per_year
Out[40]:
startYear numVotes
0 1896.0 19
1 1897.0 124
2 1898.0 24
3 1899.0 214
4 1900.0 80
... ... ...
124 2020.0 688188266
125 2021.0 1207891667
126 2022.0 1163778231
127 2023.0 698190470
128 2024.0 48447321

129 rows × 2 columns

In [41]:
# Convert startYear to str for both dataframes
votes_per_year['startYear'] = votes_per_year['startYear'].astype(str)
title_basics['startYear'] = title_basics['startYear'].astype(str)
In [ ]:
for year in votes_per_year['startYear']:
    title_basics.loc[title_basics['startYear'] == year, 'votesThisYear'] = votes_per_year.loc[votes_per_year['startYear'] == year, 'numVotes'].values[0]
In [43]:
title_basics[['startYear', 'votesThisYear']].groupby('startYear').max().reset_index()
Out[43]:
startYear votesThisYear
0 1896.0 1.900000e+01
1 1897.0 1.240000e+02
2 1898.0 2.400000e+01
3 1899.0 2.140000e+02
4 1900.0 8.000000e+01
... ... ...
125 2021.0 1.207892e+09
126 2022.0 1.163778e+09
127 2023.0 6.981905e+08
128 2024.0 4.844732e+07
129 nan NaN

130 rows × 2 columns

In [50]:
title_basics.head()
Out[50]:
tconst primaryTitle originalTitle startYear runtimeMinutes genres title region language averageRating numVotes votesThisYear
0 tt0000009 Miss Jerry Miss Jerry 1970.0 45.0 Romance Miss Jerry AU NaN 5.3 210 7.098898e+09
1 tt0000009 Miss Jerry Miss Jerry 1970.0 45.0 Romance Miss Jerry HU NaN 5.3 210 7.098898e+09
2 tt0000009 Miss Jerry Miss Jerry 1970.0 45.0 Romance Miss Jerry US NaN 5.3 210 7.098898e+09
3 tt0000009 Miss Jerry Miss Jerry 1970.0 45.0 Romance Fräulein Jerry DE NaN 5.3 210 7.098898e+09
4 tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 1970.0 100.0 Documentary,News,Sport The Corbett-Fitzsimmons Fight US NaN 5.2 509 7.098898e+09
In [53]:
# Use one hot encoding to convert the genres column into a format that can be used for analysis
title_basics = title_basics.join(title_basics['genres'].str.get_dummies(sep=','))
title_basics.drop('genres', axis=1, inplace=True)
title_basics = title_basics[title_basics['Adult'] != 1]
In [55]:
#export to csv
title_basics.to_csv(cwd + '/data/vis_ready/movie_ratings_by_region.csv', index=False)

Much of the process that will take place in Tableau is in-part a piece of the Analysis stage, and in-part the entirety of the Construct phase. Much of that process will not be recorded, as it is worked through, but a final visualization will demonstrate the fruits of this analysis.

The final visualization can be found here: Link