While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which we can download from their GitHub repository.
For this project, we'll clean and explore the dataset. There are several columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in a column. As a result, this dataset needs a lot of cleaning.
import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?�� | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?��', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
Take a look at the next two columns:
Both represent Yes
/No
questions. They can also be NaN
where a respondent chooses not to answer a question.
Both columns are currently string types, because the main values they contain are Yes
and No
. We can make the data a bit easier to analyze later by converting each column to a Boolean with only the values True
, False
, and NaN
. Booleans are easier to work with because we can select the rows that are True
or False
without having to do a string comparison.
yes_no = {'Yes': True, 'No': False}
for col in [
"Have you seen any of the 6 films in the Star Wars franchise?",
"Do you consider yourself to be a fan of the Star Wars film franchise?"
]:
star_wars[col] = star_wars[col].map(yes_no)
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?�� | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen?
Please select all that apply.
The columns for this question are:
Which of the following Star Wars films have you seen?
Please select all that apply. — whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4
— whether or not the respondent saw Star Wars: Episode II Attack of the Clones
.Unnamed: 5
— whether or not the respondent saw Star Wars: Episode III Revenge of the Sith
.Unnamed: 6
— whether or not the respondent saw Star Wars: Episode IV A New Hope
.Unnamed: 7
— whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back
.Unnamed: 8
— whether or not the respondent saw Star Wars: Episode VI Return of the Jedi
.For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN
, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
We'll need to convert each of these columns to a Boolean, then rename the column something more intuitive. We can convert the values the same way we did earlier, except that we'll need to include the movie title and NaN
in the mapping dictionary.
import numpy as np
movie_mapping = {
"Star Wars: Episode I The Phantom Menace": True,
np.nan: False,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True
}
for col in star_wars.columns[3:9]:
star_wars[col] = star_wars[col].map(movie_mapping)
star_wars = star_wars.rename(columns={
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?�� | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | True | True | True | True | True | True | 3.0 | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
The next six columns ask the respondent to rank the Star Wars movies in order from least to most favorite. 1
means the film was the most favorite, and 6
means it was the least favorite. Each of the following columns can contain the value 1
, 2
, 3
, 4
, 5
, 6
, or NaN
:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
- How much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10
— How much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11
— How much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12
— How much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13
— How much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14
— How much the respondent liked Star Wars: Episode VI Return of the Jedi
Fortunately, these columns don't require much cleaning. We'll need to convert each column to a numeric type, then rename the columns so that we can tell what they represent more easily.
star_wars = star_wars.rename(columns={
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?�� | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | True | True | True | True | True | True | 3.0 | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie quicker. To do this, take the mean of each ranking columns using the pandas.DataFrame.mean()
method on DataFrames.
star_wars[star_wars.columns[9:15]].mean()
ranking_1 3.732057 ranking_2 4.084827 ranking_3 4.337321 ranking_4 3.273596 ranking_5 2.516129 ranking_6 3.051374 dtype: float64
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(range(1,7,1), star_wars[star_wars.columns[9:15]].mean())
<BarContainer object of 6 artists>
As we suspected, it looks like the "original" movies are rated much more highly than the newer ones.
Earlier in this project, we cleaned up the seen
columns and converted their values to the Boolean type. When we call methods like pandas.DataFrame.sum()
or mean()
, they treat Booleans like integers. They consider True
a 1
and False
a 0
. That means we can figure out how many people have seen each movie just by taking the sum of the column (even though they contain Boolean values).
star_wars[star_wars.columns[3:9]].sum()
seen_1 674 seen_2 572 seen_3 551 seen_4 608 seen_5 759 seen_6 739 dtype: int64
plt.bar(range(1,7,1), star_wars[star_wars.columns[3:9]].sum())
<BarContainer object of 6 artists>
It appears that more respondents saw the new movies than the old movies. This reinforces what we saw in the rankings, where the newer movies seem to be more popular
We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
Do you consider yourself to be a fan of the Star Wars film franchise?
— True
or False
Do you consider yourself to be a fan of the Star Trek franchise?
— Yes
or No
Gender
— Male
or Female
We can split a DataFrame into two groups based on a binary column by creating two subsets of that column.
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
The subsets allow us to compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.
import matplotlib.style as style
style.use('ggplot')
plt.figure(figsize=(10,3))
plt.subplot(1,2,1)
plt.bar(range(1,7,1), males[males.columns[9:15]].mean(), color='lightblue')
plt.ylim(2,5)
plt.title('Males Ranking')
plt.subplot(1,2,2)
plt.bar(range(1,7,1), females[females.columns[9:15]].mean())
plt.ylim(2,5)
plt.title('Females Ranking')
plt.show()
style.use('ggplot')
plt.figure(figsize=(10,3))
plt.subplot(1,2,1)
plt.bar(range(1,7,1), males[males.columns[3:9]].sum(), color='lightblue')
plt.ylim(200,400)
plt.title('Males Views')
plt.subplot(1,2,2)
plt.bar(range(1,7,1), females[females.columns[3:9]].sum())
plt.title('Females Views')
plt.ylim(200,400)
plt.show()
Interestingly, more males watches episodes 1-3, but males liked them far less than females did.