Jeopardy is a popular TV show in the US where participants answer questions to win money. I am going to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help to win.
The dataset is named jeopardy.csv and contains the first 20,000 rows of the full dataset of Jeopardy questions, which can be downloaded here.
Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number: the Jeopardy episode number
- Air Date: the date the episode aired
- Round: the round of Jeopardy
- Category: the category of the question
- Value: the number of dollars the correct answer is worth
- Question: the text of the question
- Answer: the text of the answer
First I am going to read the dataset and explore.
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')
Some of the column names have leading spaces; I am going to remove them:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer'], dtype='object')
Let's have a close look at the format of each column.
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Show Number  19999 non-null  int64
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB
One messy aspect of the Jeopardy dataset is that it contains free text. Text can contain punctuation and mixed capitalization, which makes it hard to compare the text of an answer to the text of a question. To make that comparison easier, we need to process the text data in this step. The process of cleaning text in data analysis is sometimes called normalization. More specifically, we want to ensure that we lowercase all of the words and remove any punctuation, so that the text is reduced to plain lowercase letters. Without normalization, the terms Don't and don't would be treated as different words, and we don't want that.
Before starting the analysis, we need to normalize and fix the datatypes of some columns. I need to lowercase the `Question` and `Answer` columns and remove the punctuation; the `Value` column should be numeric and the `Air Date` column should be a datetime.
First I am going to write a function that takes a string and returns it lowercased and without punctuation.
import re

def normalize(text):
    # lowercase the text and strip punctuation (keep word chars and whitespace)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text
# test normalize function
normalize("Hello! How are you?")
'hello how are you'
Let's apply the normalize function to the `Question` and `Answer` columns and save the results in the `clean_question` and `clean_answer` columns.
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_question'].head(5)
0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
jeopardy['clean_answer'].head(5)
0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
To normalize the `Value` column, I am going to remove the dollar sign from the beginning, convert it from text to numeric, and save the result to a new column called `clean_value`.
def normalize_value(value):
    # strip the dollar sign (and any other punctuation), then convert to int
    value = re.sub(r'[^\w\s]', '', value)
    try:
        value_int = int(value)
    except ValueError:
        # non-numeric values such as 'None' become 0
        value_int = 0
    return value_int
# test
normalize_value('$200')
200
#apply normalize_value function to Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
The `Air Date` column should also be converted to datetime so that we can work with it easily.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
Let's see the types of all columns especially the new ones again.
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Show Number     19999 non-null  int64
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object
 3   Category        19999 non-null  object
 4   Value           19999 non-null  object
 5   Question        19999 non-null  object
 6   Answer          19999 non-null  object
 7   clean_question  19999 non-null  object
 8   clean_answer    19999 non-null  object
 9   clean_value     19999 non-null  int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often new questions are repeats of older questions.

To answer the second question I need to figure out how often complex words (six or more characters) reoccur, and for the first question I need to see how many times words in the answer also occur in the question.
Let's start with the first question. I am going to write a function that calculates, for each question, the ratio of words in the answer that are also found in the question. Then I will apply it to all of the questions and take the average. In this function, 'the' is excluded from the words under investigation since it is not a meaningful word.
def count_matches_ratio(row):
    answer = row['clean_answer']
    question = row['clean_question']
    split_answer = answer.split()
    split_question = question.split()
    match_count = 0
    # 'the' carries no real information, so drop it
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
# use apply() to loop over all the rows
jeopardy['answer_in_question'] = jeopardy.apply(count_matches_ratio, axis = 1)
jeopardy['answer_in_question'].mean()
0.05900196524977763
On average, only about 6% of the words in answers are found in the questions, so the chance of deducing the answer from the question alone is quite low.
Let's move on to the second question and investigate how often new questions are repeats of older ones. I cannot completely answer this question because the dataset includes only about 10% of the full Jeopardy question archive, but I am going to investigate it anyway. I am going to check whether the terms with six or more characters in each question have been used previously.
question_overlap = []
# unique set of terms seen so far
terms_used = set()
# sort by air date so earlier questions come first
jeopardy.sort_values('Air Date', inplace=True)
# loop over the data frame row by row
for i, row in jeopardy.iterrows():
    # get the list of words in the question
    split_question = row['clean_question'].split()
    # keep only words with six or more characters
    split_question = [q for q in split_question if len(q) >= 6]
    match_count = 0
    for term in split_question:
        if term in terms_used:
            match_count += 1
        terms_used.add(term)
    if len(split_question) > 0:
        # normalize the count across different question lengths
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
# average overlap of complex terms with previously seen terms
jeopardy['question_overlap'].mean()
0.689481997219586
About 69% of the complex words in questions have appeared before, so it seems studying past questions could really help to win.
Let's focus our study on high-value questions instead of low-value ones; this is helpful for earning more money.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to narrow down the questions into two categories:

- Low value: any row where `clean_value` is $800 or less.
- High value: any row where `clean_value` is greater than $800.

I'll then be able to loop through each of the terms from `terms_used`, and:

- Find the number of high-value questions the word occurs in.
- Find the number of low-value questions the word occurs in.
- Compute the expected counts and the chi-squared value from those counts.

I can then find the words with the biggest differences in usage between high-value and low-value questions by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so I'll just do it for a small sample now.
def categorize_value(row):
    # flag questions worth more than $800 as high value
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(categorize_value, axis = 1)
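As an aside, the same flag can be computed without `apply`, using a vectorized comparison. A minimal sketch on a toy frame (the vectorized form is my suggestion, not part of the original solution):

```python
import pandas as pd

# toy frame standing in for the real jeopardy DataFrame
toy = pd.DataFrame({'clean_value': [200, 1000, 800, 1200]})

# vectorized equivalent of categorize_value: 1 when the value exceeds $800
toy['high_value'] = (toy['clean_value'] > 800).astype(int)
print(toy['high_value'].tolist())  # [0, 1, 0, 1]
```

On 20,000 rows either version is fast, but the vectorized comparison scales far better than a Python-level `apply`.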
def count_values(word):
    # count the high- and low-value questions containing the word
    low_count = 0
    high_count = 0
    for _, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
#Randomly pick ten elements of terms_used
from random import choice
comparison_terms = [choice(list(terms_used)) for _ in range(10)]
comparison_terms
['recruits', 'hotshot', '500000member', 'exceptions', 'dipsomaniac', 'tylenol', 'letters', 'latvia', 'bergens', 'strangely']
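One caveat worth noting: `choice` samples with replacement, so the same term could in principle be picked twice. `random.sample` draws without replacement; a small sketch (the term set here is a hypothetical subset for illustration):

```python
from random import sample, seed

seed(0)  # make the draw reproducible
terms = {'recruits', 'hotshot', 'tylenol', 'latvia', 'letters'}
# sample() draws without replacement, so the picked terms are all distinct
# (sorted() first, since sample() on a set is unsupported in Python 3.11+)
picked = sample(sorted(terms), 3)
print(picked)
```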
observed_expected = []
for word in comparison_terms:
    observed_expected.append(count_values(word))
observed_expected
[(1, 1), (1, 0), (0, 1), (0, 1), (0, 1), (0, 1), (17, 37), (3, 1), (0, 1), (1, 2)]
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
high_value_count = sum(jeopardy['high_value'])
low_value_count = jeopardy[jeopardy['high_value'] == 0]['high_value'].count()
# low_value_count2 = jeopardy['high_value'].count() - sum(jeopardy['high_value'])
print('high_value_count = {}'.format(high_value_count))
print('low_value_count = {}'.format(low_value_count))
high_value_count = 5734
low_value_count = 14265
import numpy as np
from scipy.stats import chisquare

chi_squared = []
for high_count, low_count in observed_expected:
    # total number of questions containing the word
    total = high_count + low_count
    # proportion of all questions that contain the word
    total_prop = total / jeopardy.shape[0]
    # expected counts, given the overall high/low split
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([high_count, low_count])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
chi_squared
[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.20850107809730017, pvalue=0.6479447887525934),
 Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]
Looking at the above results, only one p-value ('latvia', p ≈ 0.04) falls below 0.05; for the rest there is no significant difference in usage between high-value and low-value questions. More importantly, the observed frequencies were below 5 for all but one term, so the chi-squared test isn't really valid here. It would be better to run this test only on terms with higher frequencies.
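The rule of thumb behind that caveat is that every expected count in a chi-squared test should be at least 5. A quick sketch of the check, using the overall high/low counts computed above and the observed counts for 'letters' (17 high, 37 low) and 'latvia' (3 high, 1 low):

```python
# overall counts from the dataset above
high_value_count, low_value_count = 5734, 14265
n_questions = high_value_count + low_value_count  # 19999 rows

def expected_counts(total_occurrences):
    # expected high/low counts if the word were value-neutral
    prop = total_occurrences / n_questions
    return prop * high_value_count, prop * low_value_count

letters_exp = expected_counts(17 + 37)
print(letters_exp)  # roughly (15.5, 38.5): both above 5, so the test is valid
latvia_exp = expected_counts(3 + 1)
print(latvia_exp)   # roughly (1.1, 2.9): both below 5, so its p-value is unreliable
```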
We can eliminate non-informative words to decrease the size of `terms_used`, so that we are able to run the `count_values` function on more data. First we can remove stopwords.
A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We would not want these words to take up space in our data or use up valuable processing time, so let's remove them.
len(terms_used)
24470
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
for word in stop_words:
    if word in terms_used:
        terms_used.remove(word)
len(terms_used)
24454
Looking at the words in `terms_used`, there are some leftover links that are not relevant to our project question, so we can remove them as well.
terms_used_lr = pd.Series(list(terms_used))
# The tilde (~) operator is used to invert the boolean values
terms_used_lr = terms_used_lr[~terms_used_lr.str.contains('hrefhttp')]
len(terms_used_lr)
23251
There are still 23,251 words in `terms_used_lr`. At this stage, we can look at the `count_values` function and see if we can make it run faster.
Looking at the `count_values` function, there is a loop that iterates over the whole jeopardy dataset. We can replace it with vectorized pandas column operations to make it faster. To make the result easier to interpret, the new function also returns the word.
def count_values_faster(word):
    # regex pattern to match the whole word only
    pattern = r"\b{}\b".format(word)
    high_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex=True)) &
                          (jeopardy['high_value'] == 1)]['high_value'].count()
    low_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex=True)) &
                         (jeopardy['high_value'] == 0)]['high_value'].count()
    return word, high_count, low_count
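One assumption baked into this pattern is that the word contains no regex metacharacters. That is safe here because normalization has already stripped punctuation, but if the function were ever reused on raw terms, `re.escape` would be the guard. A sketch with a hypothetical un-normalized term:

```python
import re

word = "3.5"  # hypothetical raw term; '.' is a regex metacharacter
pattern = r"\b{}\b".format(re.escape(word))
print(bool(re.search(pattern, "a 3.5 inch disk")))  # True: matches the literal '3.5'
print(bool(re.search(pattern, "a 325 inch disk")))  # False: no accidental wildcard match
```

Without the `re.escape`, the unescaped pattern would let the `.` match any character, so "325" would also match.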
Let's test to make sure that we get the same result as the count_values function.
observed_test = []
for word in comparison_terms:
    observed_test.append(count_values_faster(word))
print(observed_test)
[('recruits', 1, 1), ('hotshot', 1, 0), ('500000member', 0, 1), ('exceptions', 0, 1), ('dipsomaniac', 0, 1), ('tylenol', 0, 1), ('letters', 17, 37), ('latvia', 3, 1), ('bergens', 0, 1), ('strangely', 1, 2)]
The test passes: the results are the same as before, with much better performance.
Now I am going to apply this new function to all of the terms in `terms_used_lr`. It takes a while to run, but it is far more practical than `count_values`.
frequencies = terms_used_lr.apply(count_values_faster)
frequencies
0             (boasts, 5, 6)
1          (integrity, 1, 0)
2            (puberty, 0, 1)
3            (gosling, 2, 0)
4             (seward, 0, 2)
                 ...
24449      (beatified, 0, 1)
24450         (boxers, 0, 1)
24451    (modernqueen, 0, 1)
24452     (arthropods, 1, 0)
24453        (waldorf, 1, 1)
Length: 23251, dtype: object
To make the chi-squared test valid, let's filter for words with high frequency and run the chi-squared test on the 1,000 most frequent terms.
def get_high_frequecies(data, size):
    # build a frequency table and keep the `size` most frequent terms
    frequencies = pd.DataFrame(data,
                               columns=['word', 'high_value', 'low_value'])
    frequencies['total_value'] = frequencies['high_value'] + frequencies['low_value']
    frequencies.sort_values('total_value', ascending=False, inplace=True)
    return frequencies.head(size)
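For what it's worth, pandas can do the sort-and-head in one step with `nlargest`. A sketch on a toy frame (my suggestion, equivalent in result to the function above):

```python
import pandas as pd

freq = pd.DataFrame({'word': ['called', 'physics', 'border'],
                     'total_value': [514, 14, 34]})
# nlargest(n, column) == sort_values(column, ascending=False).head(n)
top2 = freq.nlargest(2, 'total_value')
print(top2['word'].tolist())  # ['called', 'border']
```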
high_frequecies = get_high_frequecies(list(frequencies),1000)
high_frequecies
| | word | high_value | low_value | total_value |
|---|---|---|---|---|
867 | called | 168 | 346 | 514 |
2728 | country | 141 | 332 | 473 |
19362 | played | 77 | 212 | 289 |
8683 | became | 79 | 203 | 282 |
4831 | american | 77 | 174 | 251 |
... | ... | ... | ... | ... |
17472 | controversial | 4 | 10 | 14 |
5216 | consists | 3 | 11 | 14 |
5286 | stopped | 1 | 13 | 14 |
16451 | figures | 4 | 10 | 14 |
16267 | waterfall | 5 | 9 | 14 |
1000 rows × 4 columns
def calculate_chi_squared(row):
    # chi-squared test for a single word's high/low counts
    chi_squared = []
    total_prop = row['total_value'] / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([row['high_value'], row['low_value']])
    expected = np.array([high_value_exp, low_value_exp])
    chi_value, p_value = chisquare(observed, expected)
    chi_squared.append((row['word'], chi_value, p_value,
                        row['high_value'], row['low_value']))
    return chi_squared
chi_squared = high_frequecies.apply(calculate_chi_squared, axis = 1)
chi_squared.head(5)
867      [(called, 4.048305063534577, 0.044215717944225...
2728     [(country, 0.29967829483482744, 0.584084171311...
19362    [(played, 0.5810990283039111, 0.44588185909193...
8683     [(became, 0.05956570730840162, 0.8071836789959...
4831     [(american, 0.4938111242657224, 0.482232156839...
dtype: object
At this stage, we can filter for words with a `p_value` below 0.05 to figure out which words show a significantly different usage between high-value and low-value questions. I am also looking for words that occur more often in `high_value` questions than in `low_value` ones.
chi_squared_df = pd.DataFrame([c[0] for c in chi_squared],
                              columns=['word', 'chi_squared', 'p_value',
                                       'high_value', 'low_value'])
chi_squared_df = chi_squared_df.sort_values('p_value')
chi_squared_df = chi_squared_df[(chi_squared_df['p_value'] < 0.05) &
                                (chi_squared_df['high_value'] > chi_squared_df['low_value'])]
chi_squared_df
chi_squared_df
| | word | chi_squared | p_value | high_value | low_value |
|---|---|---|---|---|---|
179 | monitora | 45.947439 | 1.214686e-11 | 35 | 13 |
78 | target_blanksarah | 24.358972 | 7.995351e-07 | 40 | 33 |
226 | target_blankkelly | 20.921282 | 4.785483e-06 | 25 | 16 |
93 | african | 17.283572 | 3.219584e-05 | 35 | 33 |
494 | painter | 16.941684 | 3.854581e-05 | 16 | 8 |
159 | target_blankjimmy | 16.114608 | 5.962236e-05 | 28 | 24 |
217 | target_blankjon | 13.979777 | 1.847876e-04 | 23 | 19 |
498 | pulitzer | 13.429676 | 2.476749e-04 | 15 | 9 |
388 | liquid | 12.719123 | 3.619354e-04 | 17 | 12 |
467 | example | 11.997980 | 5.325823e-04 | 15 | 10 |
592 | spirit | 11.341071 | 7.581159e-04 | 13 | 8 |
689 | andrew | 11.049381 | 8.871680e-04 | 12 | 7 |
557 | plants | 9.954357 | 1.604691e-03 | 13 | 9 |
547 | relative | 9.954357 | 1.604691e-03 | 13 | 9 |
309 | border | 9.792563 | 1.752191e-03 | 18 | 16 |
422 | process | 9.542069 | 2.008152e-03 | 15 | 12 |
991 | physics | 8.682874 | 3.212141e-03 | 9 | 5 |
947 | spiritual | 8.682874 | 3.212141e-03 | 9 | 5 |
439 | string | 8.057304 | 4.532057e-03 | 14 | 12 |
885 | elements | 7.198788 | 7.295282e-03 | 9 | 6 |
625 | marine | 6.779092 | 9.223181e-03 | 11 | 9 |
461 | jersey | 6.652781 | 9.900120e-03 | 13 | 12 |
721 | greece | 6.361380 | 1.166308e-02 | 10 | 8 |
829 | translated | 5.950459 | 1.471346e-02 | 9 | 7 |
848 | window | 5.950459 | 1.471346e-02 | 9 | 7 |
861 | filled | 5.950459 | 1.471346e-02 | 9 | 7 |
590 | composed | 5.772340 | 1.628035e-02 | 11 | 10 |
618 | persian | 5.772340 | 1.628035e-02 | 11 | 10 |
584 | particles | 5.772340 | 1.628035e-02 | 11 | 10 |
602 | colony | 5.772340 | 1.628035e-02 | 11 | 10 |
992 | physicist | 5.549240 | 1.848871e-02 | 8 | 6 |
694 | freedom | 5.333590 | 2.091826e-02 | 10 | 9 |
788 | committee | 4.896281 | 2.691459e-02 | 9 | 8 |
792 | portuguese | 4.896281 | 2.691459e-02 | 9 | 8 |
919 | describe | 4.460992 | 3.467736e-02 | 8 | 7 |
907 | nature | 4.460992 | 3.467736e-02 | 8 | 7 |
chi_squared_df.shape[0]
36
In this project, a dataset of Jeopardy questions has been used to figure out some patterns in the questions that could help to win. After exploring, we figured out that:

- On average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question alone.
- About 69% of the complex terms (six or more characters) in questions had already appeared in earlier questions, so studying past questions looks worthwhile.
Then we focused our study on high-value questions instead of low-value ones, which is helpful for earning more money. Using a chi-squared test, we obtained a list of 36 words that are used more often in high-value questions, with a statistically significant difference in usage between high-value and low-value questions. A few of these "words" (the target_blank... entries) are leftover HTML fragments and could be filtered out in the same way as the hrefhttp links.

The next step could be to find the high-value questions containing these words; those questions could be recommended for study in order to win.
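The follow-up idea above can be sketched as a simple filter. A toy example (the two-row frame and word set are hypothetical stand-ins for the real `jeopardy` frame and the 36 significant words):

```python
import pandas as pd

# hypothetical stand-ins for the real DataFrame and significant-word list
jeopardy = pd.DataFrame({
    'clean_question': ['this pulitzer prize winner wrote a play',
                       'a small town in kansas'],
    'high_value': [1, 0],
})
study_words = {'pulitzer', 'physics', 'liquid'}

# keep high-value questions that contain at least one significant word
mask = jeopardy['clean_question'].apply(
    lambda q: any(w in q.split() for w in study_words))
recommended = jeopardy[mask & (jeopardy['high_value'] == 1)]
print(len(recommended))  # 1
```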