Hypothesis Testing: Winning Jeopardy¶

Jeopardy is a popular TV show in the US where participants answer questions to win money. I am going to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help to win.

The dataset is named jeopardy.csv and contains 20,000 rows from the beginning of the full dataset of Jeopardy questions.

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

  • Show Number -- the Jeopardy episode number of the show this question was in.
  • Air Date -- the date the episode aired.
  • Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  • Category -- the category of the question.
  • Value -- the number of dollars answering the question correctly is worth.
  • Question -- the text of the question.
  • Answer -- the text of the answer.

First I am going to read the dataset and explore.

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)
Out[1]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams
In [2]:
jeopardy.columns
Out[2]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces, so I am going to remove them:

In [3]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns
Out[3]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Let's have a close look at the format of each column.

In [6]:
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB

Normalize the columns¶

One messy aspect of the Jeopardy dataset is that it contains free text. Text can have punctuation and inconsistent capitalization, which makes it hard to compare the text of an answer with the text of a question. To make that comparison easier, we need to process the text data in this step. The process of cleaning text in data analysis is sometimes called normalization. More specifically, we want to lowercase all of the words and remove any punctuation; otherwise terms like Don't and don't would be treated as different words, which we don't want.

Before starting the analysis, we need to normalize the text and fix the datatypes of some columns. I need to lowercase the Question and Answer columns and remove their punctuation, the Value column should be numeric, and the Air Date column should be a datetime.

First I am going to write a function that takes a string and returns it lowercased and with punctuation removed.

In [7]:
import re
def normalize(text):
    text = text.lower()
    # raw string avoids an invalid-escape warning for \w in newer Python versions
    text = re.sub(r'[^\w\s]', '', text)
    return text

# test normalize function
normalize("Hello! How are you?")
Out[7]:
'hello how are you'

Let's apply the normalize function to Question and Answer columns and save the result in clean_question and clean_answer columns.

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_question'].head(5)
Out[8]:
0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object
In [9]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
jeopardy['clean_answer'].head(5)
Out[9]:
0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

To normalize the Value column I am going to remove the dollar sign from the beginning, convert it from text to numeric and save the result to a new column called clean_value.

In [13]:
def normalize_value(value):
    # strip the dollar sign and comma, then convert to int; non-numeric values become 0
    value = re.sub(r'[^\w\s]', '', value)
    try:
        value_int = int(value)
    except ValueError:
        value_int = 0
    return value_int
# test
normalize_value('$200')
Out[13]:
200
In [14]:
#apply normalize_value function to Value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

The Air Date column should also be a datetime so we can work with it easily.

In [15]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

Let's check the types of all the columns again, especially the new ones.

In [16]:
jeopardy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB

Study¶

In order to decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

  • How often words from the answer also appear in the question.
  • How often new questions are repeats of older questions.

To answer the second question I need to figure out how often complex words (6 or more characters) reoccur, and for the first question I need to see how many of the words in the answer also occur in the question.

Let's start with the first question. I am going to write a function that, for each row, calculates the ratio of the words in the answer that also appear in the question. Then I will apply it to all of the questions and compute the average. In this function, 'the' is excluded from the answer words since it is not a meaningful word.

In [18]:
def count_matches_ratio(row):
    answer = row['clean_answer']
    question = row['clean_question']
    split_answer = answer.split()
    split_question = question.split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)
  
# use apply() to loop over all the rows 
jeopardy['answer_in_question'] = jeopardy.apply(count_matches_ratio, axis = 1)  
jeopardy['answer_in_question'].mean()
Out[18]:
0.05900196524977763

On average, about 6% of the words in answers also appear in the corresponding questions, so the chance of deducing the answer from the question is quite low.

Repeated questions¶

Let's move on to the second question and investigate how often new questions are repeats of older ones. I cannot answer this question completely, since the dataset includes only about 10% of the full Jeopardy question set, but I am going to investigate it anyway.

I am going to check whether the terms with six or more characters in each question have been used previously or not.

In [25]:
question_overlap = []
# get unique set of words
terms_used = set()
# sort by air date so earlier questions come first and it is clear which questions are new
jeopardy.sort_values('Air Date', inplace = True)
# loop over the dataframe row by row
for i, row in jeopardy.iterrows():
    # get list of the words in a question
    split_question = row['clean_question'].split()
    # word contains 6+ characters
    split_question = [q for q in split_question if len(q)>= 6]
    match_count = 0
    for term in split_question:
        if term in terms_used:
            match_count += 1
        terms_used.add(term)
    if len(split_question) > 0:
        # normalize the count across different question length
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
# get the percentage of the repeated question
jeopardy['question_overlap'].mean()
Out[25]:
0.689481997219586

On average, about 69% of the complex words in a question have already appeared in earlier questions, so it seems that studying past questions can be really helpful for winning.

Study questions with high value¶

Let's focus our study on high value questions instead of low value ones; this is more helpful for earning money.

We can figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to split the questions into two categories:

  • Low value -- any row where clean_value is 800 or less.
  • High value -- any row where clean_value is greater than 800.

I'll then be able to loop through each of the terms from terms_used, and:

  • Find the number of low value questions the word occurs in.
  • Find the number of high value questions the word occurs in.
  • Find the percentage of questions the word occurs in.
  • Based on the percentage of questions the word occurs in, find expected counts.
  • Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

I can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so I'll just do it for a small sample now.
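To make the expected-count step concrete, here is a minimal sketch for a single hypothetical word, using made-up observed counts together with the overall high/low totals computed further down in this notebook. Under the null hypothesis, the word's occurrences are split between high and low value questions in the same proportion as the dataset as a whole.

# minimal sketch with hypothetical counts (not a cell from the analysis)
high_obs, low_obs = 3, 1                  # made-up observed counts for one word
n_high, n_low = 5734, 14265               # high/low value question totals (computed below)
n_total = n_high + n_low
total = high_obs + low_obs                # total questions containing the word
high_exp = total * n_high / n_total       # expected count if usage were independent of value
low_exp = total * n_low / n_total
chi_sq = ((high_obs - high_exp) ** 2 / high_exp
          + (low_obs - low_exp) ** 2 / low_exp)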

In [26]:
def categorize_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(categorize_value, axis = 1)
In [27]:
def count_values(word):
    low_count = 0
    high_count = 0
    for _, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
In [28]:
#Randomly pick ten elements of terms_used
from random import choice
comparison_terms = [choice(list(terms_used)) for _ in range(10)]
comparison_terms
Out[28]:
['recruits',
 'hotshot',
 '500000member',
 'exceptions',
 'dipsomaniac',
 'tylenol',
 'letters',
 'latvia',
 'bergens',
 'strangely']
In [29]:
observed_expected = []
for word in comparison_terms:
    observed_expected.append(count_values(word))
observed_expected
Out[29]:
[(1, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (17, 37),
 (3, 1),
 (0, 1),
 (1, 2)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [34]:
high_value_count = sum(jeopardy['high_value'])
low_value_count = jeopardy[jeopardy['high_value'] == 0]['high_value'].count()
# low_value_count2 = jeopardy['high_value'].count() - sum(jeopardy['high_value'])

print('high_value_count = {}'.format(high_value_count))
print('low_value_count = {}'.format(low_value_count))
high_value_count = 5734
low_value_count = 14265
In [35]:
import numpy as np
from scipy.stats import chisquare

chi_squared = []
for high_count, low_count in observed_expected:
    # total number of questions the word appears in
    total = high_count + low_count
    # proportion of all questions that contain the word
    total_prop = total / jeopardy.shape[0]
    # expected values according to the ratio of total high/low values
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([high_count, low_count])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared
Out[35]:
[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.20850107809730017, pvalue=0.6479447887525934),
 Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]

Looking at the results above, only one of the p-values (for 'latvia') is below 0.05, so for nearly all of these words there is no significant difference in usage between high value and low value questions. Moreover, the observed frequencies were, for all but one word, lower than 5, so the chi-squared test isn't really valid here; even the 'latvia' result rests on just 4 occurrences. It would be better to run this test only on terms with higher frequencies.

Eliminate non-informative words¶

We can eliminate non-informative words to decrease the size of terms_used, so that we can run the count_values function on more of the data. First, we can remove stopwords.

Remove stopwords¶

A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine is typically programmed to ignore, both when indexing entries for searching and when retrieving them as the results of a search query. We don't want these words to take up space or valuable processing time, so let's remove them.
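Note: the stopword removal below assumes the NLTK stopword corpus is already available locally; if it isn't, it can be fetched once with NLTK's standard download helper (a one-time setup step, assuming nltk itself is installed):

import nltk
nltk.download('stopwords')   # downloads the English stopword list used by nltk.corpus.stopwords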

In [36]:
len(terms_used)
Out[36]:
24470
In [40]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
for word in stop_words:
    if word in terms_used:
        terms_used.remove(word)
len(terms_used)
Out[40]:
24454

Remove hrefhttp¶

Looking at the words in terms_used, there are some HTML link fragments (containing hrefhttp) which are not relevant to our question, so we can remove them as well.

In [48]:
terms_used_lr = pd.Series(list(terms_used))
# The tilde (~) operator is used to invert the boolean values 
terms_used_lr = terms_used_lr[~terms_used_lr.str.contains('hrefhttp')]
len(terms_used_lr)
Out[48]:
23251

There are still 23,251 words in terms_used_lr. At this stage, we can look at the count_values function and see if I can make it run faster.

Re-write count_values function¶

Looking at the count_values function, there is a loop that iterates over the whole jeopardy dataset. We can replace it with vectorized pandas column operations to make it faster. To make the result easier to interpret, the new function returns the word as well.

In [42]:
def count_values_faster(word):
    high_count = 0
    low_count = 0
    
    # regex pattern to match the whole word only
    pattern = r"\b{}\b".format(word)
    high_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex = True)) &
                         (jeopardy['high_value'] == 1)]['high_value'].count()
    low_count = jeopardy[(jeopardy['clean_question'].str.contains(pattern, regex = True)) &
                        (jeopardy['high_value'] == 0)]['high_value'].count()
    return word, high_count, low_count

Let's test to make sure that we get the same result as the count_values function.

In [43]:
observed_test = []
for word in comparison_terms:
    observed_test.append(count_values_faster(word))
print(observed_test)
[('recruits', 1, 1), ('hotshot', 1, 0), ('500000member', 0, 1), ('exceptions', 0, 1), ('dipsomaniac', 0, 1), ('tylenol', 0, 1), ('letters', 17, 37), ('latvia', 3, 1), ('bergens', 0, 1), ('strangely', 1, 2)]

The test passes: the results match, and the new function is much more efficient.
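As a rough sanity check on the speedup (timings will vary by machine), one could time both versions on a single word with IPython's %timeit magic:

%timeit count_values('letters')          # row-by-row iteration over the dataframe
%timeit count_values_faster('letters')   # vectorized string matching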

I am going to apply this new function to all of the terms in terms_used_lr. It takes some time to run completely, but it is far more practical than count_values.

In [47]:
frequencies = terms_used_lr.apply(count_values_faster)
frequencies
Out[47]:
0             (boasts, 5, 6)
1          (integrity, 1, 0)
2            (puberty, 0, 1)
3            (gosling, 2, 0)
4             (seward, 0, 2)
                ...         
24449      (beatified, 0, 1)
24450         (boxers, 0, 1)
24451    (modernqueen, 0, 1)
24452     (arthropods, 1, 0)
24453        (waldorf, 1, 1)
Length: 23251, dtype: object

Words with higher frequencies¶

To make the chi-squared test valid, let's filter for words with high frequencies and run the chi-squared test on the 1,000 most frequent terms.

In [49]:
def get_high_frequecies(data, size):
    frequencies = pd.DataFrame(data, 
                               columns = ['word', 'high_value', 'low_value'])
    frequencies['total_value'] = frequencies['high_value'] + frequencies['low_value']
    frequencies.sort_values('total_value', ascending = False, inplace = True)
    return(frequencies.head(size))



high_frequecies = get_high_frequecies(list(frequencies),1000)
high_frequecies
Out[49]:
word high_value low_value total_value
867 called 168 346 514
2728 country 141 332 473
19362 played 77 212 289
8683 became 79 203 282
4831 american 77 174 251
... ... ... ... ...
17472 controversial 4 10 14
5216 consists 3 11 14
5286 stopped 1 13 14
16451 figures 4 10 14
16267 waterfall 5 9 14

1000 rows × 4 columns

In [50]:
def calculate_chi_squared(row):
    chi_squared = []
    total_prop = row['total_value']/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([row['high_value'], row['low_value']])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_value, p_value = chisquare(observed, expected)
    
    chi_squared.append((row['word'], chi_value, p_value, row['high_value'], row['low_value']))
    return chi_squared
                         
    
chi_squared = high_frequecies.apply(calculate_chi_squared, axis = 1)
chi_squared.head(5)
Out[50]:
867      [(called, 4.048305063534577, 0.044215717944225...
2728     [(country, 0.29967829483482744, 0.584084171311...
19362    [(played, 0.5810990283039111, 0.44588185909193...
8683     [(became, 0.05956570730840162, 0.8071836789959...
4831     [(american, 0.4938111242657224, 0.482232156839...
dtype: object

At this stage, we can filter for words with p-values less than 0.05 to figure out which words are used significantly differently in high value and low value questions. I am also only keeping words that appear more often in high_value questions than in low_value ones.

In [51]:
# each element of chi_squared is a one-element list holding a (word, chi, p, high, low) tuple
chi_squared_df = pd.DataFrame([c[0] for c in chi_squared], 
                              columns = ['word', 'chi_squared', 'p_value', 'high_value', 'low_value'])
chi_squared_df = chi_squared_df.sort_values('p_value')
chi_squared_df = chi_squared_df[(chi_squared_df['p_value'] < 0.05) & 
                                (chi_squared_df['high_value'] > chi_squared_df['low_value']) ]
chi_squared_df
Out[51]:
word chi_squared p_value high_value low_value
179 monitora 45.947439 1.214686e-11 35 13
78 target_blanksarah 24.358972 7.995351e-07 40 33
226 target_blankkelly 20.921282 4.785483e-06 25 16
93 african 17.283572 3.219584e-05 35 33
494 painter 16.941684 3.854581e-05 16 8
159 target_blankjimmy 16.114608 5.962236e-05 28 24
217 target_blankjon 13.979777 1.847876e-04 23 19
498 pulitzer 13.429676 2.476749e-04 15 9
388 liquid 12.719123 3.619354e-04 17 12
467 example 11.997980 5.325823e-04 15 10
592 spirit 11.341071 7.581159e-04 13 8
689 andrew 11.049381 8.871680e-04 12 7
557 plants 9.954357 1.604691e-03 13 9
547 relative 9.954357 1.604691e-03 13 9
309 border 9.792563 1.752191e-03 18 16
422 process 9.542069 2.008152e-03 15 12
991 physics 8.682874 3.212141e-03 9 5
947 spiritual 8.682874 3.212141e-03 9 5
439 string 8.057304 4.532057e-03 14 12
885 elements 7.198788 7.295282e-03 9 6
625 marine 6.779092 9.223181e-03 11 9
461 jersey 6.652781 9.900120e-03 13 12
721 greece 6.361380 1.166308e-02 10 8
829 translated 5.950459 1.471346e-02 9 7
848 window 5.950459 1.471346e-02 9 7
861 filled 5.950459 1.471346e-02 9 7
590 composed 5.772340 1.628035e-02 11 10
618 persian 5.772340 1.628035e-02 11 10
584 particles 5.772340 1.628035e-02 11 10
602 colony 5.772340 1.628035e-02 11 10
992 physicist 5.549240 1.848871e-02 8 6
694 freedom 5.333590 2.091826e-02 10 9
788 committee 4.896281 2.691459e-02 9 8
792 portuguese 4.896281 2.691459e-02 9 8
919 describe 4.460992 3.467736e-02 8 7
907 nature 4.460992 3.467736e-02 8 7
In [52]:
chi_squared_df.shape[0]
Out[52]:
36

Conclusion¶

In this project, a dataset of Jeopardy questions was used to look for patterns in the questions that could help a contestant win. After exploring the data, we found that:

  • On average, about 6% of the words in answers also appear in the corresponding questions, so the chance of deducing the answer from the question is quite low.
  • On average, about 69% of the complex words in a question have already appeared in earlier questions, so studying past questions can be really helpful.

Then we focused our study on high value questions instead of low value ones, since these are worth more money. Using a chi-squared test, we obtained a list of 36 words whose usage differs significantly between high value and low value questions and which appear more often in the high value ones.

A natural next step would be to find the high value questions that contain these words; those questions could then be recommended for study.
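As a rough sketch of that next step (assuming the jeopardy and chi_squared_df frames from above are still in memory), one could collect the high value questions that mention any of the 36 words:

# hypothetical follow-up: high value questions containing any of the significant words
significant_words = chi_squared_df['word'].tolist()
pattern = r"\b(?:" + "|".join(significant_words) + r")\b"
study_questions = jeopardy[
    (jeopardy['high_value'] == 1) &
    jeopardy['clean_question'].str.contains(pattern, regex=True)
]
print(study_questions[['Category', 'clean_question', 'clean_value']].head())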