This is the first data analysis project. It comes with several helper functions so that the analysis itself can avoid advanced packages such as numpy and pandas. We stick to native Python functions in order to practice the basics.
In this project, we will answer the following questions:
First of all, we need some helper functions for the project.
import pandas as pd
import re
import matplotlib.pyplot as plt
from IPython.display import display, HTML
def data_from_url(url):
    df = pd.read_html(url)[1]
    lol = df.to_numpy().tolist()
    return lol
def fetch_year(date_string):
    # Raw string so that \d is passed to the regex engine unchanged.
    return int(re.findall(r"\d{4}", date_string)[0])
def barplot(list_of_2_element_list):
    d = {ya[0]:ya[1] for ya in list_of_2_element_list}
    plt.figure(figsize=(9,15))
    axes = plt.axes()
    axes.get_xaxis().set_visible(False)
    spines = axes.spines
    spines['top'].set_visible(False)
    spines['right'].set_visible(False)
    spines['bottom'].set_visible(False)
    spines['left'].set_visible(False)
    ax = plt.barh(*zip(*d.items()), height=.5)
    plt.yticks(list(d.keys()), list(d.keys()))
    plt.xticks(range(4), range(4))
    rectangles = ax.patches
    for rectangle in rectangles:
        x_value = rectangle.get_width()
        y_value = rectangle.get_y() + rectangle.get_height() / 2
        space = 5
        ha = 'left'
        label = "{}".format(x_value)
        if x_value > 0:
            plt.annotate(
                label,
                (x_value, y_value),
                xytext=(space, 0),
                textcoords="offset points",
                va='center',
                ha=ha)
    axes.tick_params(tick1On=False)
    plt.show()
def unique_countries(countries):
    s = pd.Series(countries)
    return list(s.unique())
def display_no_index(df):
    display(HTML(df.to_html(index=False)))
def print_pretty_table(countries_frequency):
    # countries_frequency is a list of [country, count] pairs; sort by count, descending.
    ranked = sorted(countries_frequency, key=lambda pair: pair[1], reverse=True)
    d = {
        "Country": [pair[0] for pair in ranked],
        "Number of Occurrences": [pair[1] for pair in ranked],
    }
    display_no_index(pd.DataFrame(d))
# df = pd.read_html("https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes")[1]
# df = df[["Date", "Prison name", "Country", "Succeeded", "Escapee(s)"]]
Now, let's get the data from the List of helicopter prison escapes Wikipedia article.
url = 'https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes'
data = data_from_url(url)
Let's print the first three rows and see what they look like.
for row in data[:3]:
    print(row)
['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro', "Joel David Kaplan was a New York businessman who had been arrested for murder in 1962 in Mexico City and was incarcerated at the Santa Martha Acatitla prison in the Iztapalapa borough of Mexico City. Joel's sister, Judy Kaplan, arranged the means to help Kaplan escape, and on August 19, 1971, a helicopter landed in the prison yard. The guards mistakenly thought this was an official visit. In two minutes, Kaplan and his cellmate Carlos Antonio Contreras, a Venezuelan counterfeiter, were able to board the craft and were piloted away, before any shots were fired.[9] Both men were flown to Texas and then different planes flew Kaplan to California and Contreras to Guatemala.[3] The Mexican government never initiated extradition proceedings against Kaplan.[9] The escape is told in a book, The 10-Second Jailbreak: The Helicopter Escape of Joel David Kaplan.[4] It also inspired the 1975 action movie Breakout, which starred Charles Bronson and Robert Duvall.[9]"]
['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus Twomey Kevin Mallon", 'On October 31, 1973, an IRA member hijacked a helicopter and forced the pilot to land in the exercise yard of Dublin\'s Mountjoy Jail\'s D Wing at 3:40\xa0p.m., October 31, 1973. Three members of the IRA were able to escape: JB O\'Hagan, Seamus Twomey and Kevin Mallon. Another prisoner who also was in the prison was quoted as saying, "One shamefaced screw apologised to the governor and said he thought it was the new Minister for Defence (Paddy Donegan) arriving. I told him it was our Minister of Defence leaving." The Mountjoy helicopter escape became Republican lore and was immortalized by "The Helicopter Song", which contains the lines "It\'s up like a bird and over the city. There\'s three men a\'missing I heard the warder say".[1]']
['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock Trapnell Martin Joseph McNally James Kenneth Johnson', "43-year-old Barbara Ann Oswald hijacked a Saint Louis-based charter helicopter and forced the pilot to land in the yard at USP Marion. While landing the aircraft, the pilot, Allen Barklage, who was a Vietnam War veteran, struggled with Oswald and managed to wrestle the gun away from her. Barklage then shot and killed Oswald, thwarting the escape.[10] A few months later Oswald's daughter hijacked TWA Flight 541 in an effort to free Trapnell."]
Next, we remove the last column, which holds the long description, from every row. We initialize an index variable with the value of 0. The purpose of this variable is to help us track which row we're modifying.
index = 0
for row in data:
    data[index] = row[:-1]
    index += 1
Let's check that the last column has been removed.
print(data[:3])
[['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus Twomey Kevin Mallon"], ['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock Trapnell Martin Joseph McNally James Kenneth Johnson']]
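As a side note, the manual index counter used above is not the only way to drop the last column. The sketch below works on a small made-up list called sample_rows (an assumption for illustration, not the scraped data) and shows two equivalent alternatives: enumerate, which supplies the index for us, and a list comprehension, which builds a new trimmed list.

```python
# Hypothetical sample rows standing in for the scraped table (not the real data).
sample_rows = [
    ["August 19, 1971", "Santa Martha Acatitla", "Mexico", "Yes", "names", "long description"],
    ["October 31, 1973", "Mountjoy Jail", "Ireland", "Yes", "names", "long description"],
]

# Option 1: enumerate yields (index, row) pairs, so no manual counter is needed.
trimmed_in_place = [list(row) for row in sample_rows]  # work on a copy of the sample
for index, row in enumerate(trimmed_in_place):
    trimmed_in_place[index] = row[:-1]

# Option 2: a list comprehension builds the trimmed list in one expression.
trimmed_new = [row[:-1] for row in sample_rows]

print(trimmed_in_place == trimmed_new)
print(trimmed_new[0])
```

Both options leave five entries per row. The comprehension is usually the shorter choice when the original list no longer needs to be kept.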
In the code cell below, we iterate over data using the loop variable row. With row[0], we refer to the first entry of row, i.e., the date. With fetch_year(row[0]), we extract the year from that date, and we then replace row[0] with the year that we just extracted.
for row in data:
    row[0] = fetch_year(row[0])
print(data[:3])
[[1971, 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], [1973, 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus Twomey Kevin Mallon"], [1978, 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock Trapnell Martin Joseph McNally James Kenneth Johnson']]
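To make the year extraction step more concrete, here is a minimal standalone sketch of the same idea. The name fetch_year_sketch is hypothetical, chosen so it does not shadow the helper defined earlier, and the explicit error for strings without a four-digit run is an added assumption rather than part of the original helper.

```python
import re

def fetch_year_sketch(date_string):
    """Return the first run of four digits in date_string as an int."""
    match = re.search(r"\d{4}", date_string)
    if match is None:
        # Assumption: fail loudly instead of raising an IndexError.
        raise ValueError(f"no four-digit year found in {date_string!r}")
    return int(match.group())

print(fetch_year_sketch("August 19, 1971"))   # 1971
print(fetch_year_sketch("October 31, 1973"))  # 1973
```

The day numbers ("19", "31") are skipped because the pattern only matches four consecutive digits.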
Before we move on, let's check the earliest and latest years we have in our dataset.
min_year = min(data, key=lambda x: x[0])[0]
max_year = max(data, key=lambda x: x[0])[0]
print(min_year)
print(max_year)
1971
2020
Now we'll create a list of all the years ranging from min_year to max_year. Our goal is to then determine how many prison break attempts there were for each year. Since years in which there weren't any prison breaks aren't present in the dataset, this will make sure we capture them.
attempts_per_years = []
for y in range(min_year, max_year + 1):
    attempts_per_years.append([y, 0])
print(attempts_per_years)
[[1971, 0], [1972, 0], [1973, 0], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 0], [1979, 0], [1980, 0], [1981, 0], [1982, 0], [1983, 0], [1984, 0], [1985, 0], [1986, 0], [1987, 0], [1988, 0], [1989, 0], [1990, 0], [1991, 0], [1992, 0], [1993, 0], [1994, 0], [1995, 0], [1996, 0], [1997, 0], [1998, 0], [1999, 0], [2000, 0], [2001, 0], [2002, 0], [2003, 0], [2004, 0], [2005, 0], [2006, 0], [2007, 0], [2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 0], [2014, 0], [2015, 0], [2016, 0], [2017, 0], [2018, 0], [2019, 0], [2020, 0]]
Now we have a list where each element looks like [year, 0].
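The same [year, 0] scaffold can also be sketched with a list comprehension. The bounds below are made-up demo values, not the ones computed from the dataset, and the names are hypothetical.

```python
# Hypothetical bounds for illustration only.
min_year_demo, max_year_demo = 1971, 1975

# One fresh [year, 0] list per year, including years with no recorded attempts.
attempts_demo = [[year, 0] for year in range(min_year_demo, max_year_demo + 1)]
print(attempts_demo)  # [[1971, 0], [1972, 0], [1973, 0], [1974, 0], [1975, 0]]

# Each inner list is a separate object, so incrementing one leaves the others alone.
attempts_demo[0][1] += 1
print(attempts_demo[0], attempts_demo[1])  # [1971, 1] [1972, 0]
```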
To determine how many attempts there were in each year, we use a loop within a loop: for every row in data, we look for the matching year in attempts_per_years and increment its second entry (the one at index 1, which starts at 0) by 1.
for row in data:
    for year_attempt in attempts_per_years:
        if row[0] == year_attempt[0]:
            year_attempt[1] += 1
print(attempts_per_years)
[[1971, 1], [1972, 0], [1973, 1], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 1], [1979, 0], [1980, 0], [1981, 2], [1982, 0], [1983, 1], [1984, 0], [1985, 2], [1986, 3], [1987, 1], [1988, 1], [1989, 2], [1990, 1], [1991, 1], [1992, 2], [1993, 1], [1994, 0], [1995, 0], [1996, 1], [1997, 1], [1998, 0], [1999, 1], [2000, 2], [2001, 3], [2002, 2], [2003, 1], [2004, 0], [2005, 2], [2006, 1], [2007, 3], [2008, 0], [2009, 3], [2010, 1], [2011, 0], [2012, 1], [2013, 2], [2014, 1], [2015, 0], [2016, 1], [2017, 0], [2018, 1], [2019, 0], [2020, 1]]
It would be better if we could visualize this in a friendlier way. matplotlib can help in this case.
%matplotlib inline
barplot(attempts_per_years)
The years in which the most helicopter prison break attempts occurred were 1986, 2001, 2007 and 2009, with a total of three attempts each.
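The busiest years can also be read off programmatically instead of from the plot. The following is a minimal sketch on a short made-up list named yearly_counts_demo (hypothetical values in the same [year, count] shape as attempts_per_years): it finds the highest count first, then collects every year that reaches it, so ties are kept.

```python
# Hypothetical [year, count] pairs, same shape as attempts_per_years.
yearly_counts_demo = [[2000, 2], [2001, 3], [2002, 2], [2003, 1], [2007, 3]]

# Step 1: the highest number of attempts seen in any single year.
highest = max(count for _, count in yearly_counts_demo)

# Step 2: every year that reaches that number (there may be several).
busiest_years = [year for year, count in yearly_counts_demo if count == highest]

print(highest, busiest_years)  # 3 [2001, 2007]
```

Calling max on the list directly would return only one year, which is why the tie-aware second step is used here.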
We can use a dictionary to count the attempts per country in the same way.
countries = {}
for row in data:
    if row[2] not in countries:
        countries[row[2]] = 1
    else:
        countries[row[2]] += 1
print(countries)
{'Mexico': 1, 'Ireland': 1, 'United States': 8, 'France': 15, 'Canada': 4, 'Australia': 2, 'Brazil': 2, 'Italy': 1, 'United Kingdom': 2, 'Puerto Rico': 1, 'Chile': 1, 'Netherlands': 1, 'Greece': 4, 'Belgium': 4, 'Russia': 1}
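The if/else counting pattern used above can be shortened with dict.get, which returns a default when the key is missing. This is a small sketch on a made-up list called sample_countries (an assumption for illustration, not the real column).

```python
# Hypothetical country column for illustration only.
sample_countries = ["France", "Mexico", "France", "Greece", "France", "Greece"]

counts = {}
for country in sample_countries:
    # get() returns 0 the first time a country is seen, so no if/else is needed.
    counts[country] = counts.get(country, 0) + 1

print(counts)  # {'France': 3, 'Mexico': 1, 'Greece': 2}
```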
countries_frequency = []
for key, value in countries.items():
    countries_frequency.append([key, value])
barplot(countries_frequency)
The plot clearly shows that France has had the most attempted helicopter prison breaks, with 15 attempts in total.
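The ranking can be double-checked without a plot by sorting the counts. The sketch below copies the dictionary printed earlier into a variable named country_counts (a hypothetical name) and sorts its items by count in descending order.

```python
# Counts copied from the dictionary printed earlier in this notebook.
country_counts = {
    'Mexico': 1, 'Ireland': 1, 'United States': 8, 'France': 15, 'Canada': 4,
    'Australia': 2, 'Brazil': 2, 'Italy': 1, 'United Kingdom': 2, 'Puerto Rico': 1,
    'Chile': 1, 'Netherlands': 1, 'Greece': 4, 'Belgium': 4, 'Russia': 1,
}

# Sort (country, count) pairs by count, largest first; ties keep their original order.
ranked = sorted(country_counts.items(), key=lambda item: item[1], reverse=True)

for country, count in ranked[:3]:
    print(country, count)
```

Because Python's sort is stable, Canada is listed ahead of Greece and Belgium, which share the same count of 4.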