Analyzing CIA Factbook Data Using SQL¶

Introduction¶

In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:

  • population — the global population.
  • population_growth — the annual population growth rate, as a percentage.
  • area — the total land and water area.

If you are interested in this project, you can download the SQLite factbook.db database here.

In [1]:
%%capture
# capture the stdout/stderr output of the cell
# load the SQL module in the notebook
%load_ext sql
# connects to an SQLite database
%sql sqlite:///factbook.db

Overview of the Data¶

We can now begin by exploring the data.

In [2]:
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
 * sqlite:///factbook.db
Done.
Out[2]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)
In [3]:
%%sql
SELECT *
FROM facts
LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Here are the descriptions for some of the columns:

  • name — the name of the country.
  • area — the total land and sea area of the country.
  • population — the country's population.
  • population_growth— the country's population growth as a percentage.
  • birth_rate — the country's birth rate, or the number of births a year per 1,000 people.
  • death_rate — the country's death rate, or the number of death a year per 1,000 people.
  • area — the country's total area (both land and water).
  • area_land — the country's land area in square kilometers.
  • area_water — the country's water area in square kilometers.

Let's start by calculating some summary statistics and see what they tell us.

Summary Statistics¶

In [4]:
%%sql
SELECT MIN(population) AS min_pop,
       MAX(population) AS max_pop,
       MIN(population_growth) AS min_pop_growth,
       MAX(population_growth) AS max_pop_growth
FROM facts;
 * sqlite:///factbook.db
Done.
Out[4]:
min_pop max_pop min_pop_growth max_pop_growth
0 7256490011 0.0 4.02

A few things are interesting in the summary statistics on the previous screen:

  • There's a country with a population of 0.
  • There's a country with a population of 7256490011 (or more than 7.2 billion people).

Let's use subqueries to concentrate on these countries without using the specific values.

Exploring Outliers¶

In [5]:
%%sql
SELECT *
FROM facts
WHERE population == (SELECT MIN(population) FROM facts);
 * sqlite:///factbook.db
Done.
Out[5]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
250 ay Antarctica None 280000 None 0 None None None None

It seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook page for Antarctica:

screenshot

In [6]:
%%sql
SELECT *
FROM facts
WHERE population == (SELECT MAX(population) FROM facts);
 * sqlite:///factbook.db
Done.
Out[6]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
261 xx World None None None 7256490011 1.08 18.6 7.8 None

We can see that the table contains a row for the whole world, which explains the maximum population of over 7.2 billion we found earlier.

Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world.

Summary Statistics Revisited¶

In [7]:
%%sql
SELECT MIN(population) AS min_pop,
       MAX(population) AS max_pop,
       MIN(population_growth) AS min_pop_growth,
       MAX(population_growth) AS max_pop_growth
FROM facts
WHERE name != 'World';
 * sqlite:///factbook.db
Done.
Out[7]:
min_pop max_pop min_pop_growth max_pop_growth
0 1367485388 0.0 4.02

There's a country whose population closes in on 1.4 billion!

Exploring Average Population and Area¶

Let's explore density. Density depends on the population and the country's area. Let's look at the average values for these two columns.

We should discard the row for the whole planet.

In [8]:
%%sql
SELECT AVG(population) AS avg_population, AVG(area) AS ave_area
FROM facts
WHERE name != 'World'
 * sqlite:///factbook.db
Done.
Out[8]:
avg_population ave_area
32242666.56846473 555093.546184739

We see that the average population is around 32 million and the average area is 555 thousand square kilometers.

Finding Densely Populated Countries¶

T to finish, we'll build on the query above to find countries that are densely populated. We'll identify countries that have the following:

  • Above-average values for population.
  • Below-average values for area.
In [9]:
%%sql
SELECT *
FROM facts
WHERE population > (SELECT AVG(population) FROM facts WHERE name!='World')
AND area < (SELECT AVG(area) FROM facts WHERE name!='World');
 * sqlite:///factbook.db
Done.
Out[9]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
14 bg Bangladesh 148460 130170 18290 168957745 1.6 21.14 5.61 0.46
65 gm Germany 357022 348672 8350 80854408 0.17 8.47 11.42 1.24
80 iz Iraq 438317 437367 950 37056169 2.93 31.45 3.77 1.62
83 it Italy 301340 294140 7200 61855120 0.27 8.74 10.19 4.1
85 ja Japan 377915 364485 13430 126919659 0.16 7.93 9.51 0.0
91 ks Korea, South 99720 96920 2800 49115196 0.14 8.19 6.75 0.0
120 mo Morocco 446550 446300 250 33322699 1.0 18.2 4.81 3.36
138 rp Philippines 300000 298170 1830 100998376 1.61 24.27 6.11 2.09
139 pl Poland 312685 304255 8430 38562189 0.09 9.74 10.19 0.46
163 sp Spain 505370 498980 6390 48146134 0.89 9.64 9.04 8.31
173 th Thailand 513120 510890 2230 67976405 0.34 11.19 7.8 0.0
182 ug Uganda 241038 197100 43938 37101745 3.24 43.79 10.69 0.74
185 uk United Kingdom 243610 241930 1680 64088222 0.54 12.17 9.35 2.54
192 vm Vietnam 331210 310070 21140 94348835 0.97 15.96 5.93 0.3

Some of these countries are generally known to be densely populated, so we have confidence in our results!