In this project, we're going to analyze a dataset about the westbound traffic on the I-94 Interstate highway.
The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.
John Hogue made the dataset we'll be working with available, and you can download it from the UCI Machine Learning Repository.
import pandas as pd
i_94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i_94.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
i_94.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
i_94.describe()
| | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
---|---|---|---|---|---|
count | 48204.000000 | 48204.000000 | 48204.000000 | 48204.000000 | 48204.000000 |
mean | 281.205870 | 0.334264 | 0.000222 | 49.362231 | 3259.818355 |
std | 13.338232 | 44.789133 | 0.008168 | 39.015750 | 1986.860670 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 272.160000 | 0.000000 | 0.000000 | 1.000000 | 1193.000000 |
50% | 282.450000 | 0.000000 | 0.000000 | 64.000000 | 3380.000000 |
75% | 291.806000 | 0.000000 | 0.000000 | 90.000000 | 4933.000000 |
max | 310.070000 | 9831.300000 | 0.510000 | 100.000000 | 7280.000000 |
i_94.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
The dataset has 48,204 rows and 9 columns, and there are no null values. Each row describes traffic and weather data for a specific hour — we have data from 2012-10-02 09:00:00 until 2018-09-30 23:00:00.
A station located approximately midway between Minneapolis and Saint Paul records the traffic data (see the dataset documentation). For this station, the direction of the route is westbound (i.e., cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of the station. In other words, we should avoid generalizing our results for the entire I-94 highway.
We're going to start our analysis by examining the distribution of the traffic_volume column.
import matplotlib.pyplot as plt
%matplotlib inline
i_94['traffic_volume'].plot.hist()
plt.show()
i_94['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Between 2012-10-02 09:00:00 and 2018-09-30 23:00:00, the hourly traffic volume varied from 0 to 7,280 cars, with an average of 3,260 cars.
About 25% of the time, there were only 1,193 cars or fewer passing the station each hour — this probably occurs during the night, or when a road is under construction. However, about 25% of the time, the traffic volume was four times as much (4,933 cars or more).
This observation gives our analysis an interesting direction: comparing daytime data with nighttime data.
We'll start by dividing the dataset into two parts:
- Daytime data: hours from 7 AM to 7 PM (12 hours)
- Nighttime data: hours from 7 PM to 7 AM (12 hours)

While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.
# Parse the timestamps so the .dt accessor is available
i_94['date_time'] = pd.to_datetime(i_94['date_time'])
hour = i_94['date_time'].dt.hour
# Daytime: from 7:00 up to, but not including, 19:00
day = i_94[(hour >= 7) & (hour < 19)].copy()
print(day.shape)
# Nighttime: from 19:00 up to, but not including, 7:00
night = i_94[(hour >= 19) | (hour < 7)].copy()
print(night.shape)
(23877, 9)
(24327, 9)
This significant difference in row numbers between day and night is due to a few hours of missing data. For instance, if you look at rows 176 and 177 (i_94.iloc[176:178]), you'll notice there's no data for two hours (4 and 5).
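One way to locate gaps like this is to compare consecutive timestamps. The sketch below is only an illustration: it uses a handful of made-up timestamps rather than the real dataset, and the names stamps, step, and gaps are hypothetical. It flags every record that follows a jump of more than one hour.

```python
import pandas as pd

# Toy timestamps (assumed for illustration): 04:00 and 05:00 are missing
stamps = pd.Series(pd.to_datetime([
    '2012-10-09 02:00:00',
    '2012-10-09 03:00:00',
    '2012-10-09 06:00:00',
    '2012-10-09 07:00:00',
]))

# Time elapsed since the previous record; the first row has no predecessor (NaT)
step = stamps.diff()

# Keep only the records that follow a gap longer than one hour
gaps = pd.DataFrame({'resumes_at': stamps, 'step': step})
gaps = gaps[gaps['step'] > pd.Timedelta(hours=1)]
print(gaps)
```

On the project data, the same idea would apply to i_94['date_time'] once it has been converted with pd.to_datetime.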
Now that we've isolated day and night, we're going to look at the histograms of traffic volume side-by-side by using a grid chart.
plt.figure(figsize=(11,3.5))
plt.subplot(1, 2, 1)
plt.hist(day['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Day')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.subplot(1,2,2)
plt.hist(night['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Night')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.show()
day['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
night['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The histogram that shows the distribution of traffic volume during the day is left skewed. This means that most of the traffic volume values are high — there are 4,252 or more cars passing the station each hour 75% of the time (because 25% of values are less than 4,252).
The histogram displaying the nighttime data is right skewed. This means that most of the traffic volume values are low — 75% of the time, the number of cars that passed the station each hour was less than 2,819.
Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.
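The skew directions described above can also be checked numerically. The sketch below is a minimal illustration on two made-up samples (the names mostly_high and mostly_low and their values are assumptions, not dataset values). It relies on Series.skew(), which returns a negative number for left-skewed data and a positive number for right-skewed data.

```python
import pandas as pd

# Toy samples (assumed for illustration, not taken from the dataset)
mostly_high = pd.Series([1000, 4200, 4800, 5000, 5200, 5500, 5600])  # long tail to the left
mostly_low = pd.Series([300, 400, 500, 700, 1200, 2800, 6000])       # long tail to the right

# Negative skewness means left skewed; positive skewness means right skewed
print(mostly_high.skew())
print(mostly_low.skew())
```

On the project data, the equivalent calls would be day['traffic_volume'].skew() and night['traffic_volume'].skew().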
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of day.
We're going to look at a few line plots showing how the traffic volume changes according to the following:
- Month
- Day of the week
- Time of day
day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
plt.plot(by_month['traffic_volume'])
# by_month['traffic_volume'].plot.line()
plt.show()
The traffic looks less heavy during cold months (November–February) and more intense during warm months (March–October), with one interesting exception: July. Is there anything special about July? Is traffic significantly less heavy in July each year?
To answer the last question, let's see how the traffic volume changed each year in July.
day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]
plt.plot(only_july.groupby('year').mean(numeric_only=True)['traffic_volume'])
plt.show()
Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction; this article from 2016 supports this hypothesis.
As a tentative conclusion here, we can say that warm months generally show heavier traffic compared to cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.
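The year-by-year check we used for July can be extended to every month at once with a pivot table. The sketch below is a hedged illustration on a tiny made-up frame (the name toy and all of its values are assumptions). Each cell of the result holds the mean traffic volume for one year and month, which makes a one-off dip easier to tell apart from a recurring seasonal pattern.

```python
import pandas as pd

# Toy hourly records (assumed values, for illustration only)
toy = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2015-06-01 10:00', '2015-07-01 10:00',
        '2016-06-01 10:00', '2016-07-01 10:00',
        '2016-07-02 10:00',
    ]),
    'traffic_volume': [4900, 5000, 4800, 3000, 3400],
})
toy['year'] = toy['date_time'].dt.year
toy['month'] = toy['date_time'].dt.month

# One row per year, one column per month, mean volume in each cell
year_month = toy.pivot_table(index='year', columns='month',
                             values='traffic_volume', aggfunc='mean')
print(year_month)
```

In this toy table, only the 2016 row shows a low value in the July column, which is the kind of pattern that would point to a one-time event.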
Let's now look at a more granular indicator: day of week.
# dayofweek encodes Monday as 0 and Sunday as 6
day['day_of_week'] = day['date_time'].dt.dayofweek
by_day_of_week = day.groupby('day_of_week').mean(numeric_only=True)
by_day_of_week['traffic_volume'].plot.line()
plt.show()
Traffic volume is significantly heavier on business days (Monday–Friday): every business day except Monday averages over 5,000 cars. Traffic is lighter on weekends, with values below 4,000 cars.
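Because dt.dayofweek returns integers, it can help to confirm which number stands for which day. The short sketch below uses three made-up dates (the name stamps is hypothetical) and compares dt.dayofweek with dt.day_name(), assuming the default English locale.

```python
import pandas as pd

# Three sample dates: a Monday, a Saturday, and a Sunday
stamps = pd.Series(pd.to_datetime(['2012-10-01', '2012-10-06', '2012-10-07']))

day_numbers = stamps.dt.dayofweek.tolist()
day_names = stamps.dt.day_name().tolist()

print(day_numbers)  # [0, 5, 6]
print(day_names)    # ['Monday', 'Saturday', 'Sunday']
```

So on the line plot, 0 through 4 on the x-axis are the business days and 5 and 6 are the weekend.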
Let's now see what values we have based on time of the day. The weekends, however, will drag down the average values, so we're going to look at the averages for business days and weekends separately.
day['hour'] = day['date_time'].dt.hour
# Business days are 0-4 (Monday to Friday); weekend days are 5 and 6
weekday = day[day['day_of_week'] < 5].copy()
weekend = day[day['day_of_week'] >= 5].copy()
by_hour_weekday = weekday.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
plt.figure(figsize=(11, 3.5))
plt.subplot(1,2,1)
plt.plot(by_hour_weekday['traffic_volume'])
plt.xlim(6, 20)
plt.ylim(1500, 6500)
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Weekday Traffic by Hour')
plt.subplot(1,2,2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlim(6, 20)
plt.ylim(1500, 6500)
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Weekend Traffic by Hour')
plt.show()
At each hour of the day, the traffic volume is generally higher during business days compared to the weekends. As one might expect, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
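Rush hours can also be read off the hourly averages in code rather than from the plot. The sketch below is an illustration on made-up hourly averages (the name by_hour, its values, and the 6,000 threshold are assumptions chosen to mirror the text). It finds the single busiest hour and lists every hour at or above the threshold.

```python
import pandas as pd

# Toy hourly averages (assumed values), indexed by hour of day from 7 to 18
by_hour = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4800, 5200, 5700, 6100, 6200, 5800, 4400],
    index=range(7, 19), name='traffic_volume')

# Hour with the highest average volume
busiest_hour = by_hour.idxmax()

# Every hour whose average reaches the chosen threshold
rush_hours = by_hour[by_hour >= 6000].index.tolist()

print(busiest_hour)
print(rush_hours)
```

On the project data, the same two lines could be applied to by_hour_weekday['traffic_volume'].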
To summarize, we found a few time-related indicators of heavy traffic:
- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier on business days compared to weekends.
- On business days, the rush hours are around 7 and 16.
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.
A few of these columns are numerical, so let's start by looking up their correlation values with traffic_volume.
day.corr(numeric_only=True)['traffic_volume']
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
year             -0.003557
day_of_week      -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
Among the weather columns, temperature shows the strongest correlation with traffic_volume, at just +0.13. The other relevant columns (rain_1h, snow_1h, clouds_all) don't show any strong correlation with traffic_volume.
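To make these numbers less of a black box: by default, corr() computes the Pearson correlation coefficient. The sketch below recomputes it by hand on a small made-up frame (the name toy and its values are assumptions, not dataset values) and compares the result with what pandas returns.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({
    'temp': [270.0, 275.0, 280.0, 285.0, 290.0, 295.0],
    'traffic_volume': [4700, 5100, 4600, 5000, 4800, 5200],
})

# Deviations of each column from its own mean
x = toy['temp'] - toy['temp'].mean()
y = toy['traffic_volume'] - toy['traffic_volume'].mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['temp'].corr(toy['traffic_volume'])

print(r_manual, r_pandas)
```

A value near 0 means the two columns show little linear relationship, which is how we read the weather correlations above.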
Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.
day.plot.scatter('traffic_volume', 'temp')
plt.ylim(230, 320) # two wrong 0K temperatures mess up the y-axis
plt.show()
We can conclude that temperature doesn't look like a solid indicator of heavy traffic.
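Instead of hiding the wrong 0 K temperatures with axis limits, we could also filter them out before plotting or computing correlations. The sketch below shows the idea on a made-up frame (the names toy and valid are hypothetical, and treating every 0 K reading as an error is an assumption).

```python
import pandas as pd

# Toy data (assumed values); two rows carry an implausible 0 K reading
toy = pd.DataFrame({
    'temp': [0.0, 272.5, 288.3, 0.0, 301.1],
    'traffic_volume': [4100, 4900, 5200, 3900, 5000],
})

# Assumption: a temperature of 0 K is a recording error, so drop those rows
valid = toy[toy['temp'] > 0]

print(valid)
```

With the invalid rows removed, the y-axis of a scatter plot scales to the real temperature range without a manual limit.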
Let's now look at the other weather-related columns: weather_main and weather_description.
To start, we're going to group the data by weather_main and look at the traffic_volume averages.
by_weather_main = day.groupby('weather_main').mean(numeric_only=True)
by_weather_main['traffic_volume'].plot.barh()
plt.show()
by_weather_description = day.groupby('weather_description').mean(numeric_only=True)
by_weather_description['traffic_volume'].plot.barh(figsize = (5, 15))
plt.show()
It looks like there are three weather types where traffic volume exceeds 5,000:
It's not clear why these weather types have the highest average traffic values — this is bad weather, but not that bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
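Before reading too much into these averages, it may be worth checking how many rows stand behind each one, since a mean over a handful of hours is far less reliable than a mean over thousands. The sketch below is a hedged illustration on made-up data (the name toy, the weather labels, and the values are all assumptions). It reports the mean and the row count side by side.

```python
import pandas as pd

# Toy data (assumed labels and values, for illustration only)
toy = pd.DataFrame({
    'weather_main': ['Clouds', 'Clouds', 'Clouds', 'Rain', 'Rain', 'Squall'],
    'traffic_volume': [4800, 5000, 4900, 4700, 4900, 5400],
})

# Mean traffic volume and number of rows for each weather type
summary = toy.groupby('weather_main')['traffic_volume'].agg(['mean', 'count'])
summary = summary.sort_values('mean', ascending=False)

print(summary)
```

In this toy example, the weather type with the highest mean is backed by a single row, which is exactly the situation where an average should be treated with caution.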
In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators: