For this project, we'll be looking at the Medical Cost Data Set from Kaggle. This dataset contains information on individual medical insurance bills. Each bill is associated with some demographic and personal characteristics of the person who received it.
For our regression problem, we're interested in how these characteristics relate to the total medical cost. The cost is a continuous, positive number, which makes it a good candidate for linear regression. For this guided project, we want to construct the best possible predictive model of cost, given some information about the patient. Predicting medical costs matters because it helps hospitals forecast revenue and plan for the procedures their patient population will need.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Load in the insurance dataset
insurance = pd.read_csv("insurance.csv")
# Columns in the dataset
insurance.columns
Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
The charges column is our outcome; the other columns are the potential predictors available to the model.
insurance.hist("charges")
array([[<Axes: title={'center': 'charges'}>]], dtype=object)
The charges column is highly skewed to the right: most bills are moderate, but a long tail of extremely expensive charges pulls the distribution out, and these extreme high values occur far more often than extremely small ones. This makes it unlikely that the errors in a model of the raw charges will truly be centered at zero, so it is worth log-transforming the outcome.
insurance["log_charges"] = np.log2(insurance["charges"])
insurance.hist("log_charges")
array([[<Axes: title={'center': 'log_charges'}>]], dtype=object)
The log-transformed charges values are much more symmetric, which is what we wanted. This makes it more plausible that the model's errors will be unbiased.
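As a quick numerical check on the two histograms, we can compare the skewness of the raw and log-transformed charges; positive skew indicates a right tail. This is a small sketch using pandas' built-in skew(), with the output omitted here.
# Skewness of the raw vs. log-transformed charges (positive = right-skewed)
insurance[["charges", "log_charges"]].skew()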
# Checking the correlation between the continuous columns in the insurance data
insurance.corr(numeric_only=True)
|             | age      | bmi      | children | charges  | log_charges |
|-------------|----------|----------|----------|----------|-------------|
| age         | 1.000000 | 0.109272 | 0.042469 | 0.299008 | 0.527834    |
| bmi         | 0.109272 | 1.000000 | 0.012759 | 0.198341 | 0.132669    |
| children    | 0.042469 | 0.012759 | 1.000000 | 0.067998 | 0.161336    |
| charges     | 0.299008 | 0.198341 | 0.067998 | 1.000000 | 0.892964    |
| log_charges | 0.527834 | 0.132669 | 0.161336 | 0.892964 | 1.000000    |
age has a correlation of about 0.30 with charges, bmi about 0.20, and children about 0.07. Against the log-transformed outcome, age's correlation is noticeably higher, at about 0.53.
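Since the model will target the log-transformed outcome, it can be more useful to rank the numeric predictors by their correlation with log_charges directly. A small sketch of that check (just a reordering of the table above):
# Correlations with the log-transformed outcome, strongest first
insurance.corr(numeric_only=True)["log_charges"].sort_values(ascending=False)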
insurance.boxplot(column = ["log_charges"], by = "sex")
<Axes: title={'center': 'log_charges'}, xlabel='sex'>
insurance.boxplot(column = ["log_charges"], by = "smoker")
<Axes: title={'center': 'log_charges'}, xlabel='smoker'>
insurance.boxplot(column = ["log_charges"], by = "region")
<Axes: title={'center': 'log_charges'}, xlabel='region'>
Males seem to have a wider spread of charges than females. Smokers have much higher costs than non-smokers. There don't appear to be any appreciable differences between regions.
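To put a rough number on the smoker gap visible in the boxplots, a quick group summary of the raw charges helps; this is a sketch, with the output omitted.
# Median charges for smokers vs. non-smokers
insurance.groupby("smoker")["charges"].median()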
Based on the univariate relationships shown above, age, bmi, and smoking status are all positively associated with higher charges. We'll include these predictors in our final model.
# Encode smoker as a boolean predictor, then split the data into training and test sets
insurance["is_smoker"] = insurance["smoker"] == "yes"
X = insurance[["age", "bmi", "is_smoker"]]
y = insurance["log_charges"]
# 75% for training set, 25% for test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 1)
# Training and checking model performance on training set
insurance_model = LinearRegression()
insurance_model.fit(X_train, y_train)
LinearRegression()
# Get predicted values by model
y_pred = insurance_model.predict(X_train)
# MSE on the log scale for the insurance charges
mean_squared_error(y_train, y_pred)
0.4546665339270644
# Back-transforming the log-scale MSE with base 2 (since we used log2);
# this is only a rough multiplicative error factor, not the MSE in dollars
np.exp2(mean_squared_error(y_train, y_pred))
1.370466
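Exponentiating a log-scale MSE only gives a crude multiplicative summary. If we want an error measure in actual dollars, one option is to back-transform the predictions themselves (again base 2, to invert the log2 transform) and compare them to the raw charges. A sketch of that idea, using mean absolute error for interpretability; the resulting value isn't shown here.
# Back-transform predictions to dollars and measure error on the original scale
from sklearn.metrics import mean_absolute_error
train_pred_dollars = np.exp2(y_pred)       # 2 ** predicted log2 charges
train_true_dollars = np.exp2(y_train)      # recovers the original charges
mean_absolute_error(train_true_dollars, train_pred_dollars)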
# Coefficient of determination
r2_score(y_train, y_pred)
0.7421118855283421
# Quick visual check of residuals
check = pd.DataFrame()
check["residuals"] = y_train - y_pred
check["fitted"] = y_pred
check.plot.scatter(x = "fitted", y = "residuals")
<Axes: xlabel='fitted', ylabel='residuals'>
The residuals suggest some violations of the linear regression assumptions. As the fitted values get larger, the residuals trend downward, whereas we would expect an even band centered around zero. This doesn't necessarily make the predictions unusable, but it does call the model's assumptions into question.
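To see where this pattern comes from, one option is to color the residuals by smoking status, since smoking was by far the strongest categorical predictor. This is a sketch building on the check DataFrame above; the added smoker column and the color mapping are illustrative choices.
# Color residuals by smoker status to look for group-specific structure
check["smoker"] = insurance.loc[check.index, "smoker"]
colors = check["smoker"].map({"yes": "red", "no": "blue"})
check.plot.scatter(x = "fitted", y = "residuals", c = colors)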
# Getting the non-intercept coefficients
insurance_model.coef_
array([0.04892865, 0.01523672, 2.23063344])
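Because the outcome is log2(charges), each coefficient can be read on a multiplicative scale by raising 2 to its power. A small sketch pairing the coefficients with their feature names (coef_table is just an illustrative name):
# Pair each coefficient with its feature and express it as a multiplier on charges
coef_table = pd.DataFrame({
    "feature": X_train.columns,
    "coef_log2": insurance_model.coef_,
    "multiplier_per_unit": np.exp2(insurance_model.coef_),
})
coef_table
For example, the smoker coefficient of about 2.23 on the log2 scale corresponds to roughly a 2 ** 2.23 ≈ 4.7-fold increase in predicted charges, holding age and bmi fixed.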
Note: we are not testing whether these associations are statistically significant, so we can't claim they are truly non-zero. Our primary goal here is prediction.
# Getting MSE on the test set
test_pred = insurance_model.predict(X_test)
mean_squared_error(y_test, test_pred)
0.4355350875308211
# Back-transforming the test-set MSE with base 2 (again, only a rough multiplicative measure)
np.exp2(mean_squared_error(y_test, test_pred))
1.352412
The test MSE was about 0.435, while the training MSE was about 0.454. The two errors match closely (the test error is even slightly lower), so the model does not appear to be overfit. The residual plot, however, suggests that the model predicts far lower costs for some subjects who were actually charged much more, so it struggles with these high-cost cases. On the whole, its predictions are too conservative.
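As an extra guard against overfitting beyond a single train/test comparison, we could also cross-validate the model on the training set. A brief sketch, assuming the same three predictors; cv_scores is an illustrative name and the fold count is arbitrary.
# 5-fold cross-validated MSE on the log2 scale (sklearn negates the score by convention)
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X_train, y_train,
                            scoring = "neg_mean_squared_error", cv = 5)
-cv_scores.mean()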
We might improve the model by including more complex terms in the regression, such as interactions or quadratic terms.
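One way to try this in scikit-learn is to expand the feature matrix with PolynomialFeatures before fitting, which adds squared terms and pairwise interactions such as age × is_smoker. A minimal sketch of that idea; the degree is illustrative, not tuned.
# Degree-2 expansion: original features, their squares, and pairwise interactions
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
poly_model = make_pipeline(PolynomialFeatures(degree = 2, include_bias = False),
                           LinearRegression())
poly_model.fit(X_train, y_train)
mean_squared_error(y_test, poly_model.predict(X_test))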