Cardiovascular diseases (CVDs) remain a leading cause of mortality globally, with an estimated 17.9 million people succumbing to them annually. The World Health Organization (WHO) considers CVDs a serious public health challenge and recommends early identification of risk factors as a crucial step towards reducing the burden of these diseases. Risk factors such as unhealthy diet, physical inactivity, and mental illness are known to contribute significantly to the development of CVDs. Therefore, it is imperative to identify these risk factors early on to prevent the onset of CVDs and reduce premature deaths.
In this project, we will use a heart disease dataset from Kaggle and build a K-Nearest Neighbors (KNN) classifier to predict the likelihood of a patient having heart disease in the future.
Our R&D company specializes in healthcare solutions and has collected anonymized data from multiple hospitals on several patients to predict the likelihood of a new patient having heart disease in the future. The dataset includes relevant information for each patient, such as personal information and medical data, including whether or not they have had heart disease before.
The dataset features are:
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: oldpeak = ST [numeric value measured in depression]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]

To accurately predict the likelihood of a new patient having heart disease in the future, we will conduct exploratory data analysis on the dataset before building a model.
To proceed with the Exploratory Data Analysis (EDA) phase, we first need to ensure that the necessary libraries for the project have been imported and the dataset has been loaded into our environment.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
# Set the maximum column width
pd.set_option('display.max_colwidth', 100)
# Magic function to display plots in notebook
%matplotlib inline
# Read dataset
heart_df = pd.read_csv('heart.csv')
# View first five rows
heart_df.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
print(f'Number of features: {heart_df.shape[1]}')
print(f'Number of observations: {heart_df.shape[0]}')
Number of features: 12
Number of observations: 918
# Check data types of the features
heart_df.dtypes
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object
After examining the list of features and inspecting some of the rows from the dataset, we have identified both categorical and numerical features. The categorical features have a data type of object in this dataset. Notably, two columns, FastingBS and HeartDisease, have a data type of int64 but are categorical variables since they only take the values 0 and 1.
Our initial focus will be on investigating and analyzing the numerical columns.
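As a quick sanity check, we can list the two groups of columns programmatically (a small optional snippet using pandas' select_dtypes):

# Separate numerical and categorical (object) columns
num_cols = heart_df.select_dtypes(include='number').columns.tolist()
obj_cols = heart_df.select_dtypes(include='object').columns.tolist()
print(f'Numerical columns: {num_cols}')
print(f'Categorical (object) columns: {obj_cols}')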
heart_df.describe()
| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
count | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 |
mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.066570 | 0.497414 |
min | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 60.000000 | -2.600000 | 0.000000 |
25% | 47.000000 | 120.000000 | 173.250000 | 0.000000 | 120.000000 | 0.000000 | 0.000000 |
50% | 54.000000 | 130.000000 | 223.000000 | 0.000000 | 138.000000 | 0.600000 | 1.000000 |
75% | 60.000000 | 140.000000 | 267.000000 | 0.000000 | 156.000000 | 1.500000 | 1.000000 |
max | 77.000000 | 200.000000 | 603.000000 | 1.000000 | 202.000000 | 6.200000 | 1.000000 |
Based on the provided descriptive statistics, we can make the following observations:

- The average age of the patients is around 53 years.
- The median of Cholesterol is higher than its mean by roughly 25 mg/dl, suggesting a left-skewed distribution, with possible outliers (such as the zero values) pulling the mean down.
- RestingBP and Cholesterol have a minimum value of zero, which is not expected in these variables.

Two observations stand out. First, RestingBP and Cholesterol having a minimum value of zero seems unlikely: RestingBP is the resting blood pressure, which should not be zero in a living patient, and a cholesterol value of zero is not possible because serum cholesterol is always present in the blood.
Second, answering questions like this requires domain expertise. For example, we could reach out to a subject matter expert or conduct further research on our own. According to the American Heart Association, serum cholesterol:
"is a composite of different measurements. Your total blood cholesterol is calculated by adding your HDL and LDL cholesterol levels, plus 20% of your triglyceride level."
Although these insights do not directly help us with our prediction model, they do provide us with valuable information regarding the quality of our dataset.
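As a quick illustration of the AHA formula quoted above, here is a hypothetical calculation (the HDL, LDL, and triglyceride values are made up for illustration and are not patient data from our dataset):

# Hypothetical example: total cholesterol = HDL + LDL + 20% of triglycerides
hdl, ldl, triglycerides = 50, 130, 150  # made-up values in mg/dl
total_cholesterol = hdl + ldl + 0.2 * triglycerides
print(f'Estimated total cholesterol: {total_cholesterol} mg/dl')  # 210.0 mg/dl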
We will now move on to explore the categorical features. We will create visualizations to understand how the data is distributed in relation to our target variable, HeartDisease. For example, we will look at how many patients had a resting ECG categorized as ST and how many female patients had heart disease.
# Calculate missing values in all columns
heart_df.isnull().sum()
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
We'll create a bar chart for each categorical column in the heart_df dataset, including FastingBS and HeartDisease. These charts will display the number of rows for each category in the column, with appropriate axis labels, data labels, and a title to make the visualization clear and informative.
# A quick demonstration of how enumerate pairs an index with each item
for i, col in enumerate([1, 2, 3, 4, 5]):
    print(i, ',', col)
0 , 1
1 , 2
2 , 3
3 , 4
4 , 5
# Create a list of categorical columns
cat_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope', 'FastingBS', 'HeartDisease']
# Define number of rows and columns for the subplots
num_rows = 4
num_cols = 2
# Set up the subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(20,18))
# Loop over each categorical column to create a bar plot
for i, col in enumerate(cat_cols):
    # Calculate the row and column index for the current subplot
    row_idx = i // num_cols
    col_idx = i % num_cols
    # Count the number of rows for each category in the column (frequency distribution)
    counts = heart_df[col].value_counts()
    # Create a bar plot of the counts in the current subplot
    ax = counts.plot(kind='bar', rot=0, color=cm.Blues(counts/counts.max()), ax=axes[row_idx, col_idx])
    # Add labels for the x and y axes
    ax.set_xlabel(col, size=12)
    ax.set_ylabel('Count', size=12)
    # Add a title for the plot
    ax.set_title(f'Count of {col}', size=16)
    # Add data labels to the bars
    for container in ax.containers:
        ax.bar_label(container, label_type='edge')
# If there are not enough columns to fill in the last row, create a blank plot
if len(cat_cols) % num_cols != 0:
    blank_ax = axes[num_rows-1, num_cols-1]
    blank_ax.axis('off')
# Adjust the spacing between subplots
fig.tight_layout()
plt.show()
Here are some relevant insights that can be gathered from the plot of each categorical column:

- The target HeartDisease is fairly balanced: 508 patients were diagnosed with heart disease and 410 were not.
Now our next step is to create bar charts for every categorical column in the dataset, including FastingBS. These bar charts will display the number of rows for each category of that column, grouped by HeartDisease.
# Exclude 'HeartDisease' column from 'cat_cols' for grouping
cat_cols_2 = cat_cols[:-1]
# Set up figure and axis objects
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14,12))
# Loop over categorical columns and create bar chart for each
for i, col in enumerate(cat_cols_2):
    # Calculate the row and column index for the current subplot
    row_idx = i // 2
    col_idx = i % 2
    # Group data by HeartDisease and the categorical column
    counts = heart_df.groupby(['HeartDisease', col]).size().reset_index(name='count')
    # Create bar chart
    ax = counts.pivot(index=col, columns='HeartDisease', values='count').plot(kind='bar', ax=axes[row_idx, col_idx], rot=0)
    # Add labels and title
    ax.set_xlabel(col, size=12)
    ax.set_ylabel('Count', size=12)
    ax.set_title(f'Count of {col} grouped by HeartDisease', size=16)
    # Add legend
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles=handles, labels=['No Heart Disease', 'Heart Disease'], loc='upper right')
    # Add data labels above each bar
    for p in ax.patches:
        ax.annotate(str(p.get_height()), xy=(p.get_x() + p.get_width() / 2., p.get_height()),
                    xytext=(0, 5), textcoords='offset points', ha='center', va='center')
plt.tight_layout()
plt.show()
Based on the bar plots, we can make the following observations:
We will continue our exploration of the data by creating various groupings and visualizations to extract more insights. Two questions that we will focus on are:

- What percentage of patients with heart disease are male and over 50 years old?
- What is the median age of patients who were diagnosed with heart disease?

We will begin with the first question: What percentage of patients with heart disease are male and over 50 years old?
# Filter for patients with heart disease
heart_disease_df = heart_df[heart_df['HeartDisease'] == 1]
# Calculate percentage of patients with heart disease who are male and over 50 years old
male_over_50 = ((heart_disease_df['Sex'] == 'M') & (heart_disease_df['Age'] > 50)).sum()
total_heart_disease = len(heart_disease_df)
percent_male_over_50 = male_over_50 / total_heart_disease * 100
# Create pie chart
labels = ['Male and over 50', 'Other']
sizes = [percent_male_over_50, 100 - percent_male_over_50]
explode = (0.1, 0)
fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
ax.axis('equal') # equal aspect ratio ensures that pie is drawn as a circle
ax.set_title('Percentage of Heart Disease Patients Who are Male and Over 50')
plt.show()
The graph above shows that approximately 68% of the patients with heart disease are male and over 50 years old. This is a significant finding as it indicates that there may be a higher risk of developing heart disease in males over 50 years old. This information could be valuable for developing targeted prevention and treatment strategies for this population.
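If we want the exact figure behind the chart, we can print the percentage computed above (a small optional check reusing the percent_male_over_50 variable):

# Exact share of heart disease patients who are male and over 50
print(f'Male and over 50: {percent_male_over_50:.2f}% of patients with heart disease')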
Moving on to the second question: What is the median age of patients who were diagnosed with heart disease?
# Set figure size
plt.figure(figsize=(8, 6))
# Create box plot
bp = plt.boxplot(x=[heart_df[heart_df['HeartDisease'] == 0]['Age'], heart_df[heart_df['HeartDisease'] == 1]['Age']],
labels=['No Heart Disease', 'Heart Disease'], patch_artist=True, widths=0.5, showfliers=True,
flierprops={'marker':'o', 'markerfacecolor':'black', 'markersize':5})
# Set boxplot colors
colors = ['#1f77b4', '#ff7f0e']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
# Set median line color
for median in bp['medians']:
    median.set(color='black', linewidth=2)
# Set axis labels and title
plt.xlabel('Heart Disease Diagnosis')
plt.ylabel('Age')
plt.title('Distribution of Age by Heart Disease Diagnosis')
# Set y-axis range
plt.ylim(20, 90)
# Add horizontal grid lines
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Based on the boxplot visualization, we can see that the median age of patients without heart disease is 51 years old, while the median age of patients with heart disease is 57 years old. This suggests that age may be a risk factor for developing heart disease, as the median age of those with heart disease is higher than those without. However, it's important to note that there is overlap in the age distributions of those with and without heart disease, indicating that age is not the only factor influencing the development of heart disease.
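We can confirm this reading of the boxplot numerically with a quick groupby (an optional check; the exact medians come from the data itself):

# Median age for each diagnosis group
print(heart_df.groupby('HeartDisease')['Age'].median())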
In our previous analysis, we calculated the number of missing values in our entire dataset and didn't find any. Additionally, we observed some interesting patterns in our categorical columns:

- HeartDisease is relatively evenly distributed, with 410 patients not diagnosed with heart disease and 508 diagnosed with heart disease.

Now, before proceeding with feature selection, we need to clean up our data. We noticed that some columns have 0 values that don't make sense. To handle these values, we have a few options:

- Drop the rows that contain the zero values.
- Impute the zeros with the median of the column, or with the median of the column grouped by HeartDisease.
- Predict the missing values with a separate model (model-based imputation).

While the third option could potentially provide us with a more accurate result, the size of our dataset is relatively small, so we can ignore it for the time being. Instead, we will determine the number of zero values in our dataset and choose the best approach to handle them based on their frequency of occurrence.
print(f"Number of rows that have 0 values for RestingBP: {(heart_df['RestingBP'] == 0).sum()}")
print(f"Number of rows that have 0 values for Cholesterol: {(heart_df['Cholesterol'] == 0).sum()}")
Number of rows that have 0 values for RestingBP: 1
Number of rows that have 0 values for Cholesterol: 172
Since our dataset only contains 918 rows, dropping the 172 rows with zero values in the Cholesterol column is not ideal, as we would lose a significant amount of data and be left with only 746 rows. Therefore, we will opt for the second approach: imputing zeros with the median value of the column grouped by HeartDisease.

Note: This approach will only work if there are no missing values in the HeartDisease column.
# Create a copy of the original dataframe
heart_clean_df = heart_df.copy()
# Impute 0 values in RestingBP with median value of the column grouped by HeartDisease
heart_clean_df['RestingBP'] = heart_clean_df.groupby('HeartDisease', group_keys=False)['RestingBP'].apply(lambda x: x.replace(0, x.median()))
# Impute 0 values in Cholesterol with median value of the column grouped by HeartDisease
heart_clean_df['Cholesterol'] = heart_clean_df.groupby('HeartDisease', group_keys=False)['Cholesterol'].apply(lambda x: x.replace(0, x.median()))
Let's verify the number of rows with 0 values in the RestingBP and Cholesterol columns after the imputation process.
heart_clean_df[["Cholesterol", "RestingBP"]].describe()
| | Cholesterol | RestingBP |
|---|---|---|
count | 918.000000 | 918.000000 |
mean | 239.675381 | 132.540305 |
std | 54.328249 | 17.989941 |
min | 85.000000 | 80.000000 |
25% | 214.000000 | 120.000000 |
50% | 225.000000 | 130.000000 |
75% | 267.000000 | 140.000000 |
max | 603.000000 | 200.000000 |
The minimum values for both columns have been updated and there are no more zero values present in either of them.
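As an additional direct check, we can count the remaining zeros explicitly (a small optional snippet):

# Count zero values remaining after imputation
print((heart_clean_df[['RestingBP', 'Cholesterol']] == 0).sum())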
After cleaning our data, we can now move forward with feature selection for our model. Based on our EDA and a general understanding of the features, we have identified the following features as a starting point:

- Age
- Sex
- ChestPainType
- Cholesterol
- FastingBS

Please note: the above list of features is not exhaustive, so we may also want to explore and select different features for the model and provide reasoning for those choices.
We'll also explore how the columns correlate to one another. Before we attempt that, we'll convert our categorical columns into dummy variables.
# Convert categorical variable into dummy variables
heart_clean_df = pd.get_dummies(heart_clean_df, drop_first=True)
# View results
heart_clean_df.head()
| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | Sex_M | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ExerciseAngina_Y | ST_Slope_Flat | ST_Slope_Up |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 37 | 130 | 283 | 0 | 98 | 0.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
4 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
print(f'Number of columns with dummy variables: {heart_clean_df.shape[1]}')
Number of columns with dummy variables: 16
We can observe that the number of columns has increased from 12 to 16 after converting categorical features into dummy variables. With this expanded dataset, we can now create a Pearson's correlation heat map to explore how the columns correlate with one another.
# Calculate Pearson's correlation matrix
corr_mat = abs(heart_clean_df.corr())
# Create heatmap
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(corr_mat, annot=True, cmap='Blues')
plt.title("Pearson's Correlation Heatmap for Heart Disease Data", fontsize=16)
plt.show()
Let's determine which features are moderately correlated to HeartDisease by analyzing the heat map.
# Create a heatmap of the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(corr_mat[(corr_mat > 0.3) & (corr_mat < 1)], annot=True, cmap='Blues')
plt.title('Features Moderately Correlated with HeartDisease', size=16)
plt.show()
# Select top 7 features using square of the Pearson correlation
top_7_features = (corr_mat['HeartDisease'] ** 2).sort_values(ascending=False)[1:8].index.tolist()
# Create bar plot
plt.figure(figsize=(10, 5))
plt.bar(x=top_7_features, height=corr_mat.loc[top_7_features, 'HeartDisease'])
plt.xticks(rotation=45)
plt.title('Top 7 Correlated Features to Heart Disease: Bar Plot', fontsize=16)
plt.ylabel('Absolute Pearson Correlation Coefficient')
plt.xlabel('Features')
plt.show()
After analyzing the correlation heat map, we have identified the following features as having a moderate correlation (absolute correlation coefficient greater than 0.3) with HeartDisease:

- Oldpeak
- Sex_M
- ExerciseAngina_Y
- ST_Slope_Flat
- ST_Slope_Up

We could also keep MaxHR or ChestPainType_ATA in mind as additional features to consider.

Interestingly, Cholesterol is not strongly correlated to HeartDisease, so we can consider ignoring it for now.
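The same shortlist can be pulled out of the correlation matrix programmatically (a small optional sketch using the 0.3 threshold; recall that corr_mat already holds absolute values):

# Features whose absolute correlation with HeartDisease exceeds 0.3
target_corr = corr_mat['HeartDisease'].drop('HeartDisease')
print(target_corr[target_corr > 0.3].sort_values(ascending=False))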
Based on our findings, we can narrow down the features that may be relevant and proceed with creating and training multiple models using one feature at a time.
We will begin by creating a separate model for each of the selected features and evaluating their performance using accuracy as a metric. In order to determine the optimal split percentage, we will experiment with different values of test_size. First, we need to split the data into training and validation sets.
# Split data into features 'X' and target variable 'y'
X = heart_clean_df.drop('HeartDisease', axis=1)
y = heart_clean_df['HeartDisease']
# Create list of selected features
sel_features = [
'Oldpeak',
'Sex_M',
'ExerciseAngina_Y',
'ST_Slope_Flat',
'ST_Slope_Up'
]
# Create list of test sizes to experiment with
test_sizes = [0.10, 0.15, 0.20, 0.25]
# Create dictionary of dictionaries to store results
results = {}
# Loop over test sizes
for test_size in test_sizes:
    # Split data into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                       test_size=test_size,
                                                       random_state=42)
    # Create dictionary to store results for current test size
    test_results = {}
    # Loop over selected features
    for feature in sel_features:
        # Fit knn classifier with the current feature
        knn = KNeighborsClassifier(n_neighbors=3)
        knn.fit(X_train[[feature]], y_train)
        # Evaluate accuracy on validation set
        accuracy = knn.score(X_val[[feature]], y_val)
        # Store accuracy in test_results dictionary
        test_results[feature] = round(accuracy, 4)
    # Store results from current test size in main dictionary
    results[test_size] = test_results
To gain a better understanding of our findings, we can create a DataFrame from the results dictionary. This will allow us to easily view the accuracy scores for each feature and test size combination.
# Create list to store dictionaries
results_list = []
# Loop over test size values
for size, features in results.items():
    # Loop over features for current test size
    for feature, accuracy in features.items():
        # Create dictionary with test size, feature and accuracy keys
        result_dict = {'size': size, 'feature': feature, 'accuracy': accuracy}
        # Append to results list
        results_list.append(result_dict)
# Convert list of dictionaries to DataFrame
test_size_df = pd.DataFrame(results_list)
test_size_df
| | size | feature | accuracy |
|---|---|---|---|
0 | 0.10 | Oldpeak | 0.6848 |
1 | 0.10 | Sex_M | 0.4130 |
2 | 0.10 | ExerciseAngina_Y | 0.6630 |
3 | 0.10 | ST_Slope_Flat | 0.5870 |
4 | 0.10 | ST_Slope_Up | 0.7935 |
5 | 0.15 | Oldpeak | 0.7029 |
6 | 0.15 | Sex_M | 0.4058 |
7 | 0.15 | ExerciseAngina_Y | 0.6522 |
8 | 0.15 | ST_Slope_Flat | 0.7319 |
9 | 0.15 | ST_Slope_Up | 0.7899 |
10 | 0.20 | Oldpeak | 0.6467 |
11 | 0.20 | Sex_M | 0.6413 |
12 | 0.20 | ExerciseAngina_Y | 0.6576 |
13 | 0.20 | ST_Slope_Flat | 0.7500 |
14 | 0.20 | ST_Slope_Up | 0.7989 |
15 | 0.25 | Oldpeak | 0.6174 |
16 | 0.25 | Sex_M | 0.6565 |
17 | 0.25 | ExerciseAngina_Y | 0.5739 |
18 | 0.25 | ST_Slope_Flat | 0.7435 |
19 | 0.25 | ST_Slope_Up | 0.8000 |
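Before summarizing, we can pivot the results table to make side-by-side comparison easier (an optional view built from the test_size_df defined above):

# Features as rows, test sizes as columns
print(test_size_df.pivot(index='feature', columns='size', values='accuracy'))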
Based on the results, the following findings can be made:

- ST_Slope_Up consistently shows the highest accuracy across all test sizes, while Sex_M consistently shows the lowest accuracy.
- The best split depends on the feature: when using the Oldpeak feature, the optimal test size is 0.15, while when using the ST_Slope_Up feature, the optimal test size is 0.25.
- The results also depend on the number of neighbors (n_neighbors) used in the KNN classifier. Therefore, to obtain reliable results, it is recommended to run the analysis multiple times with different values of n_neighbors and compare the results.

We have decided to proceed with a test size of 0.15 for our experiments. Our next step will be to explore different values of n_neighbors in order to determine the optimal value for our model.
# Split data into training (85%) and validation (15%)
X_train, X_val, y_train, y_val = train_test_split(X, y,
test_size=0.15,
random_state=42)
# List of k values (n_neighbors)
k_values = [1, 3, 5, 7, 10]
# Create dictionary of dictionaries to store results
results = {}
# Loop over k values
for value in k_values:
    # Create dictionary to store results for current k value
    test_results = {}
    # Loop over selected features
    for feature in sel_features:
        # Fit knn classifier with the current feature
        knn = KNeighborsClassifier(n_neighbors=value)
        knn.fit(X_train[[feature]], y_train)
        # Evaluate accuracy on validation set
        accuracy = knn.score(X_val[[feature]], y_val)
        # Store accuracy in test_results dictionary
        test_results[feature] = round(accuracy, 4)
    # Store results from current k value in main dictionary
    results[value] = test_results
# Create list to store dictionaries
results_list = []
# Loop over k values
for k, features in results.items():
    # Loop over features for current k value
    for feature, accuracy in features.items():
        # Create dictionary with k, feature and accuracy keys
        result_dict = {'n_neighbor': k, 'feature': feature, 'accuracy': accuracy}
        # Append to results list
        results_list.append(result_dict)
# Convert list of dictionaries to DataFrame
k_values_df = pd.DataFrame(results_list)
k_values_df
| | n_neighbor | feature | accuracy |
|---|---|---|---|
0 | 1 | Oldpeak | 0.6594 |
1 | 1 | Sex_M | 0.6522 |
2 | 1 | ExerciseAngina_Y | 0.6522 |
3 | 1 | ST_Slope_Flat | 0.7319 |
4 | 1 | ST_Slope_Up | 0.7899 |
5 | 3 | Oldpeak | 0.7029 |
6 | 3 | Sex_M | 0.4058 |
7 | 3 | ExerciseAngina_Y | 0.6522 |
8 | 3 | ST_Slope_Flat | 0.7319 |
9 | 3 | ST_Slope_Up | 0.7899 |
10 | 5 | Oldpeak | 0.7101 |
11 | 5 | Sex_M | 0.6522 |
12 | 5 | ExerciseAngina_Y | 0.6522 |
13 | 5 | ST_Slope_Flat | 0.7319 |
14 | 5 | ST_Slope_Up | 0.7899 |
15 | 7 | Oldpeak | 0.7029 |
16 | 7 | Sex_M | 0.6522 |
17 | 7 | ExerciseAngina_Y | 0.6522 |
18 | 7 | ST_Slope_Flat | 0.7319 |
19 | 7 | ST_Slope_Up | 0.7899 |
20 | 10 | Oldpeak | 0.6957 |
21 | 10 | Sex_M | 0.6522 |
22 | 10 | ExerciseAngina_Y | 0.6522 |
23 | 10 | ST_Slope_Flat | 0.7319 |
24 | 10 | ST_Slope_Up | 0.7899 |
Based on our experiments, we have found that the accuracy of the k-Nearest Neighbors model varies with different values of n_neighbors. However, the improvement in accuracy is not significant for some values of n_neighbors. Therefore, we recommend selecting the default value of k, which is 5, for n_neighbors.

As we can see from the table, the accuracy scores for most values of n_neighbors are similar, indicating that increasing n_neighbors does not improve the model's performance significantly.
In our previous experiment, we got the highest accuracy with ST_Slope_Up as our feature. But what if we trained a model on all of the selected features together?

Let's combine all these features and train a model on them to see if we can achieve higher accuracy than before. However, before doing so, we need to normalize our data because the features are not yet within the same range. Note: we will scale only our selected features, to the range (0, 1).
After conducting two experiments, we have decided to select the following hyperparameters for our k-NN model:

- test_size = 0.15
- n_neighbors = 5 (default value)

# Split data for training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X[sel_features], y,
test_size=0.15,
random_state=42)
# Instantiate scaler
scaler = MinMaxScaler()
# Fit and transform selected features in the training set
X_train_scaled = scaler.fit_transform(X_train)
# Transform same features in the validation set
X_val_scaled = scaler.transform(X_val)
# Fit the model on scaled features (using default n_neighbors value)
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
# Evaluate the model on scaled features
accuracy = knn.score(X_val_scaled, y_val)
print(f'Model accuracy: {accuracy*100:.2f}%')
Model accuracy: 75.36%
In the previous experiment, an accuracy of 78.99% was achieved with the feature ST_Slope_Up using a 15% test size split. However, when all selected features were scaled and used together to train the model, the accuracy decreased to 75.36%. This suggests that the selected features have varying levels of importance and may not contribute equally to the performance of the model.
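One optional way to probe how much each feature contributes is permutation importance on the validation split. This is a sketch rather than part of the original workflow, and it assumes the fitted knn, X_val_scaled, and y_val from the previous cell are still in scope:

from sklearn.inspection import permutation_importance
# Shuffle each feature in turn and measure the average drop in validation accuracy
perm = permutation_importance(knn, X_val_scaled, y_val, n_repeats=10, random_state=42)
for name, drop in zip(sel_features, perm.importances_mean):
    print(f'{name}: {drop:.4f}')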
We can now experiment with the hyperparameters of our model to improve its performance. To do so, we'll utilize GridSearchCV, which allows us to search over combinations of hyperparameter values for our classifier in scikit-learn. For example, for the KNeighborsClassifier, we can choose to experiment with the following hyperparameters:

- n_neighbors
- weights
- metric
- and more.

It's worth noting that GridSearchCV creates validation sets internally through cross-validation, which means we'll have to update our workflow accordingly.
# Split data into training (85%) and test (15%)
X_train, X_test, y_train, y_test = train_test_split(X[sel_features], y,
test_size=0.15,
random_state=42)
# List of selected features
sel_features = [
'Oldpeak',
'Sex_M',
'ExerciseAngina_Y',
'ST_Slope_Flat',
'ST_Slope_Up'
]
# Instantiate scaler
scaler = MinMaxScaler()
# Scale training set
X_train_scaled = scaler.fit_transform(X_train)
We will conduct a grid search to explore the following hyperparameter values:

- n_neighbors: the range 1 to 20.
- weights: uniform and distance.
- metric: minkowski (the default) and manhattan.

Let's see how these hyperparameters can improve the performance of our model.
# Define the parameter grid for GridSearchCV
params = {'n_neighbors': list(range(1,21)),
'weights': ['uniform', 'distance'],
'metric': ['minkowski', 'manhattan']}
# Instantiate model
knn = KNeighborsClassifier()
# Create GridSearchCV instance
knn_grid = GridSearchCV(estimator=knn, param_grid=params, scoring='accuracy')
# Fit the instance (knn_grid)
knn_grid.fit(X_train_scaled, y_train)
GridSearchCV(estimator=KNeighborsClassifier(), param_grid={'metric': ['minkowski', 'manhattan'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 'weights': ['uniform', 'distance']}, scoring='accuracy')
Let's retrieve the best accuracy and the corresponding parameters.
# Get best score and best parameters
best_score = knn_grid.best_score_
best_params = knn_grid.best_params_
print(f'Best score: {best_score*100:0.2f}%')
print(f'Best parameters: {best_params}')
Best score: 85.64%
Best parameters: {'metric': 'minkowski', 'n_neighbors': 8, 'weights': 'uniform'}
The GridSearchCV results indicate that the best KNN classifier achieved a mean cross-validation accuracy of 85.64% with the minkowski distance metric, 8 neighbors, and uniform weighting. These hyperparameters proved to be effective for this classification task, showing a clear improvement over our previous experiments. Overall, these results suggest that the KNN classifier is a promising approach for this task.
Next, we will evaluate our model's performance on the test set. Remember, we must normalize our test set and utilize the same scaler that was created previously. However, we will only transform the test set this time and not fit the scaler on it.
Thankfully, scikit-learn provides us with an efficient way to use the model obtained from GridSearchCV via the best_estimator_ attribute. We can apply it directly to make predictions on the test set.
# Scale same features from the test set
X_test_scaled = scaler.transform(X_test)
# Get the best estimator to make predictions
best_model = knn_grid.best_estimator_
# Make predictions on test set
predictions = best_model.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'{accuracy*100:.2f}%')
74.64%
After evaluating the KNN classifier on a separate test set, the accuracy dropped to 74.64%. This suggests that the model may have overfit to the training set, and that its performance on unseen data is not as good as the cross-validation estimate. We scaled the test set with the same scaler fitted on the training set and made predictions with the best estimator obtained from GridSearchCV; the lower accuracy score indicates that further optimization or a different modeling approach may be necessary to improve performance on this classification task.
random_state is a parameter in train_test_split that sets a specific seed for the random number generator, which allows for reproducible results when generating random data splits. To optimize our model's performance, we will run GridSearchCV with various random_state values and see which split yields the hyperparameters with the best cross-validated accuracy on the training set.

There is no universally agreed upon value for random_state, as results vary based on the dataset and modeling task. In our experiment, we will try other common values for random_state, such as 0, 415, 417, 419, and 2023, since we have already tried 42.
We will use the same data split and the following hyperparameter grid for GridSearchCV:

- Training set (85%) and test set (15%)
- n_neighbors: 1 to 20
- weights: 'uniform' and 'distance'
- metric: 'minkowski' and 'manhattan'

Note: the X_test and y_test variables are not used in this experiment because they are not needed for finding the optimal hyperparameters on the training set.
# Create an empty dictionary to store the results
results = {}
# Split the data into training and testing sets using different random_state values
random_states = [0, 415, 417, 419, 2023]
# Loop over random_states values
for state in random_states:
    # Split the data, keeping only the training portion
    X_train, _, y_train, _ = train_test_split(X[sel_features], y, test_size=0.15, random_state=state)
    # Instantiate scaler
    scaler = MinMaxScaler()
    # Rescale training features
    X_train_scaled = scaler.fit_transform(X_train)
    # Create a KNN classifier
    knn = KNeighborsClassifier()
    # Define the parameter grid for GridSearchCV
    params = {'n_neighbors': list(range(1, 21)), 'weights': ['uniform', 'distance'], 'metric': ['minkowski', 'manhattan']}
    # Create GridSearchCV object
    knn_grid = GridSearchCV(estimator=knn, param_grid=params, scoring='accuracy')
    # Fit the model to the training data
    knn_grid.fit(X_train_scaled, y_train)
    # Get best score and best parameters
    best_score = knn_grid.best_score_
    best_params = knn_grid.best_params_
    # Store the results in the dictionary
    results[f'random_state_{state}'] = {'best_score': round(best_score*100, 2), 'best_params': best_params}
# Convert results dictionary to a dataframe
results_df = pd.DataFrame.from_dict(results, orient='index')
results_df
| | best_score | best_params |
|---|---|---|
random_state_0 | 85.38 | {'metric': 'minkowski', 'n_neighbors': 14, 'weights': 'uniform'} |
random_state_415 | 83.72 | {'metric': 'minkowski', 'n_neighbors': 9, 'weights': 'uniform'} |
random_state_417 | 84.62 | {'metric': 'minkowski', 'n_neighbors': 12, 'weights': 'uniform'} |
random_state_419 | 83.72 | {'metric': 'minkowski', 'n_neighbors': 14, 'weights': 'uniform'} |
random_state_2023 | 83.46 | {'metric': 'manhattan', 'n_neighbors': 19, 'weights': 'uniform'} |
Based on the experiment results, we can see that the highest accuracy achieved on the training set in this experiment was 85.38%, obtained with the hyperparameters {'metric': 'minkowski', 'n_neighbors': 14, 'weights': 'uniform'} and a random state of 0.
To evaluate the performance of the model with the optimal hyperparameters on the test set, we'll follow these steps: split the data with random_state=0, scale the training and test features with the same MinMaxScaler, fit a KNN classifier with the chosen hyperparameters on the training set, and measure its accuracy on the test set.
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X[sel_features], y,
test_size=0.15,
random_state=0)
# Instantiate MinMax scaler
scaler = MinMaxScaler()
# Rescale training and testing features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build model
knn = KNeighborsClassifier(n_neighbors=14, weights='uniform', metric='minkowski')
# Fit the model with training set
knn.fit(X_train_scaled, y_train)
# Make model predictions on test data
predictions = knn.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Model accuracy on test set: {accuracy*100:.2f}%')
Model accuracy on test set: 81.88%
Based on the results, the model is evaluated on a previously unseen test set, resulting in an accuracy of 81.88%. This means that the model is able to correctly classify 81.88% of the instances in the test set. While this accuracy is slightly lower than the training accuracy, it is still a reasonably good result, and the model can be considered to be generalizing well to new data. However, it is worth noting that the model's performance may vary depending on the specific test set used, and it is always important to evaluate the model on multiple test sets to get a more robust estimate of its performance.
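One way to obtain such a more robust estimate is k-fold cross-validation over the selected features. This is an optional sketch, not part of the original workflow; note that scaling before splitting into folds leaks a little information, which a Pipeline would avoid:

from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the chosen configuration
knn_cv = KNeighborsClassifier(n_neighbors=14, weights='uniform', metric='minkowski')
X_all_scaled = MinMaxScaler().fit_transform(X[sel_features])
scores = cross_val_score(knn_cv, X_all_scaled, y, cv=5, scoring='accuracy')
print(f'Mean CV accuracy: {scores.mean()*100:.2f}% (+/- {scores.std()*100:.2f})')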
Our final model was developed by training it on a selected set of features:

- Oldpeak
- Sex_M
- ExerciseAngina_Y
- ST_Slope_Flat
- ST_Slope_Up

The model achieved a test set accuracy of 81.88%. However, it's worth noting that this accuracy might not be indicative of a well-performing model given the limitations of the data.
Moving forward, there are several steps we could take to improve our results.
Based on the insights gained from this project, we have developed a predictive model for heart disease detection using the provided dataset. Our EDA process helped us identify and address some data cleaning issues, such as the imputation of zero values. After selecting relevant features through correlation analysis, we experimented with different data splits, feature subsets, and hyperparameters, ultimately achieving a test accuracy of 81.88% with a KNN classifier.
This project holds potential in developing a predictive model for heart disease detection using the provided dataset. However, it is important to note some limitations and potential drawbacks before implementing this model in a real-world healthcare setting.
Pros of using this model in a real-world healthcare setting:
Cons of using this model in a real-world healthcare setting:
In conclusion, our predictive model has shown promise in identifying patients at risk of heart disease. However, it is important to consider its limitations and potential drawbacks before implementing it in a real-world healthcare setting. Continuous validation and monitoring will be necessary to ensure its continued accuracy and usefulness.