
In this project, we develop a model that predicts the market price of a car from its characteristics, using the k-nearest neighbours (KNN) algorithm.

We use a dataset from the UCI Machine Learning Repository, compiled in 1985 from car guides and insurance information.

STEP 1: Reading + Cleaning the dataset

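library(dplyr)  # mutate, select, filter, group_by, summarize
library(tidyr)  # pivot_longer
library(caret)  # createDataPartition, trainControl, train, postResample
cars <- read.csv("")
colnames(cars) <- c('symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style',
                    'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
                    'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression', 'horsepower',
                    'peak_rpm', 'city_mpg', 'highway_mpg', 'price')
str(cars)  # inspect the column types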
## 'data.frame':    204 obs. of  26 variables:
##  $ symboling        : int  3 1 2 2 2 1 1 1 0 2 ...
##  $ normalized_losses: chr  "?" "?" "164" "164" ...
##  $ make             : chr  "alfa-romero" "alfa-romero" "audi" "audi" ...
##  $ fuel_type        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration       : chr  "std" "std" "std" "std" ...
##  $ num_doors        : chr  "two" "two" "four" "four" ...
##  $ body_style       : chr  "convertible" "hatchback" "sedan" "sedan" ...
##  $ drive_wheels     : chr  "rwd" "rwd" "fwd" "4wd" ...
##  $ engine_location  : chr  "front" "front" "front" "front" ...
##  $ wheel_base       : num  88.6 94.5 99.8 99.4 99.8 ...
##  $ length           : num  169 171 177 177 177 ...
##  $ width            : num  64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 64.8 ...
##  $ height           : num  48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 54.3 ...
##  $ curb_weight      : int  2548 2823 2337 2824 2507 2844 2954 3086 3053 2395 ...
##  $ engine_type      : chr  "dohc" "ohcv" "ohc" "ohc" ...
##  $ num_cylinders    : chr  "four" "six" "four" "five" ...
##  $ engine_size      : int  130 152 109 136 136 136 136 131 131 108 ...
##  $ fuel_system      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ bore             : chr  "3.47" "2.68" "3.19" "3.19" ...
##  $ stroke           : chr  "2.68" "3.47" "3.40" "3.40" ...
##  $ compression      : num  9 9 10 8 8.5 8.5 8.5 8.3 7 8.8 ...
##  $ horsepower       : chr  "111" "154" "102" "115" ...
##  $ peak_rpm         : chr  "5000" "5000" "5500" "5500" ...
##  $ city_mpg         : int  21 19 24 18 19 19 19 17 16 23 ...
##  $ highway_mpg      : int  27 26 30 22 25 25 25 20 22 29 ...
##  $ price            : chr  "16500" "16500" "13950" "17450" ...

Based on our findings, we see the following:

  1. numeric: symboling, wheel_base, length, width, height, curb_weight, engine_size, compression, city_mpg, highway_mpg

  2. characters: normalized_losses, make, fuel_type, aspiration, num_doors, body_style, drive_wheels, engine_location, engine_type, num_cylinders, fuel_system, bore, stroke, horsepower, peak_rpm, price

Several of these character columns actually hold numbers (with "?" marking a missing value), so we need to convert them to numeric variables and then build a numeric-only dataframe.

# as.numeric() turns the "?" placeholders into NA (with a coercion warning)
cars <- cars %>% 
          mutate(
            price = as.numeric(price),
            normalized_losses = as.numeric(normalized_losses),
            bore = as.numeric(bore),
            stroke = as.numeric(stroke), 
            horsepower = as.numeric(horsepower), 
            peak_rpm = as.numeric(peak_rpm)
          )
numeric_only = cars %>%  
  select(-make, -fuel_type, -aspiration, -num_doors, -body_style, -drive_wheels, -engine_location, -engine_type, -fuel_system, -num_cylinders)

na_counts = numeric_only %>% is.na() %>% colSums()  # number of missing values per column

Now that we have a numeric-only dataset, we need to handle its missing entries.

There are two approaches that we can use: dropping every row that has a missing value (a complete-cases dataset), or imputing the missing values.

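# The filter conditions were lost from the source; dropping rows with an NA in any
# of the six converted columns is an assumption.
cars_numeric_only_1 = numeric_only %>% 
              filter(!is.na(price)) %>% 
              filter(!is.na(normalized_losses)) %>%
              filter(!is.na(bore)) %>%
              filter(!is.na(stroke)) %>%
              filter(!is.na(horsepower)) %>%
              filter(!is.na(peak_rpm))
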
With this first approach, we would essentially drop 22% of the original dataset.

Most of the missing values come from normalized_losses, whilst each of the remaining columns is missing only about 1%-2% of its values.

# Because the mean and median are similar, the distribution is likely to be roughly symmetric. Thus imputation with the mean would likely be acceptable.

# Looking at the distribution of horsepower, it appears to be right-skewed.

For the purpose of building our predictive model, we will use the imputed dataframe going forward.
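
The imputation code itself is not shown, so the following is only a minimal base R sketch of mean imputation. The toy dataframe and the helper name `impute_mean` are assumptions made for illustration; for a clearly skewed column, `median()` could be swapped in for `mean()`.

```r
# Toy stand-in for numeric_only: two columns with missing entries
toy <- data.frame(
  horsepower = c(111, 154, NA, 115, 110),
  bore       = c(3.47, 2.68, 3.19, NA, 3.19)
)

# Hypothetical helper: replace each NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper column by column
toy_imputed <- as.data.frame(lapply(toy, impute_mean))
colSums(is.na(toy_imputed))  # per-column count of remaining missing values
```

Applying the same helper to every column keeps all rows, at the cost of slightly shrinking each imputed column's variance.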

Based on the measures of central tendency for sale price, car sale prices appear to follow a right-skewed distribution.
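
The rule used to label the `outliers` column is not shown. The sketch below assumes the common 1.5 × IQR rule; the helper name `flag_outliers` is hypothetical, and the prices are a handful of values picked from the dataset for illustration.

```r
# Hypothetical helper: label values lying more than 1.5 * IQR beyond the quartiles
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  ifelse(x < q[[1]] - fence | x > q[[2]] + fence, "outlier", "not")
}

# A few prices, including one very expensive car
prices <- c(5151, 6295, 7957, 9095, 10345, 12964, 16500, 18920, 41315)
data.frame(price = prices, outliers = flag_outliers(prices))
```

With a right-skewed variable such as price, this rule tends to flag only the expensive tail, as there is little room for unusually cheap values below the lower fence.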

##     highway_mpg price outliers
## 2            26 16500      not
## 3            30 13950      not
## 4            22 17450      not
## 5            25 15250      not
## 6            25 17710      not
## 7            25 18920      not
## 8            20 23875      not
## 9            29 16430      not
## 10           29 16925      not
## 11           28 20970      not
## 12           28 21105      not
## 13           25 24565      not
## 14           22 30760  outlier
## 15           22 41315  outlier
## 16           20 36880  outlier
## 17           53  5151      not
## 18           43  6295      not
## 19           43  6575      not
## 20           41  5572      not
## 21           38  6377      not
## 22           30  7957      not
## 23           38  6229      not
## 24           38  6692      not
## 25           38  7609      not
## 26           30  8558      not
## 27           30  8921      not
## 28           24 12964      not
## 29           54  6479      not
## 30           38  6855      not
## 31           42  5399      not
## 32           34  6529      not
## 33           34  7129      not
## 34           34  7295      not
## 35           34  7295      not
## 36           33  7895      not
## 37           33  9095      not
## 38           33  8845      not
## 39           33 10295      not
## 40           28 12945      not
## 41           31 10345      not
## 42           29  6785      not
## 43           29 11048      not
## 44           19 32250  outlier
## 45           19 35550  outlier
## 46           17 36000  outlier
## 47           31  5195      not
## 48           38  6095      not
## 49           38  6795      not
## 50           38  6695      not
## 51           38  7395      not
## 52           23 10945      not
## 53           23 11845      not
## 54           23 13645      not
## 55           23 15645      not
## 56           32  8845      not
## 57           32  8495      not
## 58           32 10595      not
## 59           32 10245      not
## 60           42 10795      not
## 61           32 11245      not
## 62           27 18280      not
## 63           39 18344      not
## 64           25 25552      not
## 65           25 28248      not
## 66           25 28176      not
## 67           25 31600  outlier
## 68           18 34184  outlier
## 69           18 35056  outlier
## 70           16 40960  outlier
## 71           16 45400  outlier
## 72           24 16503      not
## 73           41  5389      not
## 74           38  6189      not
## 75           38  6669      not
## 76           30  7689      not
## 77           30  9959      not
## 78           32  8499      not
## 79           24 12629      not
## 80           24 14869      not
## 81           24 14489      not
## 82           32  6989      not
## 83           32  8189      not
## 84           30  9279      not
## 85           30  9279      not
## 86           37  5499      not
## 87           50  7099      not
## 88           37  6649      not
## 89           37  6849      not
## 90           37  7349      not
## 91           37  7299      not
## 92           37  7799      not
## 93           37  7499      not
## 94           37  7999      not
## 95           37  8249      not
## 96           34  8949      not
## 97           34  9549      not
## 98           22 13499      not
## 99           22 14399      not
## 100          25 13499      not
## 101          25 17199      not
## 102          23 19699      not
## 103          25 18399      not
## 104          24 11900      not
## 105          33 13200      not
## 106          24 12440      not
## 107          25 13860      not
## 108          24 15580      not
## 109          33 16900      not
## 110          24 16695      not
## 111          25 17075      not
## 112          24 16630      not
## 113          33 17950      not
## 114          24 18150      not
## 115          41  5572      not
## 116          30  7957      not
## 117          38  6229      not
## 118          38  6692      not
## 119          38  7609      not
## 120          30  8921      not
## 121          24 12764      not
## 122          27 22018      not
## 123          25 32528  outlier
## 124          25 34028  outlier
## 125          25 37028  outlier
## 126          31  9295      not
## 127          31  9895      not
## 128          28 11850      not
## 129          28 12170      not
## 130          28 15040      not
## 131          28 15510      not
## 132          26 18150      not
## 133          26 18620      not
## 134          36  5118      not
## 135          31  7053      not
## 136          31  7603      not
## 137          37  7126      not
## 138          33  7775      not
## 139          32  9960      not
## 140          25  9233      not
## 141          29 11259      not
## 142          32  7463      not
## 143          31 10198      not
## 144          29  8013      not
## 145          23 11694      not
## 146          39  5348      not
## 147          38  6338      not
## 148          38  6488      not
## 149          37  6918      not
## 150          32  7898      not
## 151          32  8778      not
## 152          37  6938      not
## 153          37  7198      not
## 154          36  7898      not
## 155          47  7788      not
## 156          47  7738      not
## 157          34  8358      not
## 158          34  9258      not
## 159          34  8058      not
## 160          34  8238      not
## 161          29  9298      not
## 162          29  9538      not
## 163          30  8449      not
## 164          30  9639      not
## 165          30  9989      not
## 166          30 11199      not
## 167          30 11549      not
## 168          30 17669      not
## 169          34  8948      not
## 170          33 10698      not
## 171          32  9988      not
## 172          32 10898      not
## 173          32 11248      not
## 174          24 16558      not
## 175          24 15998      not
## 176          24 15690      not
## 177          24 15750      not
## 178          46  7775      not
## 179          34  7975      not
## 180          46  7995      not
## 181          34  8195      not
## 182          34  8495      not
## 183          42  9495      not
## 184          32  9995      not
## 185          29 11595      not
## 186          29  9980      not
## 187          24 13295      not
## 188          38 13845      not
## 189          31 12290      not
## 190          28 12940      not
## 191          28 13415      not
## 192          28 15985      not
## 193          28 16515      not
## 194          22 18420      not
## 195          22 18950      not
## 196          28 16845      not
## 197          25 19045      not
## 198          23 21485      not
## 199          27 22470      not
## 200          25 22625      not
summarization = pricy_cars %>% 
  select(outliers, price, highway_mpg, city_mpg, peak_rpm, horsepower, engine_size, curb_weight) %>%
  group_by(outliers) %>%
  summarize(
    mean_price = mean(price), sd_price = sd(price), median_price = median(price), 
    mean_highway = mean(highway_mpg), sd_highway = sd(highway_mpg), median_highway = median(highway_mpg), 
    mean_city = mean(city_mpg), sd_city = sd(city_mpg), median_city = median(city_mpg),
    mean_rpm = mean(peak_rpm), sd_rpm = sd(peak_rpm), median_rpm = median(peak_rpm),
    mean_horsepower = mean(horsepower), sd_horsepower = sd(horsepower), median_horsepower = median(horsepower),
    mean_engine = mean(engine_size), sd_engine = sd(engine_size), median_engine = median(engine_size),
    mean_weight = mean(curb_weight), sd_weight = sd(curb_weight), median_weight = median(curb_weight)
  )

## # A tibble: 2 × 22
##   outliers mean_price sd_price median_…¹ mean_…² sd_hi…³ media…⁴ mean_…⁵ sd_city
##   <chr>         <dbl>    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 not          11492.    4991.     9960.    31.5    6.39    31      25.9    6.10
## 2 outlier      35967.    4153.    35303     20.5    3.46    19.5    15.9    2.13
## # … with 13 more variables: median_city <dbl>, mean_rpm <dbl>, sd_rpm <dbl>,
## #   median_rpm <dbl>, mean_horsepower <dbl>, sd_horsepower <dbl>,
## #   median_horsepower <dbl>, mean_engine <dbl>, sd_engine <dbl>,
## #   median_engine <dbl>, mean_weight <dbl>, sd_weight <dbl>,
## #   median_weight <dbl>, and abbreviated variable names ¹​median_price,
## #   ²​mean_highway, ³​sd_highway, ⁴​median_highway, ⁵​mean_city

These outlier vehicles were found to have:

  1. markedly lower highway and city fuel economy (MPG), i.e. higher fuel consumption
  2. markedly greater horsepower, engine size and weight

However, on closer inspection, many of these vehicles come from luxury brands (judging by the names of the automakers), which likely command a price premium of their own.

As these factors are exactly what will drive sale price, and the trends are in line with those of the non-outlier vehicles, we will keep these outliers in our dataset.

STEP 3: Setting up our Model

PART A: Splitting up the dataset into a training set and testing set.

To leave enough data for fitting whilst still holding out enough for a meaningful error estimate, we will use an 80% training to 20% testing split.

set.seed(1) # For the sake of reproducibility 

train_indices = createDataPartition(y = clean_cars_numeric[['price']], 
                                    p =  0.8, 
                                    list = FALSE)

training_listings = clean_cars_numeric[train_indices, ] # Should be about 161
testing_listings = clean_cars_numeric[-train_indices,] # Should be about 40 
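
To make the split concrete, here is a minimal base R sketch of an 80/20 split on made-up data (the `toy` dataframe is an assumption). It uses plain random sampling, whereas `createDataPartition` additionally stratifies on the outcome so that both sets cover a similar range of prices.

```r
set.seed(1)  # for reproducibility

# Made-up listings standing in for clean_cars_numeric
toy <- data.frame(id = 1:200, price = round(runif(200, 5000, 45000)))

# Simple random split: 80% of the row indices go to training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
training  <- toy[train_idx, ]
testing   <- toy[-train_idx, ]

c(train = nrow(training), test = nrow(testing))
```

No row appears in both sets, which is what makes the testing set a fair check on a model fitted to the training set.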

PART B: Hyperparameter Tuning

For this process, we will be using the Grid Search method to tune hyperparameters in our model.

As our intention is to use the KNN approach, there is only one hyperparameter to tune (i.e. k), so our grid is simply a range of candidate values for k.

cv_folds = 15 #it's the sqrt(n) where n = 205
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
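
To show what this setup asks caret to do, the sketch below runs a grid search over k with k-fold cross-validation by hand, in base R. Everything in it is an assumption made for illustration: the synthetic data, the helper `knn_predict`, the 5 folds and the grid of 1 to 20. It is not how caret implements the search internally.

```r
set.seed(1)

# Synthetic stand-in data: y depends on two predictors plus noise
n <- 120
x <- cbind(a = rnorm(n), b = rnorm(n))
y <- 3 * x[, "a"] - 2 * x[, "b"] + rnorm(n, sd = 0.5)

# Hypothetical helper: KNN regression predicts the mean response of the k nearest rows
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # Euclidean distance to every training row
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Assign each row to one of 5 folds, then score every candidate k on every fold
folds <- sample(rep(1:5, length.out = n))
grid <- 1:20
cv_rmse <- sapply(grid, function(k) {
  fold_rmse <- sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    sqrt(mean((y[held_out] - pred)^2))
  })
  mean(fold_rmse)
})

best_k <- grid[which.min(cv_rmse)]  # the k with the lowest cross-validated RMSE
```

The synthetic predictors are already on the same scale; with real car data the predictors would need centring and scaling first, otherwise large-valued columns such as curb_weight would dominate the distances.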

PART C: Create the KNN Model

This will be the longest part, as it involves a series of experiments to work out the best approach to predicting car price. There are several ways to go about this.

FIRST: Rationalizing predictor selection

Multiple resources that have explored the topic of vehicle prices (both new and used) suggest that several factors within our dataframe may be ideal candidates as predictors: 1) city + highway mileage & 2) engine size

Looking past this, we can also speculate that horsepower + torque (represented in our dataframe by peak_rpm) may play a role, considering their impact on fuel consumption.

Furthermore, it can also be speculated that car size may play a role in vehicle pricing, given its relationship to insurance rates, the changes in the current landscape of the marketplace, and the influence of personal economics on car purchasing habits.

Thus, we will compare several different models:

  1. A model containing city MPG and highway MPG

  2. A model containing city MPG, highway MPG and engine size

  3. A model containing city MPG, highway MPG, engine size and horsepower

  4. A model containing city MPG, highway MPG, engine size, horsepower and torque

  5. A model containing city MPG, highway MPG, engine size, horsepower, torque and curb weight

  6. A model containing city MPG, highway MPG, engine size, horsepower, torque, curb weight and length

  7. A model containing city MPG, highway MPG, engine size, horsepower, torque, curb weight, length and width

  8. A model containing city MPG, highway MPG, engine size, horsepower, torque, curb weight, length, width and height
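
Since each model adds one predictor to the previous one, the eight formulas can also be generated rather than typed out. The following is a minimal sketch using `reformulate()` from base R; the names `predictors` and `model_formulas` are assumptions, and peak_rpm stands in for torque.

```r
# Predictors in the order they are added across the eight models
predictors <- c("city_mpg", "highway_mpg", "engine_size", "horsepower",
                "peak_rpm", "curb_weight", "length", "width", "height")

# Model 1 uses the first two predictors, model 8 uses all nine
model_formulas <- lapply(2:length(predictors), function(n) {
  reformulate(predictors[seq_len(n)], response = "price")
})
names(model_formulas) <- paste0("knn_model_", seq_along(model_formulas))

model_formulas$knn_model_3
```

Each formula could then be passed to `train()` inside a loop, which avoids copy-paste errors across eight near-identical calls.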

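# NOTE: data = clean_cars_numeric fits on every row, including the rows held out in
# testing_listings; pass training_listings instead to keep the test set unseen.
knn_model_1 = train(price ~ city_mpg + highway_mpg,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 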
knn_model_2 = train(price ~ city_mpg + highway_mpg + engine_size,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_3 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_4 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_5 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_6 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_7 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_8 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 

SECOND: Data Dredging Approach

This is essentially the “throw it against the wall and see what sticks” approach, where we go through each variable and see which is the best variable/parameter to add to our model to best predict car prices.

Whilst we could do so by manually adding each variable individually and seeing what sticks (i.e. one model has price ~ city_mpg, the other has price ~ horsepower, etc.), we can instead use stepwise regression to see which variables are retained based on our dataset.

We will retain the model based on AIC scores
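
As a small illustration of AIC-based stepwise selection, the sketch below uses `step()` from base R on the built-in `mtcars` data. Both are stand-ins chosen so that the example needs no extra packages; they are not the dataset or the exact functions used in this analysis.

```r
# mtcars ships with R; it stands in for clean_cars_numeric here
full_model <- lm(mpg ~ ., data = mtcars)

# Bi-directional search: at each step, add or drop the term that lowers AIC the most
reduced_model <- step(full_model, direction = "both", trace = 0)

formula(reduced_model)  # the predictors that were retained
c(full = extractAIC(full_model)[2], reduced = extractAIC(reduced_model)[2])
```

The search stops once no single addition or removal lowers the AIC any further, so the retained model is never worse than the full model by this criterion.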

stepwise_model = train(price ~., # includes all variables within the dataframe
                       data = clean_cars_numeric,
                       trControl = train_control, 
                       method = 'leapSeq',
                       preProcess = c('center', 'scale'),
                       tuneGrid = data.frame(nvmax = 1:15)) 
# nvmax is the tuning parameter: the maximum number of predictors to be incorporated into the model

# Another approach to using stepwise regression 
stepwise_model_A = train(price ~., 
                         data = clean_cars_numeric,
                         method = 'lmStepAIC',
                         trControl = train_control,
                         trace = FALSE)

# Another approach: bi-directional stepwise regression (stepAIC comes from the MASS package)
linear_model_all <- lm(price ~., data = clean_cars_numeric)
stepwise_model_B <- MASS::stepAIC(linear_model_all, direction = "both", trace = FALSE)

PART D: Evaluate the models.

Let’s see how these models turn out.

# The Educated-Guess Approach
testing = testing_listings %>% 
  mutate(
    model_one_prediction = predict(knn_model_1, newdata = testing_listings),
    model_two_prediction = predict(knn_model_2, newdata = testing_listings),
    model_three_prediction = predict(knn_model_3, newdata = testing_listings),
    model_four_prediction = predict(knn_model_4, newdata = testing_listings),
    model_five_prediction = predict(knn_model_5, newdata = testing_listings),
    model_six_prediction = predict(knn_model_6, newdata = testing_listings),
    model_seven_prediction = predict(knn_model_7, newdata = testing_listings),
    model_eight_prediction = predict(knn_model_8, newdata = testing_listings),
    sq_error_model_one = (price - model_one_prediction)^2,
    sq_error_model_two = (price - model_two_prediction)^2,
    sq_error_model_three = (price - model_three_prediction)^2,
    sq_error_model_four = (price - model_four_prediction)^2,
    sq_error_model_five = (price - model_five_prediction)^2,
    sq_error_model_six = (price - model_six_prediction)^2,
    sq_error_model_seven = (price - model_seven_prediction)^2,
    sq_error_model_eight = (price - model_eight_prediction)^2
  )

# Reshape to one row per (listing, model) pair so errors can be grouped by model
long_testing = testing %>% 
  pivot_longer(
    cols =  sq_error_model_one:sq_error_model_eight,
    names_to = 'model',
    values_to = 'sq_error'
  )

rmse_by_model = long_testing %>% 
                  group_by(model) %>%
                  summarize(rmse = sqrt(mean(sq_error)))

summed_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight, data = clean_cars_numeric)
summed_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length, data = clean_cars_numeric)
summed_model_c = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height, data = clean_cars_numeric)
predictions_model_a <- predict(summed_model_a, newdata = testing_listings)
predictions_model_b <- predict(summed_model_b, newdata = testing_listings)
predictions_model_c <- predict(summed_model_c, newdata = testing_listings)
postResample(pred = predictions_model_a, obs = testing_listings$price) # RMSE = 2759.495
##         RMSE     Rsquared          MAE 
## 2759.4948088    0.8652066 1962.9642552
postResample(pred = predictions_model_b, obs = testing_listings$price) # RMSE = 2752.390
##         RMSE     Rsquared          MAE 
## 2752.3901296    0.8659432 1950.4532659
postResample(pred = predictions_model_c, obs = testing_listings$price) # RMSE = 2563.864
##        RMSE    Rsquared         MAE 
## 2563.864202    0.885678 1845.171601
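
For reference, the three metrics reported by `postResample` can be computed by hand. In the sketch below the observed prices are a few values from the dataset, and the predictions are made up purely for illustration.

```r
obs  <- c(16500, 13950, 17450, 15250, 17710)  # observed prices
pred <- c(15800, 14500, 16900, 16000, 18200)  # made-up predictions

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- mean(abs(obs - pred))       # mean absolute error
rsq  <- cor(obs, pred)^2            # squared correlation between observed and predicted

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 3)
```

RMSE penalises large misses more heavily than MAE does, so a model with a few badly mispriced cars will show a wider gap between the two.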

Looking at the above findings, we see that the top 3 models that performed the 'best' (based on RMSE) are:

  1. KNN_model_5: uses the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm and curb_weight; the model seems to perform best with a single nearest neighbour (k = 1)
  2. KNN_model_6: uses the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm, curb_weight and length; the model seems to perform best with a single nearest neighbour (k = 1)
  3. KNN_model_8: uses the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm, curb_weight, length, width and height; the model seems to perform best with a single nearest neighbour (k = 1)

However, comparing the RMSE and the R-squared of each model, we see that Model # 8 had the highest score, explaining 82.05% of the variance in car prices.

# Stepwise Regression Approach
summarized_model = lm(price ~ city_mpg + peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)
summary(summarized_model)

## Call:
## lm(formula = price ~ city_mpg + peak_rpm + horsepower + compression + 
##     stroke + engine_size + height + width, data = clean_cars_numeric)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10996.2  -1684.7      3.9   1578.1  13813.7 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.668e+04  1.433e+04  -3.955 0.000108 ***
## city_mpg    -9.956e+01  7.581e+01  -1.313 0.190641    
## peak_rpm     2.337e+00  6.385e-01   3.660 0.000326 ***
## horsepower   3.947e+01  1.681e+01   2.348 0.019914 *  
## compression  3.159e+02  7.577e+01   4.169 4.64e-05 ***
## stroke      -2.742e+03  7.721e+02  -3.551 0.000483 ***
## engine_size  1.156e+02  1.346e+01   8.593 3.00e-15 ***
## height       1.645e+02  1.098e+02   1.498 0.135733    
## width        5.852e+02  1.978e+02   2.959 0.003476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3209 on 191 degrees of freedom
## Multiple R-squared:  0.8443, Adjusted R-squared:  0.8378 
## F-statistic: 129.5 on 8 and 191 DF,  p-value: < 2.2e-16
prediction_summarized_model_a = predict(summarized_model, newdata = testing_listings)

postResample(pred = prediction_summarized_model_a, obs = testing_listings$price) # RMSE = 2546.078
##         RMSE     Rsquared          MAE 
## 2546.0779941    0.8878266 1854.2849324

In identifying the number of predictors to include in the model, it appears as though the best model included 8 variables.

These 8 variables are: width, height, engine size, stroke, compression, horsepower, torque (represented by peak_rpm) and city fuel economy (city_mpg).

Looking at this model, it seems that it could explain about 83.78% of the variance in car prices (adjusted R-squared).

# Stepwise Regression Approach - Option B
summarize_model_A = lm(price ~ peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)
summary(summarize_model_A)
## Call:
## lm(formula = price ~ peak_rpm + horsepower + compression + stroke + 
##     engine_size + height + width, data = clean_cars_numeric)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11400.0  -1666.3    -17.2   1499.1  13720.2 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.828e+04  1.131e+04  -6.036 8.01e-09 ***
## peak_rpm     2.314e+00  6.395e-01   3.618 0.000379 ***
## horsepower   5.188e+01  1.393e+01   3.723 0.000259 ***
## compression  2.725e+02  6.832e+01   3.988 9.46e-05 ***
## stroke      -2.730e+03  7.735e+02  -3.529 0.000523 ***
## engine_size  1.122e+02  1.323e+01   8.483 5.81e-15 ***
## height       1.880e+02  1.086e+02   1.732 0.084954 .  
## width        6.990e+02  1.781e+02   3.924 0.000121 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared:  0.8429, Adjusted R-squared:  0.8372 
## F-statistic: 147.2 on 7 and 192 DF,  p-value: < 2.2e-16
prediction_summarize_model_a = predict(summarize_model_A, newdata = testing_listings)
postResample(pred = prediction_summarize_model_a, obs = testing_listings$price) # RMSE = 2523.841
##         RMSE     Rsquared          MAE 
## 2523.8409698    0.8904233 1851.9866798
# Stepwise Regression Approach - Option C
prediction_stepwise_model_B = predict(stepwise_model_B, newdata = testing_listings)
postResample(pred = prediction_stepwise_model_B, obs = testing_listings$price) # RMSE = 2523.841
##         RMSE     Rsquared          MAE 
## 2523.8409698    0.8904233 1851.9866798

In identifying the number of predictors to include in the model, it appears as though the best model here included 7 variables.

These 7 variables are: width, height, engine size, stroke, compression, horsepower and torque (represented by peak_rpm).

Looking at this model, it seems that it could explain about 83.72% of the variance in car prices (adjusted R-squared).

NOTE: it is interesting that whilst these last two models appear to perform better than the educated-guess approach, the model that includes city fuel economy (city_mpg) is marginally poorer at predicting car prices on the testing set (RMSE of about 2546 versus 2524).

Overall, looking at which car characteristics influence car prices, it seems as though the major players in our model are: car width, car height, engine size, engine stroke, engine compression, horsepower, torque (peak_rpm) and city fuel economy. It should be noted that further analysis should take into consideration variables that are categorical/nominal in nature, such as branding, body type, etc.

EXTRA: Analysis Using the complete-cases dataset.

STEP 1: Trying the educated-guess approach

train_indicesA = createDataPartition(y = cars_numeric_only_1[['price']], 
                                    p =  0.8, 
                                    list = FALSE)
cv_folds = 13 # it's the sqrt(n) where n = 160
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
testing_listingsA = cars_numeric_only_1[-train_indicesA,] # Should be about 32
training_listingsA = cars_numeric_only_1[train_indicesA, ] # Should be about 128
knn_model_1A = train(price ~ city_mpg + highway_mpg,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_2A = train(price ~ city_mpg + highway_mpg + engine_size,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_3A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_4A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_5A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_6A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_7A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_8A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
testingA = testing_listingsA %>% 
  mutate(
    model_one_prediction = predict(knn_model_1A, newdata = testing_listingsA),
    model_two_prediction = predict(knn_model_2A, newdata = testing_listingsA),
    model_three_prediction = predict(knn_model_3A, newdata = testing_listingsA),
    model_four_prediction = predict(knn_model_4A, newdata = testing_listingsA),
    model_five_prediction = predict(knn_model_5A, newdata = testing_listingsA),
    model_six_prediction = predict(knn_model_6A, newdata = testing_listingsA),
    model_seven_prediction = predict(knn_model_7A, newdata = testing_listingsA),
    model_eight_prediction = predict(knn_model_8A, newdata = testing_listingsA),
    sq_error_model_one = (price - model_one_prediction)^2,
    sq_error_model_two = (price - model_two_prediction)^2,
    sq_error_model_three = (price - model_three_prediction)^2,
    sq_error_model_four = (price - model_four_prediction)^2,
    sq_error_model_five = (price - model_five_prediction)^2,
    sq_error_model_six = (price - model_six_prediction)^2,
    sq_error_model_seven = (price - model_seven_prediction)^2,
    sq_error_model_eight = (price - model_eight_prediction)^2
  )
# Reshape to long format: one row per (listing, model) squared error
long_testingA = testingA %>% 
  pivot_longer(
    cols = sq_error_model_one:sq_error_model_eight,
    names_to = 'model',
    values_to = 'sq_error'
  )
rmse_by_modelA = long_testingA %>% 
                  group_by(model) %>%
                  summarize(rmse = sqrt(mean(sq_error)))
projected_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height, data = cars_numeric_only_1)
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower + 
##     peak_rpm + curb_weight + length + width + height, data = cars_numeric_only_1)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5520.5 -1294.0  -358.8  1153.5  7937.0 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.925e+04  1.454e+04  -4.762 4.48e-06 ***
## city_mpg     7.893e+01  1.477e+02   0.534   0.5939    
## highway_mpg -3.457e+01  1.388e+02  -0.249   0.8037    
## engine_size  3.578e+01  1.786e+01   2.004   0.0469 *  
## horsepower   1.550e+01  1.628e+01   0.952   0.3425    
## peak_rpm     8.516e-01  5.369e-01   1.586   0.1148    
## curb_weight  6.700e+00  1.493e+00   4.487 1.43e-05 ***
## length      -6.733e+01  4.307e+01  -1.563   0.1201    
## width        9.491e+02  2.231e+02   4.255 3.67e-05 ***
## height       4.569e+01  1.244e+02   0.367   0.7139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2436 on 150 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8274 
## F-statistic: 85.66 on 9 and 150 DF,  p-value: < 2.2e-16
projected_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length , data = cars_numeric_only_1)
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower + 
##     peak_rpm + curb_weight + length, data = cars_numeric_only_1)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7121.2 -1225.3   -57.8  1013.9  8107.0 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.173e+04  7.512e+03  -2.893  0.00438 ** 
## city_mpg     1.560e+02  1.539e+02   1.014  0.31235    
## highway_mpg -9.831e+01  1.447e+02  -0.679  0.49796    
## engine_size  4.234e+01  1.764e+01   2.401  0.01756 *  
## horsepower   1.285e+01  1.664e+01   0.772  0.44122    
## peak_rpm     9.262e-01  5.626e-01   1.646  0.10173    
## curb_weight  8.557e+00  1.422e+00   6.016 1.28e-08 ***
## length       6.698e-01  3.834e+01   0.017  0.98609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2565 on 152 degrees of freedom
## Multiple R-squared:  0.8171, Adjusted R-squared:  0.8086 
## F-statistic: 96.99 on 7 and 152 DF,  p-value: < 2.2e-16
predictions_model_1 <- predict(projected_model_a, newdata = testing_listingsA)
predictions_model_2 <- predict(projected_model_b, newdata = testing_listingsA)
postResample(pred = predictions_model_1, obs = testing_listingsA$price) # RMSE = 2050.448
##         RMSE     Rsquared          MAE 
## 2050.4475753    0.8531872 1605.1406928
postResample(pred = predictions_model_2, obs = testing_listingsA$price) # RMSE = 1924.004
##         RMSE     Rsquared          MAE 
## 1924.0041129    0.8665596 1477.4063047

Based on the findings above, the eighth model (all nine predictors) appeared to be the better-performing model by RMSE and R-squared. Fitted as a linear model on the complete-cases dataset, it explained about 82.74% of the variance in car prices (adjusted R-squared = 0.8274).

This model showed all of the selected variables as a better measure of
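
To make the reported metrics concrete, here is a minimal base R sketch of how RMSE, MAE, and R-squared can be computed by hand. The `obs` and `pred` vectors are made-up values for illustration, not data from this project, and taking R-squared as the squared correlation between observed and predicted values is an assumption about how `postResample` reports it by default.

```r
# Hypothetical observed and predicted prices (illustrative values only)
obs  <- c(10000, 12000, 15000, 20000)
pred <- c(11000, 11000, 16000, 19000)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- mean(abs(obs - pred))       # mean absolute error
r2   <- cor(obs, pred)^2            # squared correlation (assumed R-squared definition)

round(c(RMSE = rmse, Rsquared = r2, MAE = mae), 4)
```

RMSE and MAE are in the same units as price, so lower is better, while R-squared is unitless and higher is better.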

stepwise_model_A = train(price ~., 
                       data = cars_numeric_only_1,
                       trControl = train_control, 
                       method = 'leapSeq',
                       preProcess = c('center', 'scale'),
                       tuneGrid = data.frame(nvmax = 1:15))
summarized_model_A1 = lm(price ~ curb_weight, data = cars_numeric_only_1)
prediction_summarized_model_a1 = predict(summarized_model_A1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_a1, obs = testing_listingsA$price) # RMSE = 2055.530, R-squared = 0.8522
##         RMSE     Rsquared          MAE 
## 2055.5304142    0.8522134 1573.2709604
# second method
stepwise_model_B = train(price ~., 
                         data = cars_numeric_only_1,
                         method = 'lmStepAIC',
                         trControl = train_control,
                         trace = FALSE)
summary(stepwise_model_B$finalModel)
summarized_model_B1 = lm(price ~ wheel_base + length + width + curb_weight + engine_size + bore + stroke + compression + horsepower + peak_rpm, data = cars_numeric_only_1)
prediction_summarized_model_B1 = predict(summarized_model_B1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_B1, obs = testing_listingsA$price) # RMSE = 2103.202, R-squared = 0.8441
##         RMSE     Rsquared          MAE 
## 2103.2021951    0.8440939 1617.9408960

The stepwise regression analysis suggested that, within the complete-cases analysis, only curb_weight was retained as a significant predictor. The resulting model explained about 85.22% of the variance in car prices on the test listings (R-squared = 0.8522 in the output above).
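
As a rough illustration of what AIC-based stepwise selection does, the sketch below uses base R's `step()` on the built-in `mtcars` dataset. This is a stand-in, not the caret `leapSeq` or `lmStepAIC` calls used above, and the variables (`mpg` as the response, `full_fit`, `step_fit`) are chosen purely for illustration.

```r
# Start from a model with every predictor (mtcars is a built-in dataset)
full_fit <- lm(mpg ~ ., data = mtcars)

# Add or drop one term at a time, keeping a change only when it lowers AIC
step_fit <- step(full_fit, direction = "both", trace = 0)

formula(step_fit)            # predictors retained by the search
summary(step_fit)$r.squared  # in-sample R-squared of the reduced model
```

Because the search optimises an in-sample criterion, the reduced model should still be checked on held-out data, as is done with `postResample` in this section.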

Compared with the previous method, the KNN approach provided better price predictions within the complete-cases dataset.
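
A comparison of this kind can be sketched in base R as follows. This is a minimal example on the built-in `mtcars` data with a hand-written KNN regressor (the helper `knn_predict`, the 24/8 split, the seed, and the three predictors are all hypothetical choices), not a reproduction of the caret models above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
idx   <- sample(nrow(mtcars), 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
vars  <- c("wt", "hp", "disp")

# Centre and scale both sets using training-set statistics only
mu  <- colMeans(train[, vars])
sdv <- apply(train[, vars], 2, sd)
x_train <- scale(train[, vars], center = mu, scale = sdv)
x_test  <- scale(test[, vars],  center = mu, scale = sdv)

# Hypothetical helper: average the response of the k nearest training rows
knn_predict <- function(x_new, k = 3) {
  d <- sqrt(colSums((t(x_train) - x_new)^2))  # Euclidean distance to each training row
  mean(train$mpg[order(d)[1:k]])
}

knn_pred <- apply(x_test, 1, knn_predict)
lm_pred  <- predict(lm(mpg ~ wt + hp + disp, data = train), newdata = test)

# Compare both methods on the same held-out rows
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(knn = rmse(test$mpg, knn_pred), lm = rmse(test$mpg, lm_pred))
```

Scoring both methods on the same held-out rows with the same metric is what makes the comparison fair; which method wins depends on the data and the split.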