Introduction

In this project, we develop a model to predict the market price of a car from its characteristics, using the k-nearest neighbors (KNN) algorithm.

We will use a dataset from the UCI Machine Learning Repository, compiled in 1985 from automotive specification guides and insurance risk information.
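
If the data file is not already present locally, it can be downloaded directly from the repository; the URL below is assumed to follow the standard UCI layout for this dataset:

# Fetch the raw data file from UCI (URL assumed) if it is not already on disk
if (!file.exists("imports-85.data")) {
  download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                destfile = "imports-85.data")
}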

STEP 1: Reading + Cleaning the dataset

library(dplyr)
library(tidyr)
library(broom)
library(lattice)
library(ggplot2)
library(knitr)
library(aod)
library(caret)
library(stringr)
library(tidyverse)
library(purrr)
library(readr)
library(readxl)
library(magrittr)
cars <- read.csv("imports-85.data", header = FALSE) # the raw file has no header row
colnames(cars) <- c('symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style',
                    'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
                    'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression', 'horsepower',
                    'peak_rpm', 'city_mpg', 'highway_mpg', 'price')
str(cars)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 1 2 2 2 1 1 1 0 2 ...
##  $ normalized_losses: chr  "?" "?" "164" "164" ...
##  $ make             : chr  "alfa-romero" "alfa-romero" "audi" "audi" ...
##  $ fuel_type        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration       : chr  "std" "std" "std" "std" ...
##  $ num_doors        : chr  "two" "two" "four" "four" ...
##  $ body_style       : chr  "convertible" "hatchback" "sedan" "sedan" ...
##  $ drive_wheels     : chr  "rwd" "rwd" "fwd" "4wd" ...
##  $ engine_location  : chr  "front" "front" "front" "front" ...
##  $ wheel_base       : num  88.6 94.5 99.8 99.4 99.8 ...
##  $ length           : num  169 171 177 177 177 ...
##  $ width            : num  64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 64.8 ...
##  $ height           : num  48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 54.3 ...
##  $ curb_weight      : int  2548 2823 2337 2824 2507 2844 2954 3086 3053 2395 ...
##  $ engine_type      : chr  "dohc" "ohcv" "ohc" "ohc" ...
##  $ num_cylinders    : chr  "four" "six" "four" "five" ...
##  $ engine_size      : int  130 152 109 136 136 136 136 131 131 108 ...
##  $ fuel_system      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ bore             : chr  "3.47" "2.68" "3.19" "3.19" ...
##  $ stroke           : chr  "2.68" "3.47" "3.40" "3.40" ...
##  $ compression      : num  9 9 10 8 8.5 8.5 8.5 8.3 7 8.8 ...
##  $ horsepower       : chr  "111" "154" "102" "115" ...
##  $ peak_rpm         : chr  "5000" "5000" "5500" "5500" ...
##  $ city_mpg         : int  21 19 24 18 19 19 19 17 16 23 ...
##  $ highway_mpg      : int  27 26 30 22 25 25 25 20 22 29 ...
##  $ price            : chr  "16500" "16500" "13950" "17450" ...

Based on our findings, we see the following:

  1. numeric: symboling, wheel_base, length, width, height, curb_weight, engine_size, compression, city_mpg, highway_mpg

  2. characters: normalized_losses, make, fuel_type, aspiration, num_doors, body_style, drive_wheels, engine_location, engine_type, num_cylinders, fuel_system, bore, stroke, horsepower, peak_rpm, price

Looking at this, we need to convert several of these character columns to numeric and then build a numeric-only data frame.

cars <- cars %>% 
          mutate(
            price = as.numeric(price),
            normalized_losses = as.numeric(normalized_losses),
            bore = as.numeric(bore),
            stroke = as.numeric(stroke), 
            horsepower = as.numeric(horsepower), 
            peak_rpm= as.numeric(peak_rpm)
          )
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
numeric_only = cars %>%  
  select(-make, -fuel_type, -aspiration, -num_doors, -body_style, -drive_wheels, -engine_location, -engine_type, -fuel_system, -num_cylinders)

na_counts = numeric_only %>% is.na() %>% colSums() 

Now that we have a numeric-only dataset, we need to handle its missing entries.

There are two approaches that we can use.

# COMPLETE-CASES: 
cars_numeric_only_1 = numeric_only %>% 
              filter(!is.na(price)) %>% 
              filter(!is.na(normalized_losses)) %>%
              filter(!is.na(bore)) %>%
              filter(!is.na(stroke)) %>%
              filter(!is.na(horsepower)) %>%
              filter(!is.na(peak_rpm))
head(cars_numeric_only_1)
##   symboling normalized_losses wheel_base length width height curb_weight
## 1         2               164       99.8  176.6  66.2   54.3        2337
## 2         2               164       99.4  176.6  66.4   54.3        2824
## 3         1               158      105.8  192.7  71.4   55.7        2844
## 4         1               158      105.8  192.7  71.4   55.9        3086
## 5         2               192      101.2  176.8  64.8   54.3        2395
## 6         0               192      101.2  176.8  64.8   54.3        2395
##   engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1         109 3.19    3.4        10.0        102     5500       24          30
## 2         136 3.19    3.4         8.0        115     5500       18          22
## 3         136 3.19    3.4         8.5        110     5500       19          25
## 4         131 3.13    3.4         8.3        140     5500       17          20
## 5         108 3.50    2.8         8.8        101     5800       23          29
## 6         108 3.50    2.8         8.8        101     5800       23          29
##   price
## 1 13950
## 2 17450
## 3 17710
## 4 23875
## 5 16430
## 6 16925
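
For reference, the same complete-cases filter can be written more compactly with tidyr::drop_na(), which removes every row containing an NA; that is equivalent here because these six columns are the only ones with missing values (a minimal sketch):

# Equivalent one-liner: keep only rows with no missing values in any column
cars_numeric_only_1 = numeric_only %>% drop_na()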

Looking at this first approach, we see that we would be dropping roughly 22% of the original dataset.

As a result, complete-case deletion risks introducing bias into our future predictive model and reduces its statistical power.

# IMPUTATION:
# Assessment of rows with missing values in the dataset
na_logical = is.na(numeric_only)
na.dataframe = data.frame(na_logical)
na.dataframe = na.dataframe %>% mutate(total_na = rowSums(.[1:16]))
na_counts_pct = na_counts * 100 / nrow(na_logical)
na.df = data.frame(na_counts = na_counts, na_pct = na_counts_pct)
na.df = data.frame(t(na.df))

Looking at our findings, normalized_losses accounts for by far the most missing values (roughly 20% of rows), while each of the remaining affected variables is missing only about 1%-2%.

Examining the data, we found that:

  1. Rows with missing values for bore also had missing values for stroke
  2. Rows with missing values for horsepower also had missing values for peak_rpm
  3. Missing values for price were independent of the missing values for bore, stroke, horsepower and peak_rpm
  4. All of these cases also had missing values for normalized_losses
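
These co-missingness patterns can be confirmed with a quick cross-tabulation using the dplyr verbs already loaded (a minimal sketch):

# Count how often NAs co-occur across the affected columns
numeric_only %>%
  summarize(
    bore_and_stroke = sum(is.na(bore) & is.na(stroke)),
    hp_and_rpm      = sum(is.na(horsepower) & is.na(peak_rpm)),
    price_and_bore  = sum(is.na(price) & is.na(bore))
  )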

Since we are ultimately predicting price, we can first remove the rows with missing prices.

update_numeric_only = numeric_only %>% filter(!is.na(price))

As for how to handle the remaining missing values, let’s look at the summary statistics of these variables.

mean(update_numeric_only$bore, na.rm = TRUE)
## [1] 3.33
sd(update_numeric_only$bore, na.rm = TRUE)
## [1] 0.2713027
median(update_numeric_only$bore, na.rm = TRUE) 
## [1] 3.31
# The mean and median are similar, suggesting a roughly symmetric distribution; imputing with the mean should be acceptable.

mean(update_numeric_only$stroke, na.rm = TRUE)
## [1] 3.259847
sd(update_numeric_only$stroke, na.rm = TRUE)
## [1] 0.3173827
median(update_numeric_only$stroke, na.rm = TRUE) 
## [1] 3.29
# The mean and median are similar, suggesting a roughly symmetric distribution; imputing with the mean should be acceptable.

mean(update_numeric_only$peak_rpm, na.rm = TRUE)
## [1] 5118.182
sd(update_numeric_only$peak_rpm, na.rm = TRUE)
## [1] 481.6667
median(update_numeric_only$peak_rpm, na.rm = TRUE) 
## [1] 5200
# The mean and median are similar, suggesting a roughly symmetric distribution; imputing with the mean should be acceptable.
mean(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 103.3586
sd(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 37.64512
median(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 95
ggplot(data = update_numeric_only,
       aes(x = horsepower)) +
  geom_histogram(bins = 20)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

# The distribution of horsepower appears right-skewed,
# consistent with the mean (103.4) sitting well above the median (95).
# Given this skew, we will impute with the median, which is more robust to outliers than the mean.
mean(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 122
sd(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 35.44217
median(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 115
ggplot(data = update_numeric_only,
       aes(x = normalized_losses)) +
  geom_histogram(bins = 20)
## Warning: Removed 36 rows containing non-finite values (stat_bin).

# The distribution of normalized_losses also appears right-skewed,
# again reflected in the gap between its mean (122) and median (115).
# As such, we will impute with the median rather than the mean.
update_numeric_only = update_numeric_only %>% 
                        mutate(
                          bore = ifelse(is.na(bore), round(mean(bore, na.rm = TRUE), 2), bore),
                          stroke = ifelse(is.na(stroke), round(mean(stroke, na.rm = TRUE), 2), stroke),
                          peak_rpm = ifelse(is.na(peak_rpm), round(mean(peak_rpm, na.rm = TRUE), 2), peak_rpm),
                          horsepower = ifelse(is.na(horsepower), median(horsepower, na.rm = TRUE), horsepower),
                          normalized_losses = ifelse(is.na(normalized_losses), median(normalized_losses, na.rm = TRUE), normalized_losses)
                        )
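
As a sanity check (a minimal sketch), we can confirm that the imputation left no missing values behind:

# Every column should now report zero NAs
colSums(is.na(update_numeric_only))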

clean_cars_numeric = update_numeric_only

head(update_numeric_only)
##   symboling normalized_losses wheel_base length width height curb_weight
## 1         3               115       88.6  168.8  64.1   48.8        2548
## 2         1               115       94.5  171.2  65.5   52.4        2823
## 3         2               164       99.8  176.6  66.2   54.3        2337
## 4         2               164       99.4  176.6  66.4   54.3        2824
## 5         2               115       99.8  177.3  66.3   53.1        2507
## 6         1               158      105.8  192.7  71.4   55.7        2844
##   engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1         130 3.47   2.68         9.0        111     5000       21          27
## 2         152 2.68   3.47         9.0        154     5000       19          26
## 3         109 3.19   3.40        10.0        102     5500       24          30
## 4         136 3.19   3.40         8.0        115     5500       18          22
## 5         136 3.19   3.40         8.5        110     5500       19          25
## 6         136 3.19   3.40         8.5        110     5500       19          25
##   price
## 1 16500
## 2 16500
## 3 13950
## 4 17450
## 5 15250
## 6 17710

For the purpose of building our predictive model, we will proceed with the imputed data frame.

However, we will also look at the complete-cases only dataset in the extra section.

STEP 2: Examining Relationships b/t Predictors and Price

featurePlot(clean_cars_numeric$horsepower, clean_cars_numeric$price, labels = c('horsepower','price'))

# There appears to be some evidence of a positive association b/t horsepower and sale price of cars

featurePlot(clean_cars_numeric$peak_rpm, clean_cars_numeric$price, labels = c('peak RPM','price'))

# There appears to be no association b/t peak RPM and sale price of cars 

featurePlot(clean_cars_numeric$city_mpg, clean_cars_numeric$price, labels = c('city MPG','price'))

# There appears to be a negative association between city fuel economy (MPG) and sale price of cars

featurePlot(clean_cars_numeric$highway_mpg, clean_cars_numeric$price, labels = c('highway MPG','price'))

# There appears to be a negative association between highway fuel economy (MPG) and sale price of cars

featurePlot(clean_cars_numeric$compression, clean_cars_numeric$price, labels = c('compression','price'))

# There appears to be a bimodal distribution b/t compression and sale price of cars -> likely no association

featurePlot(clean_cars_numeric$stroke, clean_cars_numeric$price, labels = c('stroke','price'))

# There appears to be no association b/t stroke and sale price of cars

featurePlot(clean_cars_numeric$bore, clean_cars_numeric$price, labels = c('bore','price'))

# There appears to be a very weak positive trend b/t bore and sale price of cars

featurePlot(clean_cars_numeric$engine_size, clean_cars_numeric$price, labels = c('engine size','price'))

# There is a positive relationship b/t increasing engine size and sale price of cars

featurePlot(clean_cars_numeric$curb_weight, clean_cars_numeric$price, labels = c('curb_weight','price'))

# There appears to be somewhat of a positive relationship (fairly linear) b/t curb weight of the vehicle and sale price

featurePlot(clean_cars_numeric$height, clean_cars_numeric$price, labels = c('height','price'))

# There doesn't appear to be a linear relationship at all. 

featurePlot(clean_cars_numeric$length, clean_cars_numeric$price, labels = c('length','price'))

# There seems to be a weak positive relationship b/t length of car and its sale price

featurePlot(clean_cars_numeric$wheel_base, clean_cars_numeric$price, labels = c('wheel base','price'))

# There appears to be a very weak positive relationship (bordering on none) b/t wheelbase of the car and its sale price

featurePlot(clean_cars_numeric$normalized_losses, clean_cars_numeric$price, labels = c('normalized losses','price'))

# There appears to be no relationship b/t sale price and normalized losses

featurePlot(clean_cars_numeric$symboling, clean_cars_numeric$price, labels = c('symboling','price'))

# There appears to be no relationship b/t sale price and symboling
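
Since clean_cars_numeric is entirely numeric, the pairwise impressions above can also be summarized with a single correlation ranking against price (a minimal sketch using base R's cor()):

# Rank the numeric predictors by their linear correlation with price
sort(cor(clean_cars_numeric)[, 'price'], decreasing = TRUE)
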
ggplot(data = clean_cars_numeric,
       aes(x = price)) +
  geom_histogram(bins = 10)

mean(clean_cars_numeric$price)
## [1] 13205.69
sd(clean_cars_numeric$price)
## [1] 7966.983
median(clean_cars_numeric$price)
## [1] 10270
max(clean_cars_numeric$price) # 45400
## [1] 45400
2*sd(clean_cars_numeric$price) + mean(clean_cars_numeric$price) # mean + 2 SD, treated as the upper bound of expected car prices
## [1] 29139.66

Based on these measures of central tendency, the distribution of car sale prices appears right-skewed.

If we take mean +/- 2 SD of a normal distribution as the expected price range, we find 14 cars with prices at or above the upper bound of this range.
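
That count can be verified directly (a minimal sketch):

# Number of cars priced at or above mean + 2*SD
sum(clean_cars_numeric$price >= mean(clean_cars_numeric$price) + 2 * sd(clean_cars_numeric$price))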

Let’s examine these outliers in comparison to cars within the expected price range.

price_cutoff <- mean(clean_cars_numeric$price) + 2 * sd(clean_cars_numeric$price) # 29139.66
pricy_cars <- clean_cars_numeric %>%
                mutate(outliers = ifelse(price >= price_cutoff, "outlier", "not"))
head(pricy_cars)
##   symboling normalized_losses wheel_base length width height curb_weight
## 1         3               115       88.6  168.8  64.1   48.8        2548
## 2         1               115       94.5  171.2  65.5   52.4        2823
## 3         2               164       99.8  176.6  66.2   54.3        2337
## 4         2               164       99.4  176.6  66.4   54.3        2824
## 5         2               115       99.8  177.3  66.3   53.1        2507
## 6         1               158      105.8  192.7  71.4   55.7        2844
##   engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1         130 3.47   2.68         9.0        111     5000       21          27
## 2         152 2.68   3.47         9.0        154     5000       19          26
## 3         109 3.19   3.40        10.0        102     5500       24          30
## 4         136 3.19   3.40         8.0        115     5500       18          22
## 5         136 3.19   3.40         8.5        110     5500       19          25
## 6         136 3.19   3.40         8.5        110     5500       19          25
##   price outliers
## 1 16500      not
## 2 16500      not
## 3 13950      not
## 4 17450      not
## 5 15250      not
## 6 17710      not
summarization = pricy_cars %>% 
  select(outliers, price, highway_mpg, city_mpg, peak_rpm, horsepower, engine_size, curb_weight) %>%
  group_by(outliers) %>%
  summarize(
    mean_price = mean(price), sd_price = sd(price), median_price = median(price), 
    mean_highway = mean(highway_mpg), sd_highway = sd(highway_mpg), median_highway = median(highway_mpg), 
    mean_city = mean(city_mpg), sd_city = sd(city_mpg), median_city = median(city_mpg),
    mean_rpm = mean(peak_rpm), sd_rpm = sd(peak_rpm), median_rpm = median(peak_rpm),
    mean_horsepower = mean(horsepower), sd_horsepower = sd(horsepower), median_horsepower = median(horsepower),
    mean_engine = mean(engine_size), sd_engine = sd(engine_size), median_engine = median(engine_size),
    mean_weight = mean(curb_weight), sd_weight = sd(curb_weight), median_weight = median(curb_weight)
  )

summarization
## # A tibble: 2 × 22
##   outliers mean_price sd_price median_…¹ mean_…² sd_hi…³ media…⁴ mean_…⁵ sd_city
##   <chr>         <dbl>    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 not          11492.    4991.     9960.    31.5    6.39    31      25.9    6.10
## 2 outlier      35967.    4153.    35303     20.5    3.46    19.5    15.9    2.13
## # … with 13 more variables: median_city <dbl>, mean_rpm <dbl>, sd_rpm <dbl>,
## #   median_rpm <dbl>, mean_horsepower <dbl>, sd_horsepower <dbl>,
## #   median_horsepower <dbl>, mean_engine <dbl>, sd_engine <dbl>,
## #   median_engine <dbl>, mean_weight <dbl>, sd_weight <dbl>,
## #   median_weight <dbl>, and abbreviated variable names ¹​median_price,
## #   ²​mean_highway, ³​sd_highway, ⁴​median_highway, ⁵​mean_city

These outlier vehicles were found to have:

  1. significantly lower highway and city fuel economy (MPG)
  2. significantly greater horsepower, engine size and weight

However, on closer inspection of these vehicles, it may be that luxury-brand vehicles (judging by the automakers' names) command a pricing premium.
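
The brand hypothesis can be checked against the original cars data frame, which still carries the make column (a minimal sketch; 29139.66 is the mean + 2 SD cutoff computed above):

# Tally the makes of the outlier-priced cars
cars %>%
  filter(!is.na(price), price >= 29139.66) %>%
  count(make, sort = TRUE)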

As these factors are plausible drivers of sale price, and since the outliers' trends are in line with those of the non-outlier vehicles, we will keep these outliers in our dataset.

STEP 3: Setting up our Model

PART A: Splitting up the dataset into a training set and testing set.

To balance variance against bias (keeping enough data to train on while retaining a reliable hold-out estimate), we will use an 80% training / 20% testing split.

set.seed(1) # For the sake of reproducibility 

train_indices = createDataPartition(y = clean_cars_numeric[['price']], 
                                    p =  0.8, 
                                    list = FALSE)

training_listings = clean_cars_numeric[train_indices, ] # Should be about 161
testing_listings = clean_cars_numeric[-train_indices,] # Should be about 40 
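
A quick check that the partition came out as expected (a minimal sketch):

nrow(training_listings) # expect roughly 160 rows
nrow(testing_listings)  # expect roughly 40 rows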

PART B: Hyperparameter Tuning

For this process, we will be using the Grid Search method to tune hyperparameters in our model.

As our intention is to use the KNN approach, there is only one hyperparameter to tune (i.e. k), so the grid is simply a range of candidate k values.

cv_folds = 15 # roughly sqrt(n) for the ~200 usable observations
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
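
Once a model has been fit below, its cross-validation results and the selected k can be inspected through standard fields of caret's train object (sketched here as comments, since no model exists yet):

# knn_model_1$results  # cross-validated RMSE, R-squared and MAE for each k in the grid
# knn_model_1$bestTune # the k that minimized the cross-validated RMSE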

PART C: Create the KNN Model

This will be the longest part, as it involves a series of experiments to determine which model best predicts car price. There are several ways to approach this.

FIRST: Rationalizing predictor selection

Using multiple resources that have explored the topic of vehicle prices (both new and used), as listed here, here, here, and here, we see that several factors within our data frame may be ideal candidates as predictors: 1) city + highway mileage and 2) engine size.

Looking past this, we can also speculate that horsepower and torque may play a role, considering their impact on fuel consumption; as the dataset has no torque column, we will use peak RPM (peak_rpm) as its stand-in.

Furthermore, it can also be speculated that car size plays a role in vehicle pricing, given its relationship to insurance rates as noted here, the changing landscape of the marketplace as noted here, and the influence of personal economics on car-purchasing habits as noted here.

Thus, we will compare several nested models:

  1. A model containing city MPG and highway MPG

  2. A model containing city MPG, highway MPG and engine size

  3. A model containing city MPG, highway MPG, engine size and horsepower

  4. A model containing city MPG, highway MPG, engine size, horsepower and peak RPM

  5. A model containing city MPG, highway MPG, engine size, horsepower, peak RPM and curb weight

  6. A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight and length

  7. A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight, length and width

  8. A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight, length, width and height

knn_model_1 = train(price ~ city_mpg + highway_mpg,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_2 = train(price ~ city_mpg + highway_mpg + engine_size,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_3 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_4 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_5 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_6 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_7 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_8 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height,
                    data = clean_cars_numeric,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
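
As an aside, the same eight nested models could be fit more compactly with a loop; this is a sketch that assumes the clean_cars_numeric, train_control and knn_grid objects defined above:

# Model i uses the first (i + 1) predictors, matching models 1 through 8
predictors = c('city_mpg', 'highway_mpg', 'engine_size', 'horsepower',
               'peak_rpm', 'curb_weight', 'length', 'width', 'height')
knn_models = lapply(1:8, function(i) {
  train(reformulate(predictors[1:(i + 1)], response = 'price'),
        data = clean_cars_numeric,
        method = 'knn',
        trControl = train_control,
        preProcess = c('center', 'scale'),
        tuneGrid = knn_grid)
})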

SECOND: Data Dredging Approach

This is essentially the “throw it against the wall and see what sticks” approach, where we go through each variable and see which variables/parameters, once added to our model, best predict car prices.

Whilst we could do this manually by adding each variable individually and seeing what sticks (i.e. one model has price ~ city_mpg, another has price ~ horsepower, etc.), we can instead use a stepwise regression analysis to see which variables are retained for our dataset.

We will retain the model with the best AIC score.
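
As a quick refresher, AIC balances goodness of fit against model complexity, and the model with the lower score is preferred:

AIC = 2k - 2*ln(L)

where k is the number of estimated parameters and L is the maximized likelihood of the fitted model.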

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(leaps)
stepwise_model = train(price ~., # includes all variables within the dataframe
                       data = clean_cars_numeric,
                       trControl = train_control, 
                       method = 'leapSeq',
                       preProcess = c('center', 'scale'),
                       tuneGrid = data.frame(nvmax = 1:15)) 
# nvmax is the tuning parameter: the maximum number of predictors to be incorporated

# Another approach to using stepwise regression 
stepwise_model_A = train(price ~., 
                         data = clean_cars_numeric,
                         method = 'lmStepAIC',
                         trControl = train_control,
                         trace = FALSE)

# Another approach to using bi-directional stepwise regression 
linear_model_all <- lm(price ~., data = clean_cars_numeric)
stepwise_model_B <- stepAIC(linear_model_all, direction = "both", trace = FALSE)
stepwise_model_B
## 
## Call:
## lm(formula = price ~ width + height + engine_size + stroke + 
##     compression + horsepower + peak_rpm, data = clean_cars_numeric)
## 
## Coefficients:
## (Intercept)        width       height  engine_size       stroke  compression  
##  -68275.619      698.961      187.970      112.213    -2729.523      272.478  
##  horsepower     peak_rpm  
##      51.880        2.314

PART D: Evaluate the models.

Let’s see how these models turn out.

# The Educated-Guess Approach
testing = testing_listings %>% 
  mutate(
    model_one_prediction = predict(knn_model_1, newdata = testing_listings),
    model_two_prediction = predict(knn_model_2, newdata = testing_listings),
    model_three_prediction = predict(knn_model_3, newdata = testing_listings),
    model_four_prediction = predict(knn_model_4, newdata = testing_listings),
    model_five_prediction = predict(knn_model_5, newdata = testing_listings),
    model_six_prediction = predict(knn_model_6, newdata = testing_listings),
    model_seven_prediction = predict(knn_model_7, newdata = testing_listings),
    model_eight_prediction = predict(knn_model_8, newdata = testing_listings),
    sq_error_model_one = (price - model_one_prediction)^2,
    sq_error_model_two = (price - model_two_prediction)^2,
    sq_error_model_three = (price - model_three_prediction)^2,
    sq_error_model_four = (price - model_four_prediction)^2,
    sq_error_model_five = (price - model_five_prediction)^2,
    sq_error_model_six = (price - model_six_prediction)^2,
    sq_error_model_seven = (price - model_seven_prediction)^2,
    sq_error_model_eight = (price - model_eight_prediction)^2
  )

long_testing = testing %>% 
  pivot_longer(
    cols =  sq_error_model_one:sq_error_model_eight,
    names_to = 'model',
    values_to = 'sq_error'
  )

rmse_by_model = long_testing %>% 
                  group_by(model) %>%
                  summarize(rmse = sqrt(mean(sq_error)))
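# The resulting table can be sorted to rank all eight KNN models by their
# test RMSE (output not reproduced here; the ranking is discussed below):
rmse_by_model %>% arrange(rmse) # lowest RMSE first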

summed_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight, data = clean_cars_numeric)
summed_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length, data = clean_cars_numeric)
summed_model_c = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height, data = clean_cars_numeric)
predictions_model_a <- predict(summed_model_a, newdata = testing_listings)
predictions_model_b <- predict(summed_model_b, newdata = testing_listings)
predictions_model_c <- predict(summed_model_c, newdata = testing_listings)
postResample(pred = predictions_model_a, obs = testing_listings$price) # RMSE = 2759.495
##         RMSE     Rsquared          MAE 
## 2759.4948088    0.8652066 1962.9642552
postResample(pred = predictions_model_b, obs = testing_listings$price) # RMSE = 2752.390
##         RMSE     Rsquared          MAE 
## 2752.3901296    0.8659432 1950.4532659
postResample(pred = predictions_model_c, obs = testing_listings$price) # RMSE = 2563.864
##        RMSE    Rsquared         MAE 
## 2563.864202    0.885678 1845.171601

Looking at the above findings, we see that the top 3 models that appear to have performed the ‘best’ (based on RMSE) are:

  1. knn_model_5: contains the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm and curb_weight; it performs best with a single nearest neighbour
  2. knn_model_6: contains the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm, curb_weight and length; it performs best with a single nearest neighbour
  3. knn_model_8: contains the predictors city_mpg, highway_mpg, engine_size, horsepower, peak_rpm, curb_weight, length, width and height; it performs best with a single nearest neighbour

However, comparing the RMSE and R-squared across the models, model 8 scored highest, explaining about 82.05% of the variance in car prices.
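
The observation that each of these models favours a single nearest neighbour can be confirmed from the fitted caret objects; as a sketch, for model 8:

knn_model_8$bestTune          # the k chosen by cross-validation
min(knn_model_8$results$RMSE) # the cross-validated RMSE at that k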

# Stepwise Regression Approach
stepwise_model
## Linear Regression with Stepwise Selection 
## 
## 200 samples
##  15 predictor
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (15 fold) 
## Summary of sample sizes: 187, 188, 185, 188, 186, 187, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     3910.091  0.7190993  2859.375
##    2     3846.532  0.7568909  2691.482
##    3     3517.138  0.7924893  2589.376
##    4     3500.449  0.8034444  2575.584
##    5     3499.347  0.8024931  2572.064
##    6     3549.923  0.7939102  2701.091
##    7     3835.687  0.7742852  2667.338
##    8     3420.309  0.8206737  2516.168
##    9     3484.930  0.8152225  2609.172
##   10     3428.751  0.8198426  2495.174
##   11     3504.878  0.8119616  2584.587
##   12     3362.435  0.8289794  2463.348
##   13     3368.405  0.8290710  2484.589
##   14     3365.325  0.8313338  2495.912
##   15     3359.034  0.8310511  2502.934
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 15.
stepwise_model$results 
##    nvmax     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
## 1      1 3910.091 0.7190993 2859.375  679.0218 0.20996012 494.5426
## 2      2 3846.532 0.7568909 2691.482  761.6904 0.14843640 452.8594
## 3      3 3517.138 0.7924893 2589.376  777.0529 0.12660326 594.6951
## 4      4 3500.449 0.8034444 2575.584  836.7178 0.10721483 580.0019
## 5      5 3499.347 0.8024931 2572.064  906.8095 0.10827541 584.3168
## 6      6 3549.923 0.7939102 2701.091 1184.5210 0.11898552 788.4440
## 7      7 3835.687 0.7742852 2667.338 1227.4745 0.13325168 743.3132
## 8      8 3420.309 0.8206737 2516.168  919.6046 0.10047677 654.6060
## 9      9 3484.930 0.8152225 2609.172  929.3991 0.10092127 675.9808
## 10    10 3428.751 0.8198426 2495.174  808.8460 0.10572377 498.7029
## 11    11 3504.878 0.8119616 2584.587 1016.4914 0.11729074 621.1184
## 12    12 3362.435 0.8289794 2463.348  905.4385 0.10026686 602.0992
## 13    13 3368.405 0.8290710 2484.589  942.5705 0.09984006 677.1885
## 14    14 3365.325 0.8313338 2495.912  953.0371 0.09811661 676.0982
## 15    15 3359.034 0.8310511 2502.934  921.9604 0.10105924 651.0425
summary(stepwise_model$finalModel) 
## Subset selection object
## 15 Variables  (and intercept)
##                   Forced in Forced out
## symboling             FALSE      FALSE
## normalized_losses     FALSE      FALSE
## wheel_base            FALSE      FALSE
## length                FALSE      FALSE
## width                 FALSE      FALSE
## height                FALSE      FALSE
## curb_weight           FALSE      FALSE
## engine_size           FALSE      FALSE
## bore                  FALSE      FALSE
## stroke                FALSE      FALSE
## compression           FALSE      FALSE
## horsepower            FALSE      FALSE
## peak_rpm              FALSE      FALSE
## city_mpg              FALSE      FALSE
## highway_mpg           FALSE      FALSE
## 1 subsets of each size up to 15
## Selection Algorithm: 'sequential replacement'
##           symboling normalized_losses wheel_base length width height
## 1  ( 1 )  " "       " "               " "        " "    " "   " "   
## 2  ( 1 )  " "       " "               " "        " "    " "   " "   
## 3  ( 1 )  " "       " "               " "        " "    " "   " "   
## 4  ( 1 )  " "       " "               " "        " "    " "   " "   
## 5  ( 1 )  " "       " "               " "        " "    " "   " "   
## 6  ( 1 )  " "       " "               " "        " "    "*"   " "   
## 7  ( 1 )  " "       " "               " "        " "    "*"   "*"   
## 8  ( 1 )  " "       " "               " "        " "    "*"   "*"   
## 9  ( 1 )  " "       " "               " "        "*"    "*"   "*"   
## 10  ( 1 ) " "       " "               " "        "*"    "*"   "*"   
## 11  ( 1 ) " "       " "               "*"        "*"    "*"   "*"   
## 12  ( 1 ) "*"       " "               "*"        "*"    "*"   "*"   
## 13  ( 1 ) "*"       "*"               "*"        "*"    "*"   "*"   
## 14  ( 1 ) "*"       "*"               "*"        "*"    "*"   "*"   
## 15  ( 1 ) "*"       "*"               "*"        "*"    "*"   "*"   
##           curb_weight engine_size bore stroke compression horsepower peak_rpm
## 1  ( 1 )  " "         "*"         " "  " "    " "         " "        " "     
## 2  ( 1 )  "*"         "*"         " "  " "    " "         " "        " "     
## 3  ( 1 )  "*"         "*"         " "  " "    " "         " "        "*"     
## 4  ( 1 )  "*"         "*"         " "  "*"    " "         " "        "*"     
## 5  ( 1 )  "*"         "*"         " "  "*"    "*"         " "        "*"     
## 6  ( 1 )  " "         "*"         " "  "*"    "*"         "*"        "*"     
## 7  ( 1 )  " "         "*"         " "  "*"    "*"         "*"        "*"     
## 8  ( 1 )  " "         "*"         " "  "*"    "*"         "*"        "*"     
## 9  ( 1 )  " "         "*"         " "  "*"    "*"         "*"        "*"     
## 10  ( 1 ) " "         "*"         " "  "*"    "*"         "*"        "*"     
## 11  ( 1 ) " "         "*"         " "  "*"    "*"         "*"        "*"     
## 12  ( 1 ) " "         "*"         " "  "*"    "*"         "*"        "*"     
## 13  ( 1 ) "*"         "*"         "*"  "*"    "*"         "*"        "*"     
## 14  ( 1 ) "*"         "*"         "*"  "*"    "*"         "*"        "*"     
## 15  ( 1 ) "*"         "*"         "*"  "*"    "*"         "*"        "*"     
##           city_mpg highway_mpg
## 1  ( 1 )  " "      " "        
## 2  ( 1 )  " "      " "        
## 3  ( 1 )  " "      " "        
## 4  ( 1 )  " "      " "        
## 5  ( 1 )  " "      " "        
## 6  ( 1 )  " "      " "        
## 7  ( 1 )  " "      " "        
## 8  ( 1 )  "*"      " "        
## 9  ( 1 )  "*"      " "        
## 10  ( 1 ) "*"      "*"        
## 11  ( 1 ) "*"      "*"        
## 12  ( 1 ) "*"      "*"        
## 13  ( 1 ) " "      " "        
## 14  ( 1 ) "*"      " "        
## 15  ( 1 ) "*"      "*"
summarized_model = lm(price ~ city_mpg + peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)

summary(summarized_model)
## 
## Call:
## lm(formula = price ~ city_mpg + peak_rpm + horsepower + compression + 
##     stroke + engine_size + height + width, data = clean_cars_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10996.2  -1684.7      3.9   1578.1  13813.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.668e+04  1.433e+04  -3.955 0.000108 ***
## city_mpg    -9.956e+01  7.581e+01  -1.313 0.190641    
## peak_rpm     2.337e+00  6.385e-01   3.660 0.000326 ***
## horsepower   3.947e+01  1.681e+01   2.348 0.019914 *  
## compression  3.159e+02  7.577e+01   4.169 4.64e-05 ***
## stroke      -2.742e+03  7.721e+02  -3.551 0.000483 ***
## engine_size  1.156e+02  1.346e+01   8.593 3.00e-15 ***
## height       1.645e+02  1.098e+02   1.498 0.135733    
## width        5.852e+02  1.978e+02   2.959 0.003476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3209 on 191 degrees of freedom
## Multiple R-squared:  0.8443, Adjusted R-squared:  0.8378 
## F-statistic: 129.5 on 8 and 191 DF,  p-value: < 2.2e-16
prediction_summarized_model_a = predict(summarized_model, newdata = testing_listings)

postResample(pred = prediction_summarized_model_a, obs = testing_listings$price) # RMSE = 2546.078
##         RMSE     Rsquared          MAE 
## 2546.0779941    0.8878266 1854.2849324

In the course of identifying the number of predictors to include in the model, the best model appears to have included 8 variables: although caret formally selected nvmax = 15 (smallest RMSE), the 8-variable subset achieves a nearly identical cross-validated RMSE (3420.3 versus 3359.0) with roughly half the predictors, so we examine it here.

These 8 variables are: width, height, engine size, stroke, compression, horsepower, peak RPM and city fuel economy.

Looking at this model, it could explain ~ 83.78% of the variance in car prices (adjusted R-squared).
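
The variables in the eight-predictor subset can also be pulled straight from the leaps object rather than read off the selection matrix; a sketch, where id = 8 selects the eight-variable subset:

coef(stepwise_model$finalModel, id = 8) # named coefficients of the 8-variable model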

# Stepwise Regression Approach - Option B
stepwise_model_A
## Linear Regression with Stepwise Selection 
## 
## 200 samples
##  15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (15 fold) 
## Summary of sample sizes: 186, 187, 185, 187, 188, 187, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   3449.37  0.8121083  2561.163
stepwise_model_A$results
##   parameter    RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      none 3449.37 0.8121083 2561.163 802.9063 0.08966347 427.2272
stepwise_model_A$finalModel
## 
## Call:
## lm(formula = .outcome ~ width + height + engine_size + stroke + 
##     compression + horsepower + peak_rpm, data = dat)
## 
## Coefficients:
## (Intercept)        width       height  engine_size       stroke  compression  
##  -68275.619      698.961      187.970      112.213    -2729.523      272.478  
##  horsepower     peak_rpm  
##      51.880        2.314
summary(stepwise_model_A$finalModel)
## 
## Call:
## lm(formula = .outcome ~ width + height + engine_size + stroke + 
##     compression + horsepower + peak_rpm, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11400.0  -1666.3    -17.2   1499.1  13720.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.828e+04  1.131e+04  -6.036 8.01e-09 ***
## width        6.990e+02  1.781e+02   3.924 0.000121 ***
## height       1.880e+02  1.086e+02   1.732 0.084954 .  
## engine_size  1.122e+02  1.323e+01   8.483 5.81e-15 ***
## stroke      -2.730e+03  7.735e+02  -3.529 0.000523 ***
## compression  2.725e+02  6.832e+01   3.988 9.46e-05 ***
## horsepower   5.188e+01  1.393e+01   3.723 0.000259 ***
## peak_rpm     2.314e+00  6.395e-01   3.618 0.000379 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared:  0.8429, Adjusted R-squared:  0.8372 
## F-statistic: 147.2 on 7 and 192 DF,  p-value: < 2.2e-16
summarize_model_A = lm(price ~ peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)
summary(summarize_model_A)
## 
## Call:
## lm(formula = price ~ peak_rpm + horsepower + compression + stroke + 
##     engine_size + height + width, data = clean_cars_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11400.0  -1666.3    -17.2   1499.1  13720.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.828e+04  1.131e+04  -6.036 8.01e-09 ***
## peak_rpm     2.314e+00  6.395e-01   3.618 0.000379 ***
## horsepower   5.188e+01  1.393e+01   3.723 0.000259 ***
## compression  2.725e+02  6.832e+01   3.988 9.46e-05 ***
## stroke      -2.730e+03  7.735e+02  -3.529 0.000523 ***
## engine_size  1.122e+02  1.323e+01   8.483 5.81e-15 ***
## height       1.880e+02  1.086e+02   1.732 0.084954 .  
## width        6.990e+02  1.781e+02   3.924 0.000121 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared:  0.8429, Adjusted R-squared:  0.8372 
## F-statistic: 147.2 on 7 and 192 DF,  p-value: < 2.2e-16
prediction_summarize_model_a = predict(summarize_model_A, newdata = testing_listings)
postResample(pred = prediction_summarize_model_a, obs = testing_listings$price) # RMSE = 2523.841
##         RMSE     Rsquared          MAE 
## 2523.8409698    0.8904233 1851.9866798
# Stepwise Regression Approach - Option C
stepwise_model_B
## 
## Call:
## lm(formula = price ~ width + height + engine_size + stroke + 
##     compression + horsepower + peak_rpm, data = clean_cars_numeric)
## 
## Coefficients:
## (Intercept)        width       height  engine_size       stroke  compression  
##  -68275.619      698.961      187.970      112.213    -2729.523      272.478  
##  horsepower     peak_rpm  
##      51.880        2.314
summary(stepwise_model_B)
## 
## Call:
## lm(formula = price ~ width + height + engine_size + stroke + 
##     compression + horsepower + peak_rpm, data = clean_cars_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11400.0  -1666.3    -17.2   1499.1  13720.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.828e+04  1.131e+04  -6.036 8.01e-09 ***
## width        6.990e+02  1.781e+02   3.924 0.000121 ***
## height       1.880e+02  1.086e+02   1.732 0.084954 .  
## engine_size  1.122e+02  1.323e+01   8.483 5.81e-15 ***
## stroke      -2.730e+03  7.735e+02  -3.529 0.000523 ***
## compression  2.725e+02  6.832e+01   3.988 9.46e-05 ***
## horsepower   5.188e+01  1.393e+01   3.723 0.000259 ***
## peak_rpm     2.314e+00  6.395e-01   3.618 0.000379 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared:  0.8429, Adjusted R-squared:  0.8372 
## F-statistic: 147.2 on 7 and 192 DF,  p-value: < 2.2e-16
prediction_stepwise_model_B = predict(stepwise_model_B, newdata = testing_listings)
postResample(pred = prediction_stepwise_model_B, obs = testing_listings$price) # RMSE = 2523.841
##         RMSE     Rsquared          MAE 
## 2523.8409698    0.8904233 1851.9866798

In the course of identifying the number of predictors to include in the model, the best model appears to have included 7 variables.

These 7 variables are: width, height, engine size, stroke, compression, horsepower and peak RPM.

Looking at this model, it could explain ~ 83.72% of the variance in car prices (adjusted R-squared).

NOTE: interestingly, whilst these last two stepwise models performed slightly better on the test set than the educated-guess approach, including city fuel economy as an additional predictor made the predictions marginally worse (test RMSE 2546.08 with city_mpg versus 2523.84 without).

CONCLUSION

Overall, looking at how car characteristics influence car prices, the major players according to our models are: car width, car height, engine size, engine stroke, engine compression, horsepower, peak RPM and city fuel economy. It should be noted that further analysis should take into consideration variables that are categorical/nominal in nature, such as branding, body type, etc.
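
One way to fold such categorical variables into a future model would be caret's dummyVars one-hot encoder. A minimal sketch, where cars_with_factors is a hypothetical data frame that keeps make and body_style alongside price:

# Hypothetical: one-hot encode two categorical columns before refitting
dmy = dummyVars(price ~ make + body_style, data = cars_with_factors)
encoded = predict(dmy, newdata = cars_with_factors)
cars_encoded = cbind(price = cars_with_factors$price, as.data.frame(encoded))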

EXTRA: Analysis Using the complete-cases dataset.

STEP 1: Trying the educated-guess approach

train_indicesA = createDataPartition(y = cars_numeric_only_1[['price']], 
                                    p =  0.8, 
                                    list = FALSE)
cv_folds = 13 # roughly sqrt(n), with n = 160 observations
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
testing_listingsA = cars_numeric_only_1[-train_indicesA,] # Should be about 32
training_listingsA = cars_numeric_only_1[train_indicesA, ] # Should be about 128
knn_model_1A = train(price ~ city_mpg + highway_mpg,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_2A = train(price ~ city_mpg + highway_mpg + engine_size,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_3A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_4A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_5A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_6A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_7A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
knn_model_8A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height,
                    data = cars_numeric_only_1,
                    method = 'knn',
                    trControl = train_control,
                    preProcess = c('center', 'scale'), 
                    tuneGrid = knn_grid) 
testingA = testing_listingsA %>% 
  mutate(
    model_one_prediction = predict(knn_model_1A, newdata = testing_listingsA),
    model_two_prediction = predict(knn_model_2A, newdata = testing_listingsA),
    model_three_prediction = predict(knn_model_3A, newdata = testing_listingsA),
    model_four_prediction = predict(knn_model_4A, newdata = testing_listingsA),
    model_five_prediction = predict(knn_model_5A, newdata = testing_listingsA),
    model_six_prediction = predict(knn_model_6A, newdata = testing_listingsA),
    model_seven_prediction = predict(knn_model_7A, newdata = testing_listingsA),
    model_eight_prediction = predict(knn_model_8A, newdata = testing_listingsA),
    sq_error_model_one = (price - model_one_prediction)^2,
    sq_error_model_two = (price - model_two_prediction)^2,
    sq_error_model_three = (price - model_three_prediction)^2,
    sq_error_model_four = (price - model_four_prediction)^2,
    sq_error_model_five = (price - model_five_prediction)^2,
    sq_error_model_six = (price - model_six_prediction)^2,
    sq_error_model_seven = (price - model_seven_prediction)^2,
    sq_error_model_eight = (price - model_eight_prediction)^2
  )
long_testingA = testingA %>% 
  pivot_longer(
    cols =  sq_error_model_one:sq_error_model_eight,
    names_to = 'model',
    values_to = 'sq_error'
  )
rmse_by_modelA = long_testingA %>% 
                  group_by(model) %>%
                  summarize(rmse = sqrt(mean(sq_error)))
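# As before, the per-model test RMSEs can be ranked to compare the eight
# complete-cases KNN models directly (output not reproduced here):
rmse_by_modelA %>% arrange(rmse) # lowest RMSE first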
projected_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length +  width + height, data = cars_numeric_only_1)
summary(projected_model_a)
## 
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower + 
##     peak_rpm + curb_weight + length + width + height, data = cars_numeric_only_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5520.5 -1294.0  -358.8  1153.5  7937.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.925e+04  1.454e+04  -4.762 4.48e-06 ***
## city_mpg     7.893e+01  1.477e+02   0.534   0.5939    
## highway_mpg -3.457e+01  1.388e+02  -0.249   0.8037    
## engine_size  3.578e+01  1.786e+01   2.004   0.0469 *  
## horsepower   1.550e+01  1.628e+01   0.952   0.3425    
## peak_rpm     8.516e-01  5.369e-01   1.586   0.1148    
## curb_weight  6.700e+00  1.493e+00   4.487 1.43e-05 ***
## length      -6.733e+01  4.307e+01  -1.563   0.1201    
## width        9.491e+02  2.231e+02   4.255 3.67e-05 ***
## height       4.569e+01  1.244e+02   0.367   0.7139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2436 on 150 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8274 
## F-statistic: 85.66 on 9 and 150 DF,  p-value: < 2.2e-16
projected_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length , data = cars_numeric_only_1)
summary(projected_model_b)
## 
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower + 
##     peak_rpm + curb_weight + length, data = cars_numeric_only_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7121.2 -1225.3   -57.8  1013.9  8107.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.173e+04  7.512e+03  -2.893  0.00438 ** 
## city_mpg     1.560e+02  1.539e+02   1.014  0.31235    
## highway_mpg -9.831e+01  1.447e+02  -0.679  0.49796    
## engine_size  4.234e+01  1.764e+01   2.401  0.01756 *  
## horsepower   1.285e+01  1.664e+01   0.772  0.44122    
## peak_rpm     9.262e-01  5.626e-01   1.646  0.10173    
## curb_weight  8.557e+00  1.422e+00   6.016 1.28e-08 ***
## length       6.698e-01  3.834e+01   0.017  0.98609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2565 on 152 degrees of freedom
## Multiple R-squared:  0.8171, Adjusted R-squared:  0.8086 
## F-statistic: 96.99 on 7 and 152 DF,  p-value: < 2.2e-16
predictions_model_1 <- predict(projected_model_a, newdata = testing_listingsA)
predictions_model_2 <- predict(projected_model_b, newdata = testing_listingsA)
postResample(pred = predictions_model_1, obs = testing_listingsA$price) # RMSE = 2050.448
##         RMSE     Rsquared          MAE 
## 2050.4475753    0.8531872 1605.1406928
postResample(pred = predictions_model_2, obs = testing_listingsA$price) # RMSE = 1924.004
##         RMSE     Rsquared          MAE 
## 1924.0041129    0.8665596 1477.4063047

Looking at the above findings, the eighth model appeared to be the better-performing model by overall fit, explaining about 82.74% of the variance in car prices (adjusted R-squared) on the complete-cases dataset, although on the held-out test set the seven-predictor linear model posted a slightly lower RMSE (1924.00 versus 2050.45).

This model drew on all of the selected variables, suggesting that the full predictor set, taken together, is a better measure of price here than any smaller subset.

stepwise_model_A = train(price ~., # name reused from earlier; refit here on the complete-cases data
                       data = cars_numeric_only_1,
                       trControl = train_control, 
                       method = 'leapSeq',
                       preProcess = c('center', 'scale'),
                       tuneGrid = data.frame(nvmax = 1:15))
stepwise_model_A
## Linear Regression with Stepwise Selection 
## 
## 160 samples
##  15 predictor
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (13 fold) 
## Summary of sample sizes: 148, 148, 147, 148, 148, 148, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     2579.634  0.7929105  1833.293
##    2     2610.867  0.7845960  1889.033
##    3     2747.253  0.7668762  2004.186
##    4     2799.128  0.7703185  1974.953
##    5     2846.507  0.7650220  2014.433
##    6     2921.272  0.7627612  2008.727
##    7     2791.904  0.7703027  1987.699
##    8     2732.024  0.7772226  1934.562
##    9     2731.164  0.7751851  1949.991
##   10     2622.885  0.7928392  1866.182
##   11     2613.598  0.7937393  1859.052
##   12     2624.003  0.7926376  1872.347
##   13     2603.542  0.7962381  1870.056
##   14     2624.413  0.7934172  1882.241
##   15     2627.498  0.7930156  1886.600
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 1.
stepwise_model_A$results 
##    nvmax     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      1 2579.634 0.7929105 1833.293 643.5565  0.1293629 404.9013
## 2      2 2610.867 0.7845960 1889.033 696.7756  0.1368709 475.3891
## 3      3 2747.253 0.7668762 2004.186 687.4821  0.1458420 478.5374
## 4      4 2799.128 0.7703185 1974.953 759.5961  0.1322827 413.9602
## 5      5 2846.507 0.7650220 2014.433 806.0254  0.1308851 447.9836
## 6      6 2921.272 0.7627612 2008.727 636.7451  0.1296065 363.1962
## 7      7 2791.904 0.7703027 1987.699 677.9448  0.1305660 428.5148
## 8      8 2732.024 0.7772226 1934.562 671.4970  0.1258543 405.6321
## 9      9 2731.164 0.7751851 1949.991 646.8468  0.1239406 420.2103
## 10    10 2622.885 0.7928392 1866.182 607.4639  0.1126485 397.7653
## 11    11 2613.598 0.7937393 1859.052 619.4499  0.1187760 395.7822
## 12    12 2624.003 0.7926376 1872.347 602.9018  0.1125727 395.9963
## 13    13 2603.542 0.7962381 1870.056 589.3961  0.1104432 380.4119
## 14    14 2624.413 0.7934172 1882.241 607.4069  0.1120866 387.4576
## 15    15 2627.498 0.7930156 1886.600 609.8408  0.1126217 387.6436
summary(stepwise_model_A$finalModel) # it appears only one predictor variable was retained
## Subset selection object
## 15 Variables  (and intercept)
##                   Forced in Forced out
## symboling             FALSE      FALSE
## normalized_losses     FALSE      FALSE
## wheel_base            FALSE      FALSE
## length                FALSE      FALSE
## width                 FALSE      FALSE
## height                FALSE      FALSE
## curb_weight           FALSE      FALSE
## engine_size           FALSE      FALSE
## bore                  FALSE      FALSE
## stroke                FALSE      FALSE
## compression           FALSE      FALSE
## horsepower            FALSE      FALSE
## peak_rpm              FALSE      FALSE
## city_mpg              FALSE      FALSE
## highway_mpg           FALSE      FALSE
## 1 subsets of each size up to 1
## Selection Algorithm: 'sequential replacement'
##          symboling normalized_losses wheel_base length width height curb_weight
## 1  ( 1 ) " "       " "               " "        " "    " "   " "    "*"        
##          engine_size bore stroke compression horsepower peak_rpm city_mpg
## 1  ( 1 ) " "         " "  " "    " "         " "        " "      " "     
##          highway_mpg
## 1  ( 1 ) " "
summarized_model_A1 = lm(price ~ curb_weight, data = cars_numeric_only_1)
summary(summarized_model_A1)
## 
## Call:
## lm(formula = price ~ curb_weight, data = cars_numeric_only_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9739.8 -1447.2  -144.8  1108.9 10271.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15376.867   1090.042  -14.11   <2e-16 ***
## curb_weight     10.899      0.435   25.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2638 on 158 degrees of freedom
## Multiple R-squared:  0.7989, Adjusted R-squared:  0.7976 
## F-statistic: 627.7 on 1 and 158 DF,  p-value: < 2.2e-16
prediction_summarized_model_a1 = predict(summarized_model_A1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_a1, obs = testing_listingsA$price) # RMSE = 2055.530, R-squared = 0.8522
##         RMSE     Rsquared          MAE 
## 2055.5304142    0.8522134 1573.2709604
# second method
stepwise_model_B = train(price ~., 
                         data = cars_numeric_only_1,
                         method = 'lmStepAIC',
                         trControl = train_control,
                         trace = FALSE)
stepwise_model_B
## Linear Regression with Stepwise Selection 
## 
## 160 samples
##  15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (13 fold) 
## Summary of sample sizes: 148, 148, 148, 146, 148, 148, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2657.779  0.8023394  1915.337
stepwise_model_B$results 
##   parameter     RMSE  Rsquared      MAE  RMSESD RsquaredSD    MAESD
## 1      none 2657.779 0.8023394 1915.337 761.703  0.1091877 562.7411
summary(stepwise_model_B$finalModel)
## 
## Call:
## lm(formula = .outcome ~ wheel_base + length + width + curb_weight + 
##     engine_size + bore + stroke + compression + horsepower + 
##     peak_rpm, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5805.7 -1294.3  -283.1  1099.2  7460.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.582e+04  1.201e+04  -4.646 7.37e-06 ***
## wheel_base   1.736e+02  8.593e+01   2.020 0.045177 *  
## length      -8.955e+01  4.258e+01  -2.103 0.037157 *  
## width        7.865e+02  2.196e+02   3.581 0.000462 ***
## curb_weight  5.153e+00  1.420e+00   3.628 0.000391 ***
## engine_size  5.201e+01  1.753e+01   2.967 0.003507 ** 
## bore        -2.036e+03  1.049e+03  -1.942 0.054036 .  
## stroke      -1.929e+03  7.579e+02  -2.545 0.011931 *  
## compression  1.109e+02  6.467e+01   1.714 0.088563 .  
## horsepower   2.711e+01  1.534e+01   1.768 0.079099 .  
## peak_rpm     8.618e-01  5.468e-01   1.576 0.117134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2352 on 149 degrees of freedom
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8391 
## F-statistic:  83.9 on 10 and 149 DF,  p-value: < 2.2e-16
summarized_model_B1 = lm(price ~ wheel_base + length + width + curb_weight + engine_size + bore + stroke + compression + horsepower + peak_rpm, data = cars_numeric_only_1)
summary(summarized_model_B1)
## 
## Call:
## lm(formula = price ~ wheel_base + length + width + curb_weight + 
##     engine_size + bore + stroke + compression + horsepower + 
##     peak_rpm, data = cars_numeric_only_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5805.7 -1294.3  -283.1  1099.2  7460.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.582e+04  1.201e+04  -4.646 7.37e-06 ***
## wheel_base   1.736e+02  8.593e+01   2.020 0.045177 *  
## length      -8.955e+01  4.258e+01  -2.103 0.037157 *  
## width        7.865e+02  2.196e+02   3.581 0.000462 ***
## curb_weight  5.153e+00  1.420e+00   3.628 0.000391 ***
## engine_size  5.201e+01  1.753e+01   2.967 0.003507 ** 
## bore        -2.036e+03  1.049e+03  -1.942 0.054036 .  
## stroke      -1.929e+03  7.579e+02  -2.545 0.011931 *  
## compression  1.109e+02  6.467e+01   1.714 0.088563 .  
## horsepower   2.711e+01  1.534e+01   1.768 0.079099 .  
## peak_rpm     8.618e-01  5.468e-01   1.576 0.117134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2352 on 149 degrees of freedom
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8391 
## F-statistic:  83.9 on 10 and 149 DF,  p-value: < 2.2e-16
prediction_summarized_model_B1 = predict(summarized_model_B1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_B1, obs = testing_listingsA$price) # RMSE = 2103.202, R-squared = 0.8441
##         RMSE     Rsquared          MAE 
## 2103.2021951    0.8440939 1617.9408960

Turning to the stepwise regression analyses on the complete-cases data, the leapSeq run retained only curb_weight as a predictor, and that single-variable model explained ~ 85.2% of the variance in car prices on the test set.

In comparison with the stepwise approach, the KNN method provided the better prediction of car prices within the complete-cases dataset.