In this project, we will develop a model to predict market prices for cars from their characteristics using the k-nearest neighbors (KNN) algorithm.
We will use a dataset from the UCI Machine Learning Repository, compiled in 1985 from automotive specifications and insurance loss information.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(broom)
library(lattice)
library(ggplot2)
library(knitr)
library(aod)
library(caret)
library(stringr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
library(purrr)
library(readr)
library(readxl)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
# NOTE: imports-85.data ships without a header row, so read.csv() with the default
# header = TRUE treats the first record as column names, leaving 204 rows rather
# than 205; passing header = FALSE (and na.strings = '?') would keep them all.
cars <- read.csv("imports-85.data")
colnames(cars) <- c('symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style',
'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression', 'horsepower',
'peak_rpm', 'city_mpg', 'highway_mpg', 'price')
str(cars)
## 'data.frame': 204 obs. of 26 variables:
## $ symboling : int 3 1 2 2 2 1 1 1 0 2 ...
## $ normalized_losses: chr "?" "?" "164" "164" ...
## $ make : chr "alfa-romero" "alfa-romero" "audi" "audi" ...
## $ fuel_type : chr "gas" "gas" "gas" "gas" ...
## $ aspiration : chr "std" "std" "std" "std" ...
## $ num_doors : chr "two" "two" "four" "four" ...
## $ body_style : chr "convertible" "hatchback" "sedan" "sedan" ...
## $ drive_wheels : chr "rwd" "rwd" "fwd" "4wd" ...
## $ engine_location : chr "front" "front" "front" "front" ...
## $ wheel_base : num 88.6 94.5 99.8 99.4 99.8 ...
## $ length : num 169 171 177 177 177 ...
## $ width : num 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 64.8 ...
## $ height : num 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 54.3 ...
## $ curb_weight : int 2548 2823 2337 2824 2507 2844 2954 3086 3053 2395 ...
## $ engine_type : chr "dohc" "ohcv" "ohc" "ohc" ...
## $ num_cylinders : chr "four" "six" "four" "five" ...
## $ engine_size : int 130 152 109 136 136 136 136 131 131 108 ...
## $ fuel_system : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
## $ bore : chr "3.47" "2.68" "3.19" "3.19" ...
## $ stroke : chr "2.68" "3.47" "3.40" "3.40" ...
## $ compression : num 9 9 10 8 8.5 8.5 8.5 8.3 7 8.8 ...
## $ horsepower : chr "111" "154" "102" "115" ...
## $ peak_rpm : chr "5000" "5000" "5500" "5500" ...
## $ city_mpg : int 21 19 24 18 19 19 19 17 16 23 ...
## $ highway_mpg : int 27 26 30 22 25 25 25 20 22 29 ...
## $ price : chr "16500" "16500" "13950" "17450" ...
Based on our findings, we see the following:
numeric: symboling, wheel_base, length, width, height, curb_weight, engine_size, compression, city_mpg, highway_mpg
character: normalized_losses, make, fuel_type, aspiration, num_doors, body_style, drive_wheels, engine_location, engine_type, num_cylinders, fuel_system, bore, stroke, horsepower, peak_rpm, price
Looking at this, several of the character columns (normalized_losses, bore, stroke, horsepower, peak_rpm, price) are really numeric but were read as character because '?' marks missing values, so we need to clean things up: coerce those variables to numeric and build a numeric-only dataframe.
cars <- cars %>%
mutate(
price = as.numeric(price),
normalized_losses = as.numeric(normalized_losses),
bore = as.numeric(bore),
stroke = as.numeric(stroke),
horsepower = as.numeric(horsepower),
peak_rpm= as.numeric(peak_rpm)
)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
numeric_only = cars %>%
select(-make, -fuel_type, -aspiration, -num_doors, -body_style, -drive_wheels, -engine_location, -engine_type, -fuel_system, -num_cylinders)
na_counts = numeric_only %>% is.na() %>% colSums()
Now that we have a numeric-only dataset, we need to handle its missing entries.
There are two approaches that we can use.
# COMPLETE-CASES:
cars_numeric_only_1 = numeric_only %>%
filter(!is.na(price)) %>%
filter(!is.na(normalized_losses)) %>%
filter(!is.na(bore)) %>%
filter(!is.na(stroke)) %>%
filter(!is.na(horsepower)) %>%
filter(!is.na(peak_rpm))
head(cars_numeric_only_1)
## symboling normalized_losses wheel_base length width height curb_weight
## 1 2 164 99.8 176.6 66.2 54.3 2337
## 2 2 164 99.4 176.6 66.4 54.3 2824
## 3 1 158 105.8 192.7 71.4 55.7 2844
## 4 1 158 105.8 192.7 71.4 55.9 3086
## 5 2 192 101.2 176.8 64.8 54.3 2395
## 6 0 192 101.2 176.8 64.8 54.3 2395
## engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1 109 3.19 3.4 10.0 102 5500 24 30
## 2 136 3.19 3.4 8.0 115 5500 18 22
## 3 136 3.19 3.4 8.5 110 5500 19 25
## 4 131 3.13 3.4 8.3 140 5500 17 20
## 5 108 3.50 2.8 8.8 101 5800 23 29
## 6 108 3.50 2.8 8.8 101 5800 23 29
## price
## 1 13950
## 2 17450
## 3 17710
## 4 23875
## 5 16430
## 6 16925
Looking at this first approach, we see that we would drop roughly 22% of the original dataset.
Dropping that many rows reduces the statistical power of our future predictive model and risks introducing bias if the values are not missing completely at random.
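A quick way to verify that figure (a sketch, not run in the original):
1 - nrow(cars_numeric_only_1) / nrow(numeric_only) # proportion of rows lost to listwise deletion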
# IMPUTATION:
# Assessment of rows with missing values in the dataset
na_logical = is.na(numeric_only)
na.dataframe = data.frame(is.na(numeric_only))
na.dataframe = na.dataframe %>% mutate(total_na = rowSums(.[1:16])) # NAs per row, across all 16 columns
na_counts_pct = na_counts * 100 / nrow(na_logical) # per-column missingness as a percentage of rows
na.df = data.frame(na_counts = na_counts, na_pct = na_counts_pct)
na.df = data.frame(t(na.df)) # transpose: variables as columns, count/percentage as rows
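To actually render this missingness table in the knitted report (the original computes na.df but never prints it), knitr's kable() will do:
kable(na.df, digits = 1) # one column per variable; rows are NA count and NA percentage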
Looking at our findings, by far the largest share of missing values comes from normalized_losses, whilst the remaining variables each have missing values constituting only 1%-2% of rows.
Since we are ultimately modelling price, we can first remove the rows with missing prices.
update_numeric_only = numeric_only %>% filter(!is.na(price))
As to how to handle the missing values of the rest, let’s look at the summary statistics of these variables.
mean(update_numeric_only$bore, na.rm = TRUE)
## [1] 3.33
sd(update_numeric_only$bore, na.rm = TRUE)
## [1] 0.2713027
median(update_numeric_only$bore, na.rm = TRUE)
## [1] 3.31
# Since the mean and median are similar, the distribution is likely roughly symmetric, so imputing with the mean should be acceptable.
mean(update_numeric_only$stroke, na.rm = TRUE)
## [1] 3.259847
sd(update_numeric_only$stroke, na.rm = TRUE)
## [1] 0.3173827
median(update_numeric_only$stroke, na.rm = TRUE)
## [1] 3.29
# Since the mean and median are similar, the distribution is likely roughly symmetric, so imputing with the mean should be acceptable.
mean(update_numeric_only$peak_rpm, na.rm = TRUE)
## [1] 5118.182
sd(update_numeric_only$peak_rpm, na.rm = TRUE)
## [1] 481.6667
median(update_numeric_only$peak_rpm, na.rm = TRUE)
## [1] 5200
# Since the mean and median are similar, the distribution is likely roughly symmetric, so imputing with the mean should be acceptable.
mean(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 103.3586
sd(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 37.64512
median(update_numeric_only$horsepower, na.rm = TRUE)
## [1] 95
ggplot(data = update_numeric_only,
aes(x = horsepower)) +
geom_histogram(bins = 20)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
# The distribution of horsepower appears right-skewed,
# as corroborated by the sizeable gap between its median and mean.
# Given this, we will fill missing values with the median, the measure of central tendency most robust to outliers.
mean(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 122
sd(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 35.44217
median(update_numeric_only$normalized_losses, na.rm = TRUE)
## [1] 115
ggplot(data = update_numeric_only,
aes(x = normalized_losses)) +
geom_histogram(bins = 20)
## Warning: Removed 36 rows containing non-finite values (stat_bin).
# The distribution of normalized_losses likewise appears right-skewed,
# again corroborated by the gap between its median and mean.
# As such, we will fill missing values with the median rather than the mean, as it's the more robust measure of central tendency here.
update_numeric_only = update_numeric_only %>%
mutate(
bore = ifelse(is.na(bore), round(mean(bore, na.rm = TRUE), 2), bore),
stroke = ifelse(is.na(stroke), round(mean(stroke, na.rm = TRUE), 2), stroke),
peak_rpm = ifelse(is.na(peak_rpm), round(mean(peak_rpm, na.rm = TRUE), 2), peak_rpm),
horsepower = ifelse(is.na(horsepower), median(horsepower, na.rm = TRUE), horsepower),
normalized_losses = ifelse(is.na(normalized_losses), median(normalized_losses, na.rm = TRUE), normalized_losses)
)
clean_cars_numeric = update_numeric_only
head(update_numeric_only)
## symboling normalized_losses wheel_base length width height curb_weight
## 1 3 115 88.6 168.8 64.1 48.8 2548
## 2 1 115 94.5 171.2 65.5 52.4 2823
## 3 2 164 99.8 176.6 66.2 54.3 2337
## 4 2 164 99.4 176.6 66.4 54.3 2824
## 5 2 115 99.8 177.3 66.3 53.1 2507
## 6 1 158 105.8 192.7 71.4 55.7 2844
## engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1 130 3.47 2.68 9.0 111 5000 21 27
## 2 152 2.68 3.47 9.0 154 5000 19 26
## 3 109 3.19 3.40 10.0 102 5500 24 30
## 4 136 3.19 3.40 8.0 115 5500 18 22
## 5 136 3.19 3.40 8.5 110 5500 19 25
## 6 136 3.19 3.40 8.5 110 5500 19 25
## price
## 1 16500
## 2 16500
## 3 13950
## 4 17450
## 5 15250
## 6 17710
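As a quick sanity check (not shown in the original output), we can confirm that no missing values remain after imputation:
colSums(is.na(update_numeric_only)) # every entry should be zero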
For the purpose of building our predictive model, we will proceed with the imputed dataframe.
However, we will also look at the complete-cases-only dataset in the extra section.
featurePlot(clean_cars_numeric$horsepower, clean_cars_numeric$price, labels = c('horsepower','price'))
# There appears to be some evidence of a positive association b/t horsepower and sale price of cars
featurePlot(clean_cars_numeric$peak_rpm, clean_cars_numeric$price, labels = c('peak RPM','price'))
# There appears to be no association b/t peak RPM and sale price of cars
featurePlot(clean_cars_numeric$city_mpg, clean_cars_numeric$price, labels = c('city MPG','price'))
# There appears to be a negative association between city fuel consumption and sale price of cars
featurePlot(clean_cars_numeric$highway_mpg, clean_cars_numeric$price, labels = c('highway MPG','price'))
# There appears to be a negative association between highway fuel consumption and sale price of cars
featurePlot(clean_cars_numeric$compression, clean_cars_numeric$price, labels = c('compression','price'))
# Compression splits into two clusters (roughly 7-11 and 21-23), with no clear association with sale price
featurePlot(clean_cars_numeric$stroke, clean_cars_numeric$price, labels = c('stroke','price'))
# There appears to be no association b/t stroke and sale price of cars
featurePlot(clean_cars_numeric$bore, clean_cars_numeric$price, labels = c('bore','price'))
# There appears to be a very weak positive trend b/t bore and sale price of cars
featurePlot(clean_cars_numeric$engine_size, clean_cars_numeric$price, labels = c('engine size','price'))
# There is a positive relationship b/t increasing engine size and sale price of cars
featurePlot(clean_cars_numeric$curb_weight, clean_cars_numeric$price, labels = c('curb_weight','price'))
# There appears to be somewhat of a positive relationship (fairly linear) b/t curb weight of the vehicle and sale price
featurePlot(clean_cars_numeric$height, clean_cars_numeric$price, labels = c('height','price'))
# There doesn't appear to be a linear relationship at all.
featurePlot(clean_cars_numeric$length, clean_cars_numeric$price, labels = c('length','price'))
# There seems to be a somewhat poor positive relationship b/t length of car and its sale price
featurePlot(clean_cars_numeric$wheel_base, clean_cars_numeric$price, labels = c('wheel base','price'))
# There appears to be a very poor positive relationship (bordering no relationship) b/t wheelbase of the car and its sale price
featurePlot(clean_cars_numeric$normalized_losses, clean_cars_numeric$price, labels = c('normalized losses','price'))
# There appears to be no relationship b/t sale price and normalized losses
featurePlot(clean_cars_numeric$symboling, clean_cars_numeric$price, labels = c('symboling','price'))
# There appears to be no relationship b/t sale price and symboling
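As an aside, caret's featurePlot() can draw all of these scatter panels in a single call rather than one at a time; a sketch, using the lattice backend loaded earlier:
featurePlot(x = clean_cars_numeric %>% dplyr::select(-price),
            y = clean_cars_numeric$price,
            plot = 'scatter')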
ggplot(data = clean_cars_numeric,
aes(x = price)) +
geom_histogram(bins = 10)
mean(clean_cars_numeric$price) # 13205.69
## [1] 13205.69
sd(clean_cars_numeric$price) # 7966.98
## [1] 7966.983
median(clean_cars_numeric$price) # 10270
## [1] 10270
max(clean_cars_numeric$price) # 45400
## [1] 45400
2*sd(clean_cars_numeric$price) + mean(clean_cars_numeric$price) # 29139.66 (taken as the cut-off above which car prices exceed expectation)
## [1] 29139.66
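As a quick check (not in the original output), we can count the cars priced at or above this cut-off:
sum(clean_cars_numeric$price >= mean(clean_cars_numeric$price) + 2*sd(clean_cars_numeric$price)) # expect 14, per the discussion below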
Based on these measures of central tendency, the distribution of car sale prices appears right-skewed.
If we consider the expected price range to be mean +/- 2 SD of a normal distribution, we find 14 cars priced at or above the upper bound.
Let's examine these outliers in comparison to cars within the expected price range.
pricy_cars <- clean_cars_numeric %>%
mutate(outliers = ifelse(price >= mean(price) + 2*sd(price), "outlier", "not")) # cut-off = 29139.66, computed above
head(pricy_cars) # preview the new outlier flag (the original printed all 200 rows)
##   symboling normalized_losses wheel_base length width height curb_weight
## 1         3               115       88.6  168.8  64.1   48.8        2548
## 2         1               115       94.5  171.2  65.5   52.4        2823
## 3         2               164       99.8  176.6  66.2   54.3        2337
## 4         2               164       99.4  176.6  66.4   54.3        2824
## 5         2               115       99.8  177.3  66.3   53.1        2507
## 6         1               158      105.8  192.7  71.4   55.7        2844
##   engine_size bore stroke compression horsepower peak_rpm city_mpg highway_mpg
## 1         130 3.47   2.68         9.0        111     5000       21          27
## 2         152 2.68   3.47         9.0        154     5000       19          26
## 3         109 3.19   3.40        10.0        102     5500       24          30
## 4         136 3.19   3.40         8.0        115     5500       18          22
## 5         136 3.19   3.40         8.5        110     5500       19          25
## 6         136 3.19   3.40         8.5        110     5500       19          25
##   price outliers
## 1 16500      not
## 2 16500      not
## 3 13950      not
## 4 17450      not
## 5 15250      not
## 6 17710      not
summarization = pricy_cars %>%
select(outliers, price, highway_mpg, city_mpg, peak_rpm, horsepower, engine_size, curb_weight) %>%
group_by(outliers) %>%
summarize(
mean_price = mean(price), sd_price = sd(price), median_price = median(price),
mean_highway = mean(highway_mpg), sd_highway = sd(highway_mpg), median_highway = median(highway_mpg),
mean_city = mean(city_mpg), sd_city = sd(city_mpg), median_city = median(city_mpg),
mean_rpm = mean(peak_rpm), sd_rpm = sd(peak_rpm), median_rpm = median(peak_rpm),
mean_horsepower = mean(horsepower), sd_horsepower = sd(horsepower), median_horsepower = median(horsepower),
mean_engine = mean(engine_size), sd_engine = sd(engine_size), median_engine = median(engine_size),
mean_weight = mean(curb_weight), sd_weight = sd(curb_weight), median_weight = median(curb_weight)
)
summarization
## # A tibble: 2 × 22
## outliers mean_price sd_price median_…¹ mean_…² sd_hi…³ media…⁴ mean_…⁵ sd_city
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 not 11492. 4991. 9960. 31.5 6.39 31 25.9 6.10
## 2 outlier 35967. 4153. 35303 20.5 3.46 19.5 15.9 2.13
## # … with 13 more variables: median_city <dbl>, mean_rpm <dbl>, sd_rpm <dbl>,
## # median_rpm <dbl>, mean_horsepower <dbl>, sd_horsepower <dbl>,
## # median_horsepower <dbl>, mean_engine <dbl>, sd_engine <dbl>,
## # median_engine <dbl>, mean_weight <dbl>, sd_weight <dbl>,
## # median_weight <dbl>, and abbreviated variable names ¹median_price,
## # ²mean_highway, ³sd_highway, ⁴median_highway, ⁵mean_city
From the summary, these outlier vehicles were found to have markedly lower city and highway fuel economy than the non-outliers, alongside much higher mean and median prices.
However, taking a closer look at these vehicles, it may simply be that luxury-brand vehicles (judging by the automakers' names) command a greater pricing premium.
As these characteristics plausibly drive sale price, and their trends are in line with those of the non-outlier vehicles, we will keep these outliers in our dataset.
For our purposes, balancing variance against inherent bias, we will use an 80% training to 20% testing split.
set.seed(1) # For the sake of reproducibility
train_indices = createDataPartition(y = clean_cars_numeric[['price']],
p = 0.8,
list = FALSE)
training_listings = clean_cars_numeric[train_indices, ] # Should be about 161
testing_listings = clean_cars_numeric[-train_indices,] # Should be about 40
For this process, we will use a grid search to tune the model's hyperparameters.
Since our intention is to use the KNN approach, there is only one hyperparameter (i.e. k), for which we will search over a range of values.
cv_folds = 15 # approximately sqrt(n), where n = 205
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
This will be the longest part, as it involves a series of experiments to figure out the best approach to predicting car price. There are several ways to go about it.
Using multiple resources that have explored the topic of vehicle prices (both new and used), as listed here, here, here, here, we see that several factors within our dataframe may be ideal candidate predictors: 1) city and highway mileage and 2) engine size.
Looking past this, we can also speculate that horsepower and peak RPM (our closest stand-in for torque) may play a role, considering their impact on fuel consumption.
Furthermore, car size may also play a role in vehicle pricing, given its relationship to insurance rates as noted here, the changing landscape of the marketplace as noted here, and the influence of personal economics on car-purchasing habits as noted here.
Thus, we will set up several different models to compare:
A model containing city MPG and highway MPG
A model containing city MPG, highway MPG and engine size
A model containing city MPG, highway MPG, engine size and horsepower
A model containing city MPG, highway MPG, engine size, horsepower and peak RPM
A model containing city MPG, highway MPG, engine size, horsepower, peak RPM and curb weight
A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight and length
A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight, length and width
A model containing city MPG, highway MPG, engine size, horsepower, peak RPM, curb weight, length, width and height
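The eight train() calls below are near-identical; as a more concise alternative, they could be generated programmatically with purrr (already loaded). A sketch, where model_formulas, fit_knn and knn_models are names introduced here (not part of the original), and data = training_listings is used so the held-out rows stay unseen:
model_formulas <- list(
  price ~ city_mpg + highway_mpg,
  price ~ city_mpg + highway_mpg + engine_size
  # ...and so on through the eight formulas listed above
)
fit_knn <- function(f) {
  train(f, data = training_listings, method = 'knn',
        trControl = train_control,
        preProcess = c('center', 'scale'),
        tuneGrid = knn_grid)
}
knn_models <- map(model_formulas, fit_knn) # list of fitted caret models, one per formula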
# NOTE: these models are fit on the full clean_cars_numeric dataframe, so the
# held-out testing_listings rows are also seen during training (data leakage);
# fitting with data = training_listings would give a more honest test-set estimate.
knn_model_1 = train(price ~ city_mpg + highway_mpg,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_2 = train(price ~ city_mpg + highway_mpg + engine_size,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_3 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_4 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_5 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_6 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_7 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_8 = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height,
data = clean_cars_numeric,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
This is essentially the "throw it against the wall and see what sticks" approach, where we go through each variable and see which are best to add to our model to predict car prices.
Whilst we could add each variable individually and see what sticks (i.e. one model has price ~ city_mpg, another has price ~ horsepower, etc.), we can instead use stepwise regression to see which variables are retained for our dataset.
We will select the retained model based on AIC scores.
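For context, AIC trades goodness of fit against complexity: AIC = 2k - 2*ln(L), where k is the number of estimated parameters and L the maximized likelihood; lower is better. A minimal illustration with stats::AIC() (a sketch; m_small and m_large are hypothetical names, not part of the analysis):
m_small <- lm(price ~ engine_size, data = clean_cars_numeric)
m_large <- lm(price ~ engine_size + horsepower + curb_weight, data = clean_cars_numeric)
AIC(m_small, m_large) # the model with the lower AIC wins the complexity/fit trade-off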
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(leaps)
stepwise_model = train(price ~., # includes all variables within the dataframe
data = clean_cars_numeric,
trControl = train_control,
method = 'leapSeq',
preProcess = c('center', 'scale'),
tuneGrid = data.frame(nvmax = 1:15))
# nvmax is the tuning parameter here: the maximum number of predictors to be incorporated
# Another approach to using stepwise regression
stepwise_model_A = train(price ~.,
data = clean_cars_numeric,
method = 'lmStepAIC',
trControl = train_control,
trace = FALSE)
# Another approach to using bi-directional stepwise regression
linear_model_all <- lm(price ~., data = clean_cars_numeric)
stepwise_model_B <- stepAIC(linear_model_all, direction = "both", trace = FALSE)
stepwise_model_B
##
## Call:
## lm(formula = price ~ width + height + engine_size + stroke +
## compression + horsepower + peak_rpm, data = clean_cars_numeric)
##
## Coefficients:
## (Intercept) width height engine_size stroke compression
## -68275.619 698.961 187.970 112.213 -2729.523 272.478
## horsepower peak_rpm
## 51.880 2.314
Let’s see how these models turn out.
# The Educated-Guess Approach
testing = testing_listings %>%
mutate(
model_one_prediction = predict(knn_model_1, newdata = testing_listings),
model_two_prediction = predict(knn_model_2, newdata = testing_listings),
model_three_prediction = predict(knn_model_3, newdata = testing_listings),
model_four_prediction = predict(knn_model_4, newdata = testing_listings),
model_five_prediction = predict(knn_model_5, newdata = testing_listings),
model_six_prediction = predict(knn_model_6, newdata = testing_listings),
model_seven_prediction = predict(knn_model_7, newdata = testing_listings),
model_eight_prediction = predict(knn_model_8, newdata = testing_listings),
sq_error_model_one = (price - model_one_prediction)^2,
sq_error_model_two = (price - model_two_prediction)^2,
sq_error_model_three = (price - model_three_prediction)^2,
sq_error_model_four = (price - model_four_prediction)^2,
sq_error_model_five = (price - model_five_prediction)^2,
sq_error_model_six = (price - model_six_prediction)^2,
sq_error_model_seven = (price - model_seven_prediction)^2,
sq_error_model_eight = (price - model_eight_prediction)^2
)
long_testing = testing %>%
pivot_longer(
cols = sq_error_model_one:sq_error_model_eight,
names_to = 'model',
values_to = 'sq_error'
)
rmse_by_model = long_testing %>%
group_by(model) %>%
summarize(rmse = sqrt(mean(sq_error)))
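The original never displays this table; to inspect the per-model test RMSE, ordered best-first:
rmse_by_model %>% arrange(rmse)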
# For comparison, refit the three largest educated-guess predictor sets as plain
# linear models (note these too are fit on the full dataframe before being scored
# on the held-out test split).
summed_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight, data = clean_cars_numeric)
summed_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length, data = clean_cars_numeric)
summed_model_c = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height, data = clean_cars_numeric)
predictions_model_a <- predict(summed_model_a, newdata = testing_listings)
predictions_model_b <- predict(summed_model_b, newdata = testing_listings)
predictions_model_c <- predict(summed_model_c, newdata = testing_listings)
postResample(pred = predictions_model_a, obs = testing_listings$price) # RMSE = 2759.49
## RMSE Rsquared MAE
## 2759.4948088 0.8652066 1962.9642552
postResample(pred = predictions_model_b, obs = testing_listings$price) # RMSE = 2752.39
## RMSE Rsquared MAE
## 2752.3901296 0.8659432 1950.4532659
postResample(pred = predictions_model_c, obs = testing_listings$price) # RMSE = 2563.86
## RMSE Rsquared MAE
## 2563.864202 0.885678 1845.171601
Looking at the above findings, the top three models (based on RMSE) were those whose formulas we refit as linear models above.
In particular, looking at the RMSE and R-squared of each model, Model 8 had the highest score, predicting 82.05% of the variance in car prices.
# Stepwise Regression Approach
stepwise_model
## Linear Regression with Stepwise Selection
##
## 200 samples
## 15 predictor
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (15 fold)
## Summary of sample sizes: 187, 188, 185, 188, 186, 187, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 3910.091 0.7190993 2859.375
## 2 3846.532 0.7568909 2691.482
## 3 3517.138 0.7924893 2589.376
## 4 3500.449 0.8034444 2575.584
## 5 3499.347 0.8024931 2572.064
## 6 3549.923 0.7939102 2701.091
## 7 3835.687 0.7742852 2667.338
## 8 3420.309 0.8206737 2516.168
## 9 3484.930 0.8152225 2609.172
## 10 3428.751 0.8198426 2495.174
## 11 3504.878 0.8119616 2584.587
## 12 3362.435 0.8289794 2463.348
## 13 3368.405 0.8290710 2484.589
## 14 3365.325 0.8313338 2495.912
## 15 3359.034 0.8310511 2502.934
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 15.
stepwise_model$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 3910.091 0.7190993 2859.375 679.0218 0.20996012 494.5426
## 2 2 3846.532 0.7568909 2691.482 761.6904 0.14843640 452.8594
## 3 3 3517.138 0.7924893 2589.376 777.0529 0.12660326 594.6951
## 4 4 3500.449 0.8034444 2575.584 836.7178 0.10721483 580.0019
## 5 5 3499.347 0.8024931 2572.064 906.8095 0.10827541 584.3168
## 6 6 3549.923 0.7939102 2701.091 1184.5210 0.11898552 788.4440
## 7 7 3835.687 0.7742852 2667.338 1227.4745 0.13325168 743.3132
## 8 8 3420.309 0.8206737 2516.168 919.6046 0.10047677 654.6060
## 9 9 3484.930 0.8152225 2609.172 929.3991 0.10092127 675.9808
## 10 10 3428.751 0.8198426 2495.174 808.8460 0.10572377 498.7029
## 11 11 3504.878 0.8119616 2584.587 1016.4914 0.11729074 621.1184
## 12 12 3362.435 0.8289794 2463.348 905.4385 0.10026686 602.0992
## 13 13 3368.405 0.8290710 2484.589 942.5705 0.09984006 677.1885
## 14 14 3365.325 0.8313338 2495.912 953.0371 0.09811661 676.0982
## 15 15 3359.034 0.8310511 2502.934 921.9604 0.10105924 651.0425
summary(stepwise_model$finalModel)
## Subset selection object
## 15 Variables (and intercept)
## Forced in Forced out
## symboling FALSE FALSE
## normalized_losses FALSE FALSE
## wheel_base FALSE FALSE
## length FALSE FALSE
## width FALSE FALSE
## height FALSE FALSE
## curb_weight FALSE FALSE
## engine_size FALSE FALSE
## bore FALSE FALSE
## stroke FALSE FALSE
## compression FALSE FALSE
## horsepower FALSE FALSE
## peak_rpm FALSE FALSE
## city_mpg FALSE FALSE
## highway_mpg FALSE FALSE
## 1 subsets of each size up to 15
## Selection Algorithm: 'sequential replacement'
## symboling normalized_losses wheel_base length width height
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " "*" " "
## 7 ( 1 ) " " " " " " " " "*" "*"
## 8 ( 1 ) " " " " " " " " "*" "*"
## 9 ( 1 ) " " " " " " "*" "*" "*"
## 10 ( 1 ) " " " " " " "*" "*" "*"
## 11 ( 1 ) " " " " "*" "*" "*" "*"
## 12 ( 1 ) "*" " " "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*"
## curb_weight engine_size bore stroke compression horsepower peak_rpm
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " "*"
## 4 ( 1 ) "*" "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" "*" " " "*" "*" " " "*"
## 6 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 7 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 9 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 11 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 12 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## city_mpg highway_mpg
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) "*" " "
## 9 ( 1 ) "*" " "
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) " " " "
## 14 ( 1 ) "*" " "
## 15 ( 1 ) "*" "*"
summarized_model = lm(price ~ city_mpg + peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)
summary(summarized_model)
##
## Call:
## lm(formula = price ~ city_mpg + peak_rpm + horsepower + compression +
## stroke + engine_size + height + width, data = clean_cars_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10996.2 -1684.7 3.9 1578.1 13813.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.668e+04 1.433e+04 -3.955 0.000108 ***
## city_mpg -9.956e+01 7.581e+01 -1.313 0.190641
## peak_rpm 2.337e+00 6.385e-01 3.660 0.000326 ***
## horsepower 3.947e+01 1.681e+01 2.348 0.019914 *
## compression 3.159e+02 7.577e+01 4.169 4.64e-05 ***
## stroke -2.742e+03 7.721e+02 -3.551 0.000483 ***
## engine_size 1.156e+02 1.346e+01 8.593 3.00e-15 ***
## height 1.645e+02 1.098e+02 1.498 0.135733
## width 5.852e+02 1.978e+02 2.959 0.003476 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3209 on 191 degrees of freedom
## Multiple R-squared: 0.8443, Adjusted R-squared: 0.8378
## F-statistic: 129.5 on 8 and 191 DF, p-value: < 2.2e-16
prediction_summarized_model_a = predict(summarized_model, newdata = testing_listings)
postResample(pred = prediction_summarized_model_a, obs = testing_listings$price) # RMSE = 2546.08
## RMSE Rsquared MAE
## 2546.0779941 0.8878266 1854.2849324
Within the course of identifying the number of predictors to include in the model, resampling selected nvmax = 15 by smallest RMSE, but the 8-variable subset performs nearly as well (RMSE 3420 vs 3359), so we refit it above.
These 8 variables are: width, height, engine size, stroke, compression, horsepower, peak RPM and city fuel consumption.
Looking at this model, it explains ~83.78% of the variance in car prices (adjusted R-squared).
# Stepwise Regression Approach - Option B
stepwise_model_A
## Linear Regression with Stepwise Selection
##
## 200 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (15 fold)
## Summary of sample sizes: 186, 187, 185, 187, 188, 187, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3449.37 0.8121083 2561.163
stepwise_model_A$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 3449.37 0.8121083 2561.163 802.9063 0.08966347 427.2272
stepwise_model_A$finalModel
##
## Call:
## lm(formula = .outcome ~ width + height + engine_size + stroke +
## compression + horsepower + peak_rpm, data = dat)
##
## Coefficients:
## (Intercept) width height engine_size stroke compression
## -68275.619 698.961 187.970 112.213 -2729.523 272.478
## horsepower peak_rpm
## 51.880 2.314
summary(stepwise_model_A$finalModel)
##
## Call:
## lm(formula = .outcome ~ width + height + engine_size + stroke +
## compression + horsepower + peak_rpm, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11400.0 -1666.3 -17.2 1499.1 13720.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.828e+04 1.131e+04 -6.036 8.01e-09 ***
## width 6.990e+02 1.781e+02 3.924 0.000121 ***
## height 1.880e+02 1.086e+02 1.732 0.084954 .
## engine_size 1.122e+02 1.323e+01 8.483 5.81e-15 ***
## stroke -2.730e+03 7.735e+02 -3.529 0.000523 ***
## compression 2.725e+02 6.832e+01 3.988 9.46e-05 ***
## horsepower 5.188e+01 1.393e+01 3.723 0.000259 ***
## peak_rpm 2.314e+00 6.395e-01 3.618 0.000379 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared: 0.8429, Adjusted R-squared: 0.8372
## F-statistic: 147.2 on 7 and 192 DF, p-value: < 2.2e-16
summarize_model_A = lm(price ~ peak_rpm + horsepower + compression + stroke + engine_size + height + width, data = clean_cars_numeric)
summary(summarize_model_A)
##
## Call:
## lm(formula = price ~ peak_rpm + horsepower + compression + stroke +
## engine_size + height + width, data = clean_cars_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11400.0 -1666.3 -17.2 1499.1 13720.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.828e+04 1.131e+04 -6.036 8.01e-09 ***
## peak_rpm 2.314e+00 6.395e-01 3.618 0.000379 ***
## horsepower 5.188e+01 1.393e+01 3.723 0.000259 ***
## compression 2.725e+02 6.832e+01 3.988 9.46e-05 ***
## stroke -2.730e+03 7.735e+02 -3.529 0.000523 ***
## engine_size 1.122e+02 1.323e+01 8.483 5.81e-15 ***
## height 1.880e+02 1.086e+02 1.732 0.084954 .
## width 6.990e+02 1.781e+02 3.924 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared: 0.8429, Adjusted R-squared: 0.8372
## F-statistic: 147.2 on 7 and 192 DF, p-value: < 2.2e-16
prediction_summarize_model_a = predict(summarize_model_A, newdata = testing_listings)
postResample(pred = prediction_summarize_model_a, obs = testing_listings$price) # RMSE = 2523.84
## RMSE Rsquared MAE
## 2523.8409698 0.8904233 1851.9866798
# Stepwise Regression Approach - Option C
stepwise_model_B
##
## Call:
## lm(formula = price ~ width + height + engine_size + stroke +
## compression + horsepower + peak_rpm, data = clean_cars_numeric)
##
## Coefficients:
## (Intercept) width height engine_size stroke compression
## -68275.619 698.961 187.970 112.213 -2729.523 272.478
## horsepower peak_rpm
## 51.880 2.314
summary(stepwise_model_B)
##
## Call:
## lm(formula = price ~ width + height + engine_size + stroke +
## compression + horsepower + peak_rpm, data = clean_cars_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11400.0 -1666.3 -17.2 1499.1 13720.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.828e+04 1.131e+04 -6.036 8.01e-09 ***
## width 6.990e+02 1.781e+02 3.924 0.000121 ***
## height 1.880e+02 1.086e+02 1.732 0.084954 .
## engine_size 1.122e+02 1.323e+01 8.483 5.81e-15 ***
## stroke -2.730e+03 7.735e+02 -3.529 0.000523 ***
## compression 2.725e+02 6.832e+01 3.988 9.46e-05 ***
## horsepower 5.188e+01 1.393e+01 3.723 0.000259 ***
## peak_rpm 2.314e+00 6.395e-01 3.618 0.000379 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3215 on 192 degrees of freedom
## Multiple R-squared: 0.8429, Adjusted R-squared: 0.8372
## F-statistic: 147.2 on 7 and 192 DF, p-value: < 2.2e-16
prediction_stepwise_model_B = predict(stepwise_model_B, newdata = testing_listings)
postResample(pred = prediction_stepwise_model_B, obs = testing_listings$price) # RMSE = 2523.84
## RMSE Rsquared MAE
## 2523.8409698 0.8904233 1851.9866798
Within the course of identifying the number of predictors to include in this model, the best model retained 7 variables.
These 7 variables are: width, height, engine size, stroke, compression, horsepower and peak RPM.
Looking at this model, it explains ~83.72% of the variance in car prices (adjusted R-squared).
NOTE: interestingly, whilst these last two stepwise models performed better than the educated-guess approach, including city fuel consumption as a predictor made the test-set predictions marginally worse (RMSE 2546.08 with it vs 2523.84 without).
Overall, looking at the car characteristics that influence car prices, the major players in our models are: car width, car height, engine size, engine stroke, engine compression, horsepower, peak RPM and city fuel consumption. It should be noted that further analysis should take into consideration variables that are categorical/nominal in nature, such as branding, body type, etc.
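As a starting point for that follow-up (a sketch, not part of the original analysis), caret's dummyVars() can one-hot encode the categorical columns; cat_vars, dummies and cars_encoded are names introduced here:
cat_vars <- cars %>%
  dplyr::select(make, fuel_type, body_style, drive_wheels) %>%
  mutate(across(everything(), as.factor)) # dummyVars expects factors
dummies <- dummyVars(~ ., data = cat_vars)
cars_encoded <- data.frame(predict(dummies, newdata = cat_vars)) # one indicator column per level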
train_indicesA = createDataPartition(y = cars_numeric_only_1[['price']],
p = 0.8,
list = FALSE)
cv_folds = 13 # approximately sqrt(n), where n = 160
knn_grid = expand.grid(k = 1:100)
train_control = trainControl(method = 'cv', number = cv_folds)
testing_listingsA = cars_numeric_only_1[-train_indicesA,] # should be about 32
training_listingsA = cars_numeric_only_1[train_indicesA, ] # should be about 128
# NB: these KNN fits use the full cars_numeric_only_1 data rather than
# training_listingsA, so the held-out rows scored below are not strictly unseen.
knn_model_1A = train(price ~ city_mpg + highway_mpg,
                     data = cars_numeric_only_1,
                     method = 'knn',
                     trControl = train_control,
                     preProcess = c('center', 'scale'),
                     tuneGrid = knn_grid)
knn_model_2A = train(price ~ city_mpg + highway_mpg + engine_size,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_3A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_4A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_5A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_6A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_7A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
knn_model_8A = train(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height,
data = cars_numeric_only_1,
method = 'knn',
trControl = train_control,
preProcess = c('center', 'scale'),
tuneGrid = knn_grid)
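The eight train() calls above differ only in their formulas; a more compact equivalent (an added sketch using the same objects, returning a list of fits rather than eight named variables) builds the nested predictor sets once and loops with purrr::map():
# Each successive model adds one more predictor to the previous set.
vars = c('city_mpg', 'highway_mpg', 'engine_size', 'horsepower',
         'peak_rpm', 'curb_weight', 'length', 'width', 'height')
knn_formulas = map(2:length(vars), ~ reformulate(vars[1:.x], response = 'price'))
knn_modelsA = map(knn_formulas, ~ train(.x,
                                        data = cars_numeric_only_1,
                                        method = 'knn',
                                        trControl = train_control,
                                        preProcess = c('center', 'scale'),
                                        tuneGrid = knn_grid))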
testingA = testing_listingsA %>%
mutate(
model_one_prediction = predict(knn_model_1A, newdata = testing_listingsA),
model_two_prediction = predict(knn_model_2A, newdata = testing_listingsA),
model_three_prediction = predict(knn_model_3A, newdata = testing_listingsA),
model_four_prediction = predict(knn_model_4A, newdata = testing_listingsA),
model_five_prediction = predict(knn_model_5A, newdata = testing_listingsA),
model_six_prediction = predict(knn_model_6A, newdata = testing_listingsA),
model_seven_prediction = predict(knn_model_7A, newdata = testing_listingsA),
model_eight_prediction = predict(knn_model_8A, newdata = testing_listingsA),
sq_error_model_one = (price - model_one_prediction)^2,
sq_error_model_two = (price - model_two_prediction)^2,
sq_error_model_three = (price - model_three_prediction)^2,
sq_error_model_four = (price - model_four_prediction)^2,
sq_error_model_five = (price - model_five_prediction)^2,
sq_error_model_six = (price - model_six_prediction)^2,
sq_error_model_seven = (price - model_seven_prediction)^2,
sq_error_model_eight = (price - model_eight_prediction)^2
)
long_testingA = testingA %>%
pivot_longer(
cols = sq_error_model_one:sq_error_model_eight,
names_to = 'model',
values_to = 'sq_error'
)
rmse_by_modelA = long_testingA %>%
group_by(model) %>%
summarize(rmse = sqrt(mean(sq_error)))
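The resulting RMSE table is never printed above; assuming the objects just built, a short line such as the following (an added illustration) would rank the eight models:
rmse_by_modelA %>% arrange(rmse) # lowest held-out RMSE first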
projected_model_a = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length + width + height, data = cars_numeric_only_1)
summary(projected_model_a)
##
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower +
## peak_rpm + curb_weight + length + width + height, data = cars_numeric_only_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5520.5 -1294.0 -358.8 1153.5 7937.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.925e+04 1.454e+04 -4.762 4.48e-06 ***
## city_mpg 7.893e+01 1.477e+02 0.534 0.5939
## highway_mpg -3.457e+01 1.388e+02 -0.249 0.8037
## engine_size 3.578e+01 1.786e+01 2.004 0.0469 *
## horsepower 1.550e+01 1.628e+01 0.952 0.3425
## peak_rpm 8.516e-01 5.369e-01 1.586 0.1148
## curb_weight 6.700e+00 1.493e+00 4.487 1.43e-05 ***
## length -6.733e+01 4.307e+01 -1.563 0.1201
## width 9.491e+02 2.231e+02 4.255 3.67e-05 ***
## height 4.569e+01 1.244e+02 0.367 0.7139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2436 on 150 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8274
## F-statistic: 85.66 on 9 and 150 DF, p-value: < 2.2e-16
projected_model_b = lm(price ~ city_mpg + highway_mpg + engine_size + horsepower + peak_rpm + curb_weight + length, data = cars_numeric_only_1)
summary(projected_model_b)
##
## Call:
## lm(formula = price ~ city_mpg + highway_mpg + engine_size + horsepower +
## peak_rpm + curb_weight + length, data = cars_numeric_only_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7121.2 -1225.3 -57.8 1013.9 8107.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.173e+04 7.512e+03 -2.893 0.00438 **
## city_mpg 1.560e+02 1.539e+02 1.014 0.31235
## highway_mpg -9.831e+01 1.447e+02 -0.679 0.49796
## engine_size 4.234e+01 1.764e+01 2.401 0.01756 *
## horsepower 1.285e+01 1.664e+01 0.772 0.44122
## peak_rpm 9.262e-01 5.626e-01 1.646 0.10173
## curb_weight 8.557e+00 1.422e+00 6.016 1.28e-08 ***
## length 6.698e-01 3.834e+01 0.017 0.98609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2565 on 152 degrees of freedom
## Multiple R-squared: 0.8171, Adjusted R-squared: 0.8086
## F-statistic: 96.99 on 7 and 152 DF, p-value: < 2.2e-16
predictions_model_1 <- predict(projected_model_a, newdata = testing_listingsA)
predictions_model_2 <- predict(projected_model_b, newdata = testing_listingsA)
postResample(pred = predictions_model_1, obs = testing_listingsA$price) # RMSE = 2050.448
## RMSE Rsquared MAE
## 2050.4475753 0.8531872 1605.1406928
postResample(pred = predictions_model_2, obs = testing_listingsA$price) # RMSE = 1924.004
## RMSE Rsquared MAE
## 1924.0041129 0.8665596 1477.4063047
Looking at the above findings, the eighth model appeared to be the better-performing model based on RMSE and R-squared, with our model explaining ~ 82.74% of the variance in car prices using the complete-cases dataset.
In other words, this model suggests the full set of selected variables is a better measure of price than any smaller subset.
stepwise_model_A = train(price ~.,
data = cars_numeric_only_1,
trControl = train_control,
method = 'leapSeq',
preProcess = c('center', 'scale'),
tuneGrid = data.frame(nvmax = 1:15))
stepwise_model_A
## Linear Regression with Stepwise Selection
##
## 160 samples
## 15 predictor
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (13 fold)
## Summary of sample sizes: 148, 148, 147, 148, 148, 148, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 2579.634 0.7929105 1833.293
## 2 2610.867 0.7845960 1889.033
## 3 2747.253 0.7668762 2004.186
## 4 2799.128 0.7703185 1974.953
## 5 2846.507 0.7650220 2014.433
## 6 2921.272 0.7627612 2008.727
## 7 2791.904 0.7703027 1987.699
## 8 2732.024 0.7772226 1934.562
## 9 2731.164 0.7751851 1949.991
## 10 2622.885 0.7928392 1866.182
## 11 2613.598 0.7937393 1859.052
## 12 2624.003 0.7926376 1872.347
## 13 2603.542 0.7962381 1870.056
## 14 2624.413 0.7934172 1882.241
## 15 2627.498 0.7930156 1886.600
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 1.
stepwise_model_A$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 2579.634 0.7929105 1833.293 643.5565 0.1293629 404.9013
## 2 2 2610.867 0.7845960 1889.033 696.7756 0.1368709 475.3891
## 3 3 2747.253 0.7668762 2004.186 687.4821 0.1458420 478.5374
## 4 4 2799.128 0.7703185 1974.953 759.5961 0.1322827 413.9602
## 5 5 2846.507 0.7650220 2014.433 806.0254 0.1308851 447.9836
## 6 6 2921.272 0.7627612 2008.727 636.7451 0.1296065 363.1962
## 7 7 2791.904 0.7703027 1987.699 677.9448 0.1305660 428.5148
## 8 8 2732.024 0.7772226 1934.562 671.4970 0.1258543 405.6321
## 9 9 2731.164 0.7751851 1949.991 646.8468 0.1239406 420.2103
## 10 10 2622.885 0.7928392 1866.182 607.4639 0.1126485 397.7653
## 11 11 2613.598 0.7937393 1859.052 619.4499 0.1187760 395.7822
## 12 12 2624.003 0.7926376 1872.347 602.9018 0.1125727 395.9963
## 13 13 2603.542 0.7962381 1870.056 589.3961 0.1104432 380.4119
## 14 14 2624.413 0.7934172 1882.241 607.4069 0.1120866 387.4576
## 15 15 2627.498 0.7930156 1886.600 609.8408 0.1126217 387.6436
summary(stepwise_model_A$finalModel) # it appears only one predictor variable was retained
## Subset selection object
## 15 Variables (and intercept)
## Forced in Forced out
## symboling FALSE FALSE
## normalized_losses FALSE FALSE
## wheel_base FALSE FALSE
## length FALSE FALSE
## width FALSE FALSE
## height FALSE FALSE
## curb_weight FALSE FALSE
## engine_size FALSE FALSE
## bore FALSE FALSE
## stroke FALSE FALSE
## compression FALSE FALSE
## horsepower FALSE FALSE
## peak_rpm FALSE FALSE
## city_mpg FALSE FALSE
## highway_mpg FALSE FALSE
## 1 subsets of each size up to 1
## Selection Algorithm: 'sequential replacement'
## symboling normalized_losses wheel_base length width height curb_weight
## 1 ( 1 ) " " " " " " " " " " " " "*"
## engine_size bore stroke compression horsepower peak_rpm city_mpg
## 1 ( 1 ) " " " " " " " " " " " " " "
## highway_mpg
## 1 ( 1 ) " "
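As an aside, the retained subset can also be extracted programmatically: for the regsubsets object behind this caret fit, coef() takes the subset size through its id argument. A minimal added sketch; note that because of the preProcess step these coefficients sit on the centered/scaled predictor scale, which is one reason the model is refit with lm() on the raw data just below.
coef(stepwise_model_A$finalModel, id = 1) # intercept + curb_weight, in scaled units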
summarized_model_A1 = lm(price ~ curb_weight, data = cars_numeric_only_1)
summary(summarized_model_A1)
##
## Call:
## lm(formula = price ~ curb_weight, data = cars_numeric_only_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9739.8 -1447.2 -144.8 1108.9 10271.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15376.867 1090.042 -14.11 <2e-16 ***
## curb_weight 10.899 0.435 25.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2638 on 158 degrees of freedom
## Multiple R-squared: 0.7989, Adjusted R-squared: 0.7976
## F-statistic: 627.7 on 1 and 158 DF, p-value: < 2.2e-16
prediction_summarized_model_a1 = predict(summarized_model_A1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_a1, obs = testing_listingsA$price) # RMSE = 2055.530, R-squared = 0.8522
## RMSE Rsquared MAE
## 2055.5304142 0.8522134 1573.2709604
# Second method: stepwise selection by AIC (caret method 'lmStepAIC', which wraps MASS::stepAIC)
stepwise_model_B = train(price ~.,
data = cars_numeric_only_1,
method = 'lmStepAIC',
trControl = train_control,
trace = FALSE)
stepwise_model_B
## Linear Regression with Stepwise Selection
##
## 160 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (13 fold)
## Summary of sample sizes: 148, 148, 148, 146, 148, 148, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2657.779 0.8023394 1915.337
stepwise_model_B$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 2657.779 0.8023394 1915.337 761.703 0.1091877 562.7411
summary(stepwise_model_B$finalModel) #
##
## Call:
## lm(formula = .outcome ~ wheel_base + length + width + curb_weight +
## engine_size + bore + stroke + compression + horsepower +
## peak_rpm, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5805.7 -1294.3 -283.1 1099.2 7460.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.582e+04 1.201e+04 -4.646 7.37e-06 ***
## wheel_base 1.736e+02 8.593e+01 2.020 0.045177 *
## length -8.955e+01 4.258e+01 -2.103 0.037157 *
## width 7.865e+02 2.196e+02 3.581 0.000462 ***
## curb_weight 5.153e+00 1.420e+00 3.628 0.000391 ***
## engine_size 5.201e+01 1.753e+01 2.967 0.003507 **
## bore -2.036e+03 1.049e+03 -1.942 0.054036 .
## stroke -1.929e+03 7.579e+02 -2.545 0.011931 *
## compression 1.109e+02 6.467e+01 1.714 0.088563 .
## horsepower 2.711e+01 1.534e+01 1.768 0.079099 .
## peak_rpm 8.618e-01 5.468e-01 1.576 0.117134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2352 on 149 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8391
## F-statistic: 83.9 on 10 and 149 DF, p-value: < 2.2e-16
summarized_model_B1 = lm(price ~ wheel_base + length + width + curb_weight + engine_size + bore + stroke + compression + horsepower + peak_rpm, data = cars_numeric_only_1)
summary(summarized_model_B1)
##
## Call:
## lm(formula = price ~ wheel_base + length + width + curb_weight +
## engine_size + bore + stroke + compression + horsepower +
## peak_rpm, data = cars_numeric_only_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5805.7 -1294.3 -283.1 1099.2 7460.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.582e+04 1.201e+04 -4.646 7.37e-06 ***
## wheel_base 1.736e+02 8.593e+01 2.020 0.045177 *
## length -8.955e+01 4.258e+01 -2.103 0.037157 *
## width 7.865e+02 2.196e+02 3.581 0.000462 ***
## curb_weight 5.153e+00 1.420e+00 3.628 0.000391 ***
## engine_size 5.201e+01 1.753e+01 2.967 0.003507 **
## bore -2.036e+03 1.049e+03 -1.942 0.054036 .
## stroke -1.929e+03 7.579e+02 -2.545 0.011931 *
## compression 1.109e+02 6.467e+01 1.714 0.088563 .
## horsepower 2.711e+01 1.534e+01 1.768 0.079099 .
## peak_rpm 8.618e-01 5.468e-01 1.576 0.117134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2352 on 149 degrees of freedom
## Multiple R-squared: 0.8492, Adjusted R-squared: 0.8391
## F-statistic: 83.9 on 10 and 149 DF, p-value: < 2.2e-16
prediction_summarized_model_B1 = predict(summarized_model_B1, newdata = testing_listingsA)
postResample(pred = prediction_summarized_model_B1, obs = testing_listingsA$price) # RMSE = 2103.202, R-squared = 0.8441
## RMSE Rsquared MAE
## 2103.2021951 0.8440939 1617.9408960
Looking at the use of a stepwise regression analysis, it appeared that within the complete-cases analysis only curb_weight was retained as a significant predictor by the leapSeq approach, and that one-predictor model explained ~ 85.2% of the variance in held-out car prices.
In comparison to the previous method, within complete cases the KNN approach provided better model predictions of car prices on this dataset; a sketch collecting the linear-model metrics printed above follows.
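As a closing illustration (an added sketch, not original output), the held-out metrics from the postResample() calls above can be gathered into one table for a side-by-side view; the KNN fits are omitted because their test RMSEs were not printed above.
# Values transcribed from the postResample() outputs above.
model_comparison = tibble::tribble(
  ~model,                                     ~rmse,   ~rsq,
  'lm: nine predictors (projected_model_a)',  2050.45, 0.8532,
  'lm: seven predictors (projected_model_b)', 1924.00, 0.8666,
  'stepwise leapSeq: curb_weight only',       2055.53, 0.8522,
  'stepwise AIC: ten predictors',             2103.20, 0.8441
)
model_comparison %>% arrange(rmse) # lowest held-out RMSE first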