In this project we’ll explore how well the size of a condominium (measured in gross square feet) explains, or predicts, sale price in New York City. We will also explore how well the size of a condominium predicts sale price in each of the five boroughs of New York City: the Bronx, Brooklyn, Manhattan, Staten Island, and Queens.
Before we build linear regression models we will plot sale price versus gross square feet to see if the data exhibits any obvious visual patterns. Plotting the data will also allow us to visualize outliers, and we will investigate some of the outliers to determine if the data was recorded correctly. This property sales data is publicly available and contains sales records from a twelve-month period (November, 2018 through October, 2019).
The data used for this project originates from five separate Microsoft
Excel files, one for each borough in New York City. The data structure
is identical for all five files, which makes it possible to combine
all of the data into a single file. The programming steps below
outline the steps taken to load each dataset into R, combine the
datasets, format the data to facilitate ease of use, and export the
dataset as a csv file for later use. Because we are predicting sale
price on the basis of size, we deleted sale records with a
sale_price
less than $10,000 (we assumed these deals to
be between family members), and deleted
gross_square_feet
values of 0.
# Set `eval=FALSE` so that this code chunk is not run multiple times
# Load packages required for New York City property sales data linear modeling
library(readxl) # Load Excel files
library(magrittr) # Make all colnames lower case with no spaces
library(stringr) # String formatting and replacement
library(dplyr) # Data wrangling and manipulation
library(readr) # Load and write csv files
library(ggplot2) # Data visualization
library(tidyr) # Nesting and unnesting dataframes
# Data accessed November, 2019 from:
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
# Data Used for this Project is from November 2018 to October 2019
# Set the function argument skip = 4 within the read_excel() function call to skip header rows that are not relevant.
brooklyn <- read_excel("rollingsales_brooklyn_Oct2019.xls", skip = 4)
bronx <- read_excel("rollingsales_bronx_Oct2019.xls", skip = 4)
manhattan <- read_excel("rollingsales_manhattan_Oct2019.xls", skip = 4)
staten_island <- read_excel("rollingsales_statenisland_Oct2019.xls", skip = 4)
queens <- read_excel("rollingsales_queens_Oct2019.xls", skip = 4)
# Numeric codes for each borough - to be replaced with names
# Manhattan = 1
# Bronx = 2
# Brooklyn = 3
# Queens = 4
# Staten Island = 5
# Bind all dataframes into one
NYC_property_sales <- bind_rows(manhattan, bronx, brooklyn, queens, staten_island)
# Remove individual dataframes for each neighborhood
rm(brooklyn, bronx, manhattan, staten_island, queens)
# Replace borough number with borough name, for clarity
NYC_property_sales <- NYC_property_sales %>%
mutate(BOROUGH =
case_when(BOROUGH == 1 ~ "Manhattan",
BOROUGH == 2 ~ "Bronx",
BOROUGH == 3 ~ "Brooklyn",
BOROUGH == 4 ~ "Queens",
BOROUGH == 5 ~ "Staten Island"))
# Convert all colnames to lower case with no spaces (use underscores instead of spaces)
colnames(NYC_property_sales) %<>% str_replace_all("\\s", "_") %>% tolower()
# Convert CAPITALIZED columns to Title Case
NYC_property_sales <- NYC_property_sales %>%
mutate(neighborhood = str_to_title(neighborhood)) %>%
mutate(building_class_category =
str_to_title(building_class_category)) %>%
mutate(address = str_to_title(address))
NYC_property_sales <- NYC_property_sales %>%
# Drop ease-ment column that contains no data
select(-`ease-ment`) %>%
# Select only distinct observations (drop duplicates)
distinct()
NYC_property_sales <- NYC_property_sales %>%
# Filter out property exchanges between family members
# We assume here that the threshold is $10,000 US DOllars
filter(sale_price > 10000) %>%
# Remove observations with gross square footage of zero
# NOTE: We are only doing this here because we are analyzing condominium sales
# If analyzing single family homes, we would also consider "land_square_feet"
filter(gross_square_feet > 0) %>%
# Drop na values in columns of interest
drop_na(c(gross_square_feet, sale_price))
# Arrange observations alphabetically by borough and neighborhood
NYC_property_sales <- NYC_property_sales %>%
arrange(borough, neighborhood)
# Save results to csv file for future use
# The code below is commented-out to avoid accidental overwriting of the file later on
# write_csv(NYC_property_sales, "NYC_property_sales.csv")
The readr
package is loaded so that the csv file can be
read into R.
library(readr)
# Read in the CSV file we generated above
NYC_property_sales <- read_csv('NYC_property_sales.csv')
A first glimpse of the data reveals that there are currently over 38,000 sale records in the dataset.
library(dplyr) # Data wrangling and manipulation
glimpse(NYC_property_sales)
## Rows: 38,177
## Columns: 20
## $ borough <chr> "Bronx", "Bronx", "Bronx", "Bronx", "Br…
## $ neighborhood <chr> "Bathgate", "Bathgate", "Bathgate", "Ba…
## $ building_class_category <chr> "01 One Family Dwellings", "01 One Fami…
## $ tax_class_at_present <chr> "1", "1", "1", "1", "1", "1", "1", "1",…
## $ block <dbl> 3030, 3030, 3039, 3043, 3046, 3046, 305…
## $ lot <dbl> 62, 70, 63, 55, 35, 39, 62, 152, 119, 1…
## $ building_class_at_present <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1…
## $ address <chr> "4463 Park Avenue", "4445 Park Avenue",…
## $ apartment_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ zip_code <dbl> 10457, 10457, 10458, 10457, 10457, 1045…
## $ residential_units <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ commercial_units <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ total_units <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
## $ land_square_feet <dbl> 1578, 1694, 1650, 2356, 2050, 1986, 785…
## $ gross_square_feet <dbl> 1470, 1497, 1296, 2047, 1560, 1344, 154…
## $ year_built <dbl> 1899, 1899, 1910, 1901, 1899, 1899, 193…
## $ tax_class_at_time_of_sale <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ building_class_at_time_of_sale <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1…
## $ sale_price <dbl> 455000, 388500, 419000, 470000, 445000,…
## $ sale_date <dttm> 2018-11-28, 2019-07-23, 2018-12-20, 20…
For this project we will only work with a single type of building
class (“R4”), a condominium residential unit in a building with an
elevator. This building class is the most common building class in
this NYC_property_sales
dataframe.
NYC_condos <- NYC_property_sales %>%
# Filter to include only property type: CONDO; RESIDENTIAL UNIT IN ELEVATOR BLDG.
# https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html
filter(building_class_at_time_of_sale == "R4")
Now that the data is loaded, processed, and ready to analyze we will use scatterplots to visualize the relationships between condominium sale price and size. The scatterplot below depicts sale price versus size for all five New York City boroughs, combined. In general we see a trend that larger condominiums are associated with a higher sale price. The data follows a somewhat linear pattern. There is no obvious curvature with the shape of the data, but there is a fair amount of spread. The strength of the bivariate relationship is moderate.
library(ggplot2)
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(aes(color = borough), alpha = 0.3) +
scale_y_continuous(labels = scales::comma, limits = c(0, 75000000)) +
xlim(0, 10000) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_smooth).
Zooming in on a smaller subset of the data, we observe the same trend below that in general, as the size of a condominium increases, so does the sale price. The pattern is somewhat linear, but there is a fair amount of spread, or dispersion, that becomes more pronounced with an increase in condominium size.
library(ggplot2)
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(aes(color = borough), alpha = 0.3) +
scale_y_continuous(labels = scales::comma, limits = c(0, 20000000)) +
xlim(0, 5000) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 153 rows containing non-finite values (stat_smooth).
## Warning: Removed 153 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_smooth).
To better visualize the spread of data for each borough, we use y-axis and x-axis scales that are specific to each borough. What neighborhoods have outliers that we should investigate?
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(alpha = 0.3) +
facet_wrap(~ borough, scales = "free", ncol = 2) +
scale_y_continuous(labels = scales::comma) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
Looking at the plot above, we see that, in general, larger condominiums are associated with a higher sale price in each borough. The data follows a somewhat linear pattern in each plot. But the spread is difficult to see with the Manhattan scatterplot, potentially because of the property sale of around $200 million visible to the far-right which may be impacting the visualization. There is no obvious curvature with the shape of the data, for any borough. The strength of the bivariate relationship is moderate for most boroughs, except for the Queens borough which looks to have a weaker relationship between sale price and size.
We begin our investigation of outliers by sorting all sale records by sale price, from high to low.
NYC_condos %>%
arrange(desc(sale_price)) %>%
head
## # A tibble: 6 × 20
## borough neighb…¹ build…² tax_c…³ block lot build…⁴ address apart…⁵ zip_c…⁶
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 Manhattan Midtown… 13 Con… 2 1030 1082 R4 220 Ce… 50 10019
## 2 Manhattan Upper E… 13 Con… 2 1401 1003 R4 165 Ea… RESI 10065
## 3 Manhattan Gramercy 13 Con… 2 901 1001 R4 200 Ea… 2A 10010
## 4 Manhattan Upper E… 13 Con… 2 1375 1441 R4 520 Pa… DPH60 10022
## 5 Manhattan Upper E… 13 Con… 2 1375 1438 R4 520 Pa… DPH56 10022
## 6 Manhattan Midtown… 13 Con… 2 1030 1117 R4 220 Ce… PH16 10019
## # … with 10 more variables: residential_units <dbl>, commercial_units <dbl>,
## # total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## # year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## # building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>,
## # and abbreviated variable names ¹neighborhood, ²building_class_category,
## # ³tax_class_at_present, ⁴building_class_at_present, ⁵apartment_number,
## # ⁶zip_code
Research of the highest price listing in the dataset reveals that this property sale was actually the most expensive home ever sold in the United States at the time of the sale. The luxurious building that this particular unit is located in even has its own Wikipedia page.
The real estate transaction with the second-highest sale price in this dataset was also news worthy.
These two most expensive property sales also happen to be the two largest in terms of gross square footage. We will remove the second-highest listing at 165 East 66th Street because this transaction looks to be for an entire block of residences. We would like to limit this analysis to transactions of single units, if possible.
# Make copy of dataframe before removing any sale records
NYC_condos_original <- NYC_condos
# Remove 165 East 66th Street sale record
NYC_condos <- NYC_condos %>%
filter(address != "165 East 66th St, Resi")
We will leave the record-setting home sale observation in the dataset for now because we confirmed the sale price to be legitimate.
Next we’ll take a look at the highest sale price observations in Brooklyn. There are a number of sale records at a sale price of around $30 Million, but there is only a single observations in the range of $10 to $30 Million. Could this be correct?
NYC_condos %>%
filter(borough == "Brooklyn") %>%
arrange(desc(sale_price))
## # A tibble: 1,751 × 20
## borough neighb…¹ build…² tax_c…³ block lot build…⁴ address apart…⁵ zip_c…⁶
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 Brooklyn Gowanus 13 Con… 2 1046 1001 R4 554 4 … 3A 11215
## 2 Brooklyn Gowanus 13 Con… 2 1046 1002 R4 554 4 … 3B 11215
## 3 Brooklyn Gowanus 13 Con… 2 1046 1003 R4 554 4 … 3C 11215
## 4 Brooklyn Gowanus 13 Con… 2 1046 1004 R4 554 4 … 3D 11215
## 5 Brooklyn Gowanus 13 Con… 2 1046 1005 R4 554 4 … 3E 11215
## 6 Brooklyn Gowanus 13 Con… 2 1046 1006 R4 554 4 … 4A 11215
## 7 Brooklyn Gowanus 13 Con… 2 1046 1007 R4 554 4 … 4B 11215
## 8 Brooklyn Gowanus 13 Con… 2 1046 1008 R4 554 4 … 4C 11215
## 9 Brooklyn Gowanus 13 Con… 2 1046 1009 R4 554 4 … 4D 11215
## 10 Brooklyn Gowanus 13 Con… 2 1046 1010 R4 554 4 … 4E 11215
## # … with 1,741 more rows, 10 more variables: residential_units <dbl>,
## # commercial_units <dbl>, total_units <dbl>, land_square_feet <dbl>,
## # gross_square_feet <dbl>, year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## # building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>,
## # and abbreviated variable names ¹neighborhood, ²building_class_category,
## # ³tax_class_at_present, ⁴building_class_at_present, ⁵apartment_number,
## # ⁶zip_code
Looking through the results we see that there are approximately 40 sales records with a price of $29,620,207. This price point appears to be unusual for Brooklyn. Scrolling through the results using the viewer in R Studio we also see that all 40 property sales took place on the same day, 2019-04-08. This indicates that a transaction took place on this date where all 40 units were purchased for a TOTAL price of $29,620,207, not $29,620,207 per unit.
Thanks to the internet it does not take long for us to find information about this new building. Sure enough, this building contains 40 total units. But according to the website, the average price per unit for the 26 “active sales” is around $990,000 and the average price for the 14 previous sales is around $816,000, per unit.
For our purposes we will remove all 40 observations from the dataset because sale prices for each unit are erroneous. We could consider other ways of correcting the data. One option is to determine the price-per-square-foot by dividing the $29M sale price by the total number of square feet sold across all 40 units, and then using this number to assign a price to each unit based on its size. But that is not worth our time and we can’t be certain that method would yield valid results.
Fortunately, we have a programmatic option for surfacing potential
multi-unit sales where each sale record contains the sale price for
the entire real estate deal, not the price for the individual unit.
Below we build a grouped filter that returns all sale records with
three or more observations that have the same sale price and sale
date. In general, multi-unit sales contain the same price and sale
date across many sale records. When building a grouped filter we want
to be careful not to “over-filter” by making the criteria too
specific. In our case it looks like the filter effectively surfaces
multi-sale transactions using only two grouping parameters:
sale_price
and sale_date
.
multi_unit_sales <- NYC_condos %>%
group_by(sale_price, sale_date) %>%
filter(n() >= 3) %>%
arrange(desc(sale_price))
We researched many of the addresses listed in the
multi-unit-sales
dataframe and confirmed that most of the
sale records included here are part of a multi-unit transaction. We do
not expect this filter to be 100 percent accurate, for example there
may be a few property sales included here that are not part of a
multi-unit sale. But overall, this grouped filter appears to be
effective.
There are many ways to remove the multi-unit sales from the
NYC_condos
dataframe. Below are two identical methods:
(1) filter for only the sale records we wish to retain that
have two or less instances of sale_price
and
sale_date
, or (2) use an anti-join to drop all records
from NYC_condos
found in multi_unit_sales
.
NYC_condos <- NYC_condos %>%
group_by(sale_price, sale_date) %>%
filter(n() <= 2) %>%
ungroup()
# Alternative method
# NYC_condos <- NYC_condos %>%
# anti_join(multi_unit_sales)
Now that we’ve removed many multi-unit sales from the dataset, let’s
generate a linear regression model for all New York City neighborhoods
combined. As a reminder, we are predicting sale_price
on
the basis of gross_square_feet
.
NYC_condos_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos)
summary(NYC_condos_lm)
##
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19865368 -840026 156635 938585 140321290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.110e+06 5.709e+04 -54.47 <2e-16 ***
## gross_square_feet 4.462e+03 3.947e+01 113.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2945000 on 7944 degrees of freedom
## Multiple R-squared: 0.6166, Adjusted R-squared: 0.6166
## F-statistic: 1.278e+04 on 1 and 7944 DF, p-value: < 2.2e-16
How does this compare to the
NYC_condos_original
dataframe that includes multi-unit
sales?
NYC_condos_original_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_original)
summary(NYC_condos_original_lm)
##
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_original)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79533546 -1211195 -860173 -251714 211550415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 940683.39 57703.86 16.30 <2e-16 ***
## gross_square_feet 1192.72 19.43 61.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4745000 on 8094 degrees of freedom
## Multiple R-squared: 0.3177, Adjusted R-squared: 0.3176
## F-statistic: 3769 on 1 and 8094 DF, p-value: < 2.2e-16
A bivariate linear regression of sale_price
(price)
explained by gross_square_feet
(size) was performed on
two different datasets containing condominium sale records for New
York City. One dataset, NYC_condos
, was cleaned to
remove multi-unit sale records (where the same sale price is
recorded for many units). The other dataset,
NYC_condos_original
, remained unaltered and contained
all original sale records. In each case, the hypothesis is that
there is a relationship between the size of a condominium
(gross_square_feet
) and the price
(sale_price
). We can declare there is a relationship
between condominium size and price when the slope is sufficiently
far from zero.
For each model, the t-statistic was high enough, and the p-value was
low enough, to declare that there is, in fact, a relationship
between gross_square_feet
and sale_price
.
The t-statistic for the cleaned dataset (NYC_condos
)
was nearly double that of the original dataset
(NYC_condos_original
) at 113.04 versus 61.39. In each
case the p-value was well below the 0.05 cutoff for significance
meaning that it is extremely unlikely that the relationship between
condominium size and sale price is due to random chance.
The confidence interval for the slope is [4384.254, 4538.999] for
the NYC_condos
dataset compared to only [1154.636,
1230.802] for the NYC_condos_original
dataset. This
difference can likely be attributed to the removal of many
multi-million dollar sale records for smaller units which impacted
price predictions in the original dataset. The measure for
lack of fit, or residual standard error (RSE) was lower for
the cleaned dataset at 2,945,000 compared to 4,745,000 for the
original dataset. However, it must be noted that the
NYC_condos
is smaller than the
NYC_condos_original
by 150 observations. Finally, the
R-squared, or the proportion of the variability in
sale_price
that can be explained by
gross_square_feet
is 0.6166 for the cleaned
NYC_condos
. This is nearly double the R-squared value
estimated for the NYC_condos_original
dataset at
0.3177.
Below is the updated scatterplot that uses the cleaned
NYC_condos
data. For the Brooklyn borough we are better
able to see the spread of the data and how the trend line fits the
data because we removed the $30 million outliers. The same is true
for the Manhattan borough because the $200 million multi-unit sale
was removed.
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(alpha = 0.3) +
facet_wrap(~ borough, scales = "free", ncol = 2) +
scale_y_continuous(labels = scales::comma) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
Now let’s apply the broom
workflow to compare coefficient
estimates across the five boroughs. The general workflow using broom
and tidyverse tools to generate many models involves 4 steps:
nest()
function from the tidyr
package -
we will nest by borough
.
map()
function
from the purrr
package.
broom
functions tidy()
,
augment()
, and/or glance()
using each
nested model - we’ll work with tidy()
first.
unnest()
function -
this allows us to see the results.
# Step 1: nest by the borough categorical variable
library(broom)
library(tidyr)
library(purrr)
NYC_nested <- NYC_condos %>%
group_by(borough) %>%
nest()
n the previous step, the NYC_condos
dataframe was
collapsed from 7,946 observations to only 5. The nesting process
isolated the sale records for each borough into separate dataframes.
# Inspect the format
print(NYC_nested)
## # A tibble: 5 × 2
## # Groups: borough [5]
## borough data
## <chr> <list>
## 1 Bronx <tibble [330 × 19]>
## 2 Brooklyn <tibble [1,697 × 19]>
## 3 Manhattan <tibble [4,819 × 19]>
## 4 Queens <tibble [1,004 × 19]>
## 5 Staten Island <tibble [96 × 19]>
We can extract and inspect the values of any nested dataframe. Below is a look at the first six rows for Manhattan.
# View first few rows for Manhattan
head(NYC_nested$data[[3]])
## # A tibble: 6 × 19
## neighbor…¹ build…² tax_c…³ block lot build…⁴ address apart…⁵ zip_c…⁶ resid…⁷
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Alphabet … 13 Con… 2 373 1008 R4 324 Ea… 5B 10009 0
## 2 Alphabet … 13 Con… 2 373 1009 R4 324 Ea… 6A 10009 0
## 3 Alphabet … 13 Con… 2 375 1017 R4 754 Ea… 5A 10009 0
## 4 Alphabet … 13 Con… 2 378 1005 R4 399 Ea… 2C 10009 0
## 5 Alphabet … 13 Con… 2 384 1205 R4 310 Ea… 3C 10002 0
## 6 Alphabet … 13 Con… 2 384 1210 R4 310 Ea… 4B 10002 0
## # … with 9 more variables: commercial_units <dbl>, total_units <dbl>,
## # land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## # tax_class_at_time_of_sale <dbl>, building_class_at_time_of_sale <chr>,
## # sale_price <dbl>, sale_date <dttm>, and abbreviated variable names
## # ¹neighborhood, ²building_class_category, ³tax_class_at_present,
## # ⁴building_class_at_present, ⁵apartment_number, ⁶zip_code,
## # ⁷residential_units
he next step in the process is to fit a linear model to each individual dataframe. What this means is that we are generating separate linear models for each borough individually.
# Step 2: fit linear models to each borough, individually
NYC_coefficients <- NYC_condos %>%
group_by(borough) %>%
nest() %>%
mutate(linear_model = map(.x = data,
.f = ~lm(sale_price ~ gross_square_feet,
data = .)))
Taking a look at the data structure we see that we have a new
list-column called linear_model
that contains a linear
model object for each borough.
# Inspect the data structure
print(NYC_coefficients)
## # A tibble: 5 × 3
## # Groups: borough [5]
## borough data linear_model
## <chr> <list> <list>
## 1 Bronx <tibble [330 × 19]> <lm>
## 2 Brooklyn <tibble [1,697 × 19]> <lm>
## 3 Manhattan <tibble [4,819 × 19]> <lm>
## 4 Queens <tibble [1,004 × 19]> <lm>
## 5 Staten Island <tibble [96 × 19]> <lm>
We can view the linear modeling results for any one of the nested
objects using the summary()
function. Below are the
linear regression statistics for Manhattan.
# Verify model results for Manhattan
summary(NYC_coefficients$linear_model[[3]])
##
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21128191 -1001860 224235 1103543 134333578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.254e+06 8.584e+04 -37.91 <2e-16 ***
## gross_square_feet 4.728e+03 5.178e+01 91.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3588000 on 4817 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6337
## F-statistic: 8337 on 1 and 4817 DF, p-value: < 2.2e-16
A quick look at the R-squared value for the Manhattan linear model
indicates that gross_square_feet
looks to be a fairly
good single predictor of sale_price
. Almost two-thirds of
the variability with sale_price
is explained by
gross_square_feet
.
The next step is to transform these linear model summary statistics into a tidy format.
# Step 3: generate a tidy dataframe of coefficient estimates that includes confidence intervals
NYC_coefficients <- NYC_condos %>%
group_by(borough) %>%
nest() %>%
mutate(linear_model = map(.x = data,
.f = ~lm(sale_price ~ gross_square_feet,
data = .))) %>%
mutate(tidy_coefficients = map(.x = linear_model,
.f = tidy,
conf.int = TRUE))
NYC_coefficients
## # A tibble: 5 × 4
## # Groups: borough [5]
## borough data linear_model tidy_coefficients
## <chr> <list> <list> <list>
## 1 Bronx <tibble [330 × 19]> <lm> <tibble [2 × 7]>
## 2 Brooklyn <tibble [1,697 × 19]> <lm> <tibble [2 × 7]>
## 3 Manhattan <tibble [4,819 × 19]> <lm> <tibble [2 × 7]>
## 4 Queens <tibble [1,004 × 19]> <lm> <tibble [2 × 7]>
## 5 Staten Island <tibble [96 × 19]> <lm> <tibble [2 × 7]>
Now we have a new variable called tidy_coefficients
that
contains tidy coefficient estimates for each of the five boroughs.
These tidy statistics are currently stored in five separate
dataframes. Below are the coefficient estimates for Manhattan.
# Inspect the results for Manhattan
print(NYC_coefficients$tidy_coefficients[[3]])
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3253992. 85837. -37.9 1.94e-275 -3422273. -3085712.
## 2 gross_square_feet 4728. 51.8 91.3 0 4626. 4829.
# Step 4: Unnest to a tidy dataframe of coefficient estimates
NYC_coefficients_tidy <- NYC_coefficients %>%
select(borough, tidy_coefficients) %>%
unnest(cols = tidy_coefficients)
print(NYC_coefficients_tidy)
## # A tibble: 10 × 8
## # Groups: borough [5]
## borough term estim…¹ std.e…² stati…³ p.value conf.…⁴ conf.…⁵
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bronx (Intercept) -2.54e5 24668. -10.3 9.89e- 22 -3.03e5 -2.05e5
## 2 Bronx gross_square… 6.49e2 29.6 21.9 2.58e- 66 5.91e2 7.07e2
## 3 Brooklyn (Intercept) -2.72e5 34487. -7.88 5.64e- 15 -3.40e5 -2.04e5
## 4 Brooklyn gross_square… 1.29e3 29.7 43.2 1.06e-275 1.23e3 1.34e3
## 5 Manhattan (Intercept) -3.25e6 85837. -37.9 1.94e-275 -3.42e6 -3.09e6
## 6 Manhattan gross_square… 4.73e3 51.8 91.3 0 4.63e3 4.83e3
## 7 Queens (Intercept) 3.99e4 28311. 1.41 1.59e- 1 -1.56e4 9.55e4
## 8 Queens gross_square… 7.33e2 32.0 22.9 8.14e- 94 6.70e2 7.95e2
## 9 Staten Island (Intercept) 4.64e4 28256. 1.64 1.04e- 1 -9.67e3 1.03e5
## 10 Staten Island gross_square… 2.89e2 30.7 9.41 3.30e- 15 2.28e2 3.49e2
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.low, ⁵conf.high
We’re mainly interested in the slope which explains the change in y (sale price) for each unit change in x (square footage). We can filter for the slope estimate only as follows.
# Filter to return the slope estimate only
NYC_slope <- NYC_coefficients_tidy %>%
filter(term == "gross_square_feet") %>%
arrange(estimate)
print(NYC_slope)
## # A tibble: 5 × 8
## # Groups: borough [5]
## borough term estim…¹ std.e…² stati…³ p.value conf.…⁴ conf.…⁵
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Staten Island gross_square_… 289. 30.7 9.41 3.30e- 15 228. 349.
## 2 Bronx gross_square_… 649. 29.6 21.9 2.58e- 66 591. 707.
## 3 Queens gross_square_… 733. 32.0 22.9 8.14e- 94 670. 795.
## 4 Brooklyn gross_square_… 1285. 29.7 43.2 1.06e-275 1227. 1344.
## 5 Manhattan gross_square_… 4728. 51.8 91.3 0 4626. 4829.
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.low, ⁵conf.high
We’ve arranged the results in ascending order by the slope estimate.
For each of the five boroughs, the t-statistic and p-value indicate
that there is a relationship between sale_price
and
gross_square_feet
. In Staten Island, an increase in
square footage by one unit is estimated to increase the sale price by
about $288, on average. In contrast, an increase in total square
footage by one unit is estimated to result in an increase in sale
price of about $4,728, on average.
Now we will apply the same workflow using broom
tools to
generate tidy regression summary statistics for each of the five
boroughs. Below we follow the same process as we saw previously with
the tidy()
function, but instead we use the
glance()
function.
# Generate a tidy dataframe of regression summary statistics
NYC_summary_stats <- NYC_condos %>%
group_by(borough) %>%
nest() %>%
mutate(linear_model = map(.x = data,
.f = ~lm(sale_price ~ gross_square_feet,
data = .))) %>%
mutate(tidy_summary_stats = map(.x = linear_model,
.f = glance))
print(NYC_summary_stats)
## # A tibble: 5 × 4
## # Groups: borough [5]
## borough data linear_model tidy_summary_stats
## <chr> <list> <list> <list>
## 1 Bronx <tibble [330 × 19]> <lm> <tibble [1 × 12]>
## 2 Brooklyn <tibble [1,697 × 19]> <lm> <tibble [1 × 12]>
## 3 Manhattan <tibble [4,819 × 19]> <lm> <tibble [1 × 12]>
## 4 Queens <tibble [1,004 × 19]> <lm> <tibble [1 × 12]>
## 5 Staten Island <tibble [96 × 19]> <lm> <tibble [1 × 12]>
# Unnest to a tidy dataframe of
NYC_summary_stats_tidy <- NYC_summary_stats %>%
select(borough, tidy_summary_stats) %>%
unnest(cols = tidy_summary_stats) %>%
arrange(r.squared)
print(NYC_summary_stats_tidy)
## # A tibble: 5 × 13
## # Groups: borough [5]
## borough r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Queens 0.344 0.343 2.80e5 525. 8.14e- 94 1 -14018. 2.80e4 2.81e4
## 2 Staten I… 0.485 0.480 7.35e4 88.6 3.30e- 15 1 -1211. 2.43e3 2.44e3
## 3 Brooklyn 0.524 0.524 5.65e5 1868. 1.06e-275 1 -24883. 4.98e4 4.98e4
## 4 Bronx 0.595 0.594 1.51e5 481. 2.58e- 66 1 -4404. 8.81e3 8.82e3
## 5 Manhattan 0.634 0.634 3.59e6 8337. 0 1 -79570. 1.59e5 1.59e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>, and
## # abbreviated variable names ¹r.squared, ²adj.r.squared, ³statistic
These results will be summarized in our conclusion paragraph below.
Our analysis showed that, in general, the
gross_square_feet
variable is useful for explaining, or
estimating, sale_price
for condominiums in New York City.
We observed that removing multi-unit sales from the dataset increased
model accuracy. With linear models generated for New York City as a
whole, and with linear models generated for each borough individually,
we observed in all cases that the t-statistic was high enough, and the
p-value was low enough, to declare that there is a relationship
between gross_square_feet
and sale_price
.
For the linear models that we generated for each individual borough, we observed a wide range in slope estimates. The slope estimate for Manhattan was much higher than the estimate for any of the other boroughs. We did not remove the record-setting $240 million property sale from the dataset, but future analysis should investigate the impacts that this single listing has on modeling results.
Finally, regression summary statistics indicate that
gross_square_feet
is a better single predictor of
sale_price
in some boroughs versus others. For example,
the R-squared value was estimated at approximately 0.63 in Manhattan,
and 0.59 in Brooklyn, compared to an estimate of only 0.35 in Queens.
These differences in R-squared correspond with the scatterplots
generated for each borough; the strength of sale prices versus gross
square feet was higher, and the dispersion (spread), was lower for
Manhattan and Brooklyn as compared to Queens where the relationship
was noticeably weaker because the data was more spread out.