How well does the size of a condominium in New York City explain sale price?

In this project we’ll explore how well the size of a condominium (measured in gross square feet) explains, or predicts, sale price in New York City. We will also explore how well the size of a condominium predicts sale price in each of the five boroughs of New York City: the Bronx, Brooklyn, Manhattan, Staten Island, and Queens.

Before we build linear regression models we will plot sale price versus gross square feet to see if the data exhibits any obvious visual patterns. Plotting the data will also allow us to visualize outliers, and we will investigate some of the outliers to determine if the data was recorded correctly. This property sales data is publicly available and contains sales records from a twelve-month period (November, 2018 through October, 2019).

Understanding the Data

The data used for this project originates from five separate Microsoft Excel files, one for each borough in New York City. The data structure is identical for all five files, which makes it possible to combine all of the data into a single file. The programming steps below outline the steps taken to load each dataset into R, combine the datasets, format the data to facilitate ease of use, and export the dataset as a csv file for later use. Because we are predicting sale price on the basis of size, we deleted sale records with a sale_price less than $10,000 (we assumed these deals to be between family members), and deleted gross_square_feet values of 0.

# Set `eval=FALSE` so that this code chunk is not run multiple times
# Load packages required for New York City property sales data linear modeling
library(readxl) # Load Excel files
library(magrittr) # Make all colnames lower case with no spaces
library(stringr) # String formatting and replacement
library(dplyr) # Data wrangling and manipulation
library(readr) # Load and write csv files
library(ggplot2) # Data visualization
library(tidyr) # Nesting and unnesting dataframes
# Data accessed November, 2019 from: 
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
# Data Used for this Project is from November 2018 to October 2019
# Set the function argument skip = 4 within the read_excel() function call to skip header rows that are not relevant.
brooklyn <- read_excel("rollingsales_brooklyn_Oct2019.xls", skip = 4)
bronx <- read_excel("rollingsales_bronx_Oct2019.xls", skip = 4)
manhattan <- read_excel("rollingsales_manhattan_Oct2019.xls", skip = 4)
staten_island <- read_excel("rollingsales_statenisland_Oct2019.xls", skip = 4)
queens <- read_excel("rollingsales_queens_Oct2019.xls", skip = 4)

# Numeric codes for each borough - to be replaced with names
# Manhattan = 1
# Bronx = 2
# Brooklyn = 3
# Queens = 4 
# Staten Island = 5
# Bind all dataframes into one
NYC_property_sales <- bind_rows(manhattan, bronx, brooklyn, queens, staten_island)

# Remove individual dataframes for each neighborhood
rm(brooklyn, bronx, manhattan, staten_island, queens)

# Replace borough number with borough name, for clarity
NYC_property_sales <- NYC_property_sales %>% 
  mutate(BOROUGH = 
  case_when(BOROUGH == 1 ~ "Manhattan",
            BOROUGH == 2 ~ "Bronx",
            BOROUGH == 3 ~ "Brooklyn",
            BOROUGH == 4 ~ "Queens",
            BOROUGH == 5 ~ "Staten Island"))

# Convert all colnames to lower case with no spaces (use underscores instead of spaces)
colnames(NYC_property_sales) %<>% str_replace_all("\\s", "_") %>% tolower()

# Convert CAPITALIZED columns to Title Case
NYC_property_sales <- NYC_property_sales %>% 
  mutate(neighborhood = str_to_title(neighborhood)) %>% 
  mutate(building_class_category = 
           str_to_title(building_class_category)) %>% 
  mutate(address = str_to_title(address)) 

NYC_property_sales <- NYC_property_sales %>% 
  # Drop ease-ment column that contains no data
  select(-`ease-ment`) %>%
  # Select only distinct observations (drop duplicates)
  distinct()

NYC_property_sales <- NYC_property_sales %>% 
  # Filter out property exchanges between family members
  # We assume here that the threshold is $10,000 US DOllars
  filter(sale_price > 10000) %>% 
  # Remove observations with gross square footage of zero
  # NOTE: We are only doing this here because we are analyzing condominium sales
  # If analyzing single family homes, we would also consider "land_square_feet"
  filter(gross_square_feet > 0) %>% 
  # Drop na values in columns of interest
  drop_na(c(gross_square_feet, sale_price))
# Arrange observations alphabetically by borough and neighborhood

NYC_property_sales <- NYC_property_sales %>% 
  arrange(borough, neighborhood)
# Save results to csv file for future use
# The code below is commented-out to avoid accidental overwriting of the file later on
# write_csv(NYC_property_sales, "NYC_property_sales.csv")

The readr package is loaded so that the csv file can be read into R.

library(readr)
# Read in the CSV file we generated above
NYC_property_sales <- read_csv('NYC_property_sales.csv')

A first glimpse of the data reveals that there are currently over 38,000 sale records in the dataset.

library(dplyr) # Data wrangling and manipulation
glimpse(NYC_property_sales)
## Rows: 38,177
## Columns: 20
## $ borough                        <chr> "Bronx", "Bronx", "Bronx", "Bronx", "Br…
## $ neighborhood                   <chr> "Bathgate", "Bathgate", "Bathgate", "Ba…
## $ building_class_category        <chr> "01 One Family Dwellings", "01 One Fami…
## $ tax_class_at_present           <chr> "1", "1", "1", "1", "1", "1", "1", "1",…
## $ block                          <dbl> 3030, 3030, 3039, 3043, 3046, 3046, 305…
## $ lot                            <dbl> 62, 70, 63, 55, 35, 39, 62, 152, 119, 1…
## $ building_class_at_present      <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1…
## $ address                        <chr> "4463 Park Avenue", "4445 Park Avenue",…
## $ apartment_number               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ zip_code                       <dbl> 10457, 10457, 10458, 10457, 10457, 1045…
## $ residential_units              <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
## $ commercial_units               <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ total_units                    <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
## $ land_square_feet               <dbl> 1578, 1694, 1650, 2356, 2050, 1986, 785…
## $ gross_square_feet              <dbl> 1470, 1497, 1296, 2047, 1560, 1344, 154…
## $ year_built                     <dbl> 1899, 1899, 1910, 1901, 1899, 1899, 193…
## $ tax_class_at_time_of_sale      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ building_class_at_time_of_sale <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1…
## $ sale_price                     <dbl> 455000, 388500, 419000, 470000, 445000,…
## $ sale_date                      <dttm> 2018-11-28, 2019-07-23, 2018-12-20, 20…

For this project we will only work with a single type of building class (“R4”), a condominium residential unit in a building with an elevator. This building class is the most common building class in this NYC_property_sales dataframe.

NYC_condos <- NYC_property_sales %>% 
  # Filter to include only property type: CONDO; RESIDENTIAL UNIT IN ELEVATOR BLDG.
  # https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html
  filter(building_class_at_time_of_sale == "R4")

Explore Bivariate Relationships with Scatterplots

Now that the data is loaded, processed, and ready to analyze we will use scatterplots to visualize the relationships between condominium sale price and size. The scatterplot below depicts sale price versus size for all five New York City boroughs, combined. In general we see a trend that larger condominiums are associated with a higher sale price. The data follows a somewhat linear pattern. There is no obvious curvature with the shape of the data, but there is a fair amount of spread. The strength of the bivariate relationship is moderate.

library(ggplot2)
ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(aes(color = borough), alpha = 0.3) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 75000000)) +
  xlim(0, 10000) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_smooth).

Zooming in on a smaller subset of the data, we observe the same trend below that in general, as the size of a condominium increases, so does the sale price. The pattern is somewhat linear, but there is a fair amount of spread, or dispersion, that becomes more pronounced with an increase in condominium size.

library(ggplot2)
ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(aes(color = borough), alpha = 0.3) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 20000000)) +
  xlim(0, 5000) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 153 rows containing non-finite values (stat_smooth).
## Warning: Removed 153 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_smooth).

To better visualize the spread of data for each borough, we use y-axis and x-axis scales that are specific to each borough. What neighborhoods have outliers that we should investigate?

ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ borough, scales = "free", ncol = 2) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'

Looking at the plot above, we see that, in general, larger condominiums are associated with a higher sale price in each borough. The data follows a somewhat linear pattern in each plot. But the spread is difficult to see with the Manhattan scatterplot, potentially because of the property sale of around $200 million visible to the far-right which may be impacting the visualization. There is no obvious curvature with the shape of the data, for any borough. The strength of the bivariate relationship is moderate for most boroughs, except for the Queens borough which looks to have a weaker relationship between sale price and size.

Outliers and Data Integrity Issues

We begin our investigation of outliers by sorting all sale records by sale price, from high to low.

NYC_condos %>% 
  arrange(desc(sale_price)) %>% 
  head
## # A tibble: 6 × 20
##   borough   neighb…¹ build…² tax_c…³ block   lot build…⁴ address apart…⁵ zip_c…⁶
##   <chr>     <chr>    <chr>   <chr>   <dbl> <dbl> <chr>   <chr>   <chr>     <dbl>
## 1 Manhattan Midtown… 13 Con… 2        1030  1082 R4      220 Ce… 50        10019
## 2 Manhattan Upper E… 13 Con… 2        1401  1003 R4      165 Ea… RESI      10065
## 3 Manhattan Gramercy 13 Con… 2         901  1001 R4      200 Ea… 2A        10010
## 4 Manhattan Upper E… 13 Con… 2        1375  1441 R4      520 Pa… DPH60     10022
## 5 Manhattan Upper E… 13 Con… 2        1375  1438 R4      520 Pa… DPH56     10022
## 6 Manhattan Midtown… 13 Con… 2        1030  1117 R4      220 Ce… PH16      10019
## # … with 10 more variables: residential_units <dbl>, commercial_units <dbl>,
## #   total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## #   year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>,
## #   and abbreviated variable names ¹​neighborhood, ²​building_class_category,
## #   ³​tax_class_at_present, ⁴​building_class_at_present, ⁵​apartment_number,
## #   ⁶​zip_code

Research of the highest price listing in the dataset reveals that this property sale was actually the most expensive home ever sold in the United States at the time of the sale. The luxurious building that this particular unit is located in even has its own Wikipedia page.

The real estate transaction with the second-highest sale price in this dataset was also news worthy.

These two most expensive property sales also happen to be the two largest in terms of gross square footage. We will remove the second-highest listing at 165 East 66th Street because this transaction looks to be for an entire block of residences. We would like to limit this analysis to transactions of single units, if possible.

# Make copy of dataframe before removing any sale records
NYC_condos_original <- NYC_condos
# Remove 165 East 66th Street sale record
NYC_condos <- NYC_condos %>% 
  filter(address != "165 East 66th St, Resi")

We will leave the record-setting home sale observation in the dataset for now because we confirmed the sale price to be legitimate.

How well does gross square feet explain sale price for all records combined?

Next we’ll take a look at the highest sale price observations in Brooklyn. There are a number of sale records at a sale price of around $30 Million, but there is only a single observations in the range of $10 to $30 Million. Could this be correct?

NYC_condos %>% 
  filter(borough == "Brooklyn") %>% 
  arrange(desc(sale_price))
## # A tibble: 1,751 × 20
##    borough  neighb…¹ build…² tax_c…³ block   lot build…⁴ address apart…⁵ zip_c…⁶
##    <chr>    <chr>    <chr>   <chr>   <dbl> <dbl> <chr>   <chr>   <chr>     <dbl>
##  1 Brooklyn Gowanus  13 Con… 2        1046  1001 R4      554 4 … 3A        11215
##  2 Brooklyn Gowanus  13 Con… 2        1046  1002 R4      554 4 … 3B        11215
##  3 Brooklyn Gowanus  13 Con… 2        1046  1003 R4      554 4 … 3C        11215
##  4 Brooklyn Gowanus  13 Con… 2        1046  1004 R4      554 4 … 3D        11215
##  5 Brooklyn Gowanus  13 Con… 2        1046  1005 R4      554 4 … 3E        11215
##  6 Brooklyn Gowanus  13 Con… 2        1046  1006 R4      554 4 … 4A        11215
##  7 Brooklyn Gowanus  13 Con… 2        1046  1007 R4      554 4 … 4B        11215
##  8 Brooklyn Gowanus  13 Con… 2        1046  1008 R4      554 4 … 4C        11215
##  9 Brooklyn Gowanus  13 Con… 2        1046  1009 R4      554 4 … 4D        11215
## 10 Brooklyn Gowanus  13 Con… 2        1046  1010 R4      554 4 … 4E        11215
## # … with 1,741 more rows, 10 more variables: residential_units <dbl>,
## #   commercial_units <dbl>, total_units <dbl>, land_square_feet <dbl>,
## #   gross_square_feet <dbl>, year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>,
## #   and abbreviated variable names ¹​neighborhood, ²​building_class_category,
## #   ³​tax_class_at_present, ⁴​building_class_at_present, ⁵​apartment_number,
## #   ⁶​zip_code

Looking through the results we see that there are approximately 40 sales records with a price of $29,620,207. This price point appears to be unusual for Brooklyn. Scrolling through the results using the viewer in R Studio we also see that all 40 property sales took place on the same day, 2019-04-08. This indicates that a transaction took place on this date where all 40 units were purchased for a TOTAL price of $29,620,207, not $29,620,207 per unit.

Thanks to the internet it does not take long for us to find information about this new building. Sure enough, this building contains 40 total units. But according to the website, the average price per unit for the 26 “active sales” is around $990,000 and the average price for the 14 previous sales is around $816,000, per unit.

For our purposes we will remove all 40 observations from the dataset because sale prices for each unit are erroneous. We could consider other ways of correcting the data. One option is to determine the price-per-square-foot by dividing the $29M sale price by the total number of square feet sold across all 40 units, and then using this number to assign a price to each unit based on its size. But that is not worth our time and we can’t be certain that method would yield valid results.

Fortunately, we have a programmatic option for surfacing potential multi-unit sales where each sale record contains the sale price for the entire real estate deal, not the price for the individual unit. Below we build a grouped filter that returns all sale records with three or more observations that have the same sale price and sale date. In general, multi-unit sales contain the same price and sale date across many sale records. When building a grouped filter we want to be careful not to “over-filter” by making the criteria too specific. In our case it looks like the filter effectively surfaces multi-sale transactions using only two grouping parameters: sale_price and sale_date.

multi_unit_sales <- NYC_condos %>% 
  group_by(sale_price, sale_date) %>% 
  filter(n() >= 3) %>% 
  arrange(desc(sale_price))

We researched many of the addresses listed in the multi-unit-sales dataframe and confirmed that most of the sale records included here are part of a multi-unit transaction. We do not expect this filter to be 100 percent accurate, for example there may be a few property sales included here that are not part of a multi-unit sale. But overall, this grouped filter appears to be effective.

There are many ways to remove the multi-unit sales from the NYC_condos dataframe. Below are two identical methods: (1) filter for only the sale records we wish to retain that have two or less instances of sale_price and sale_date, or (2) use an anti-join to drop all records from NYC_condos found in multi_unit_sales.

NYC_condos <- NYC_condos %>%
  group_by(sale_price, sale_date) %>%
  filter(n() <= 2) %>%
  ungroup()
# Alternative method
# NYC_condos <- NYC_condos %>% 
#  anti_join(multi_unit_sales)

Linear Regression Model for Boroughs in New York City Combined

Now that we’ve removed many multi-unit sales from the dataset, let’s generate a linear regression model for all New York City neighborhoods combined. As a reminder, we are predicting sale_price on the basis of gross_square_feet.

NYC_condos_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos)  
summary(NYC_condos_lm)
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19865368   -840026    156635    938585 140321290 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.110e+06  5.709e+04  -54.47   <2e-16 ***
## gross_square_feet  4.462e+03  3.947e+01  113.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2945000 on 7944 degrees of freedom
## Multiple R-squared:  0.6166, Adjusted R-squared:  0.6166 
## F-statistic: 1.278e+04 on 1 and 7944 DF,  p-value: < 2.2e-16

How does this compare to the NYC_condos_original dataframe that includes multi-unit sales?

NYC_condos_original_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_original)  
summary(NYC_condos_original_lm)
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_original)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -79533546  -1211195   -860173   -251714 211550415 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       940683.39   57703.86   16.30   <2e-16 ***
## gross_square_feet   1192.72      19.43   61.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4745000 on 8094 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.3176 
## F-statistic:  3769 on 1 and 8094 DF,  p-value: < 2.2e-16

Comparison of linear modeling results

A bivariate linear regression of sale_price (price) explained by gross_square_feet (size) was performed on two different datasets containing condominium sale records for New York City. One dataset, NYC_condos, was cleaned to remove multi-unit sale records (where the same sale price is recorded for many units). The other dataset, NYC_condos_original, remained unaltered and contained all original sale records. In each case, the hypothesis is that there is a relationship between the size of a condominium (gross_square_feet) and the price (sale_price). We can declare there is a relationship between condominium size and price when the slope is sufficiently far from zero.

For each model, the t-statistic was high enough, and the p-value was low enough, to declare that there is, in fact, a relationship between gross_square_feet and sale_price. The t-statistic for the cleaned dataset (NYC_condos) was nearly double that of the original dataset (NYC_condos_original) at 113.04 versus 61.39. In each case the p-value was well below the 0.05 cutoff for significance meaning that it is extremely unlikely that the relationship between condominium size and sale price is due to random chance.

The confidence interval for the slope is [4384.254, 4538.999] for the NYC_condos dataset compared to only [1154.636, 1230.802] for the NYC_condos_original dataset. This difference can likely be attributed to the removal of many multi-million dollar sale records for smaller units which impacted price predictions in the original dataset. The measure for lack of fit, or residual standard error (RSE) was lower for the cleaned dataset at 2,945,000 compared to 4,745,000 for the original dataset. However, it must be noted that the NYC_condos is smaller than the NYC_condos_original by 150 observations. Finally, the R-squared, or the proportion of the variability in sale_price that can be explained by gross_square_feet is 0.6166 for the cleaned NYC_condos. This is nearly double the R-squared value estimated for the NYC_condos_original dataset at 0.3177.

Below is the updated scatterplot that uses the cleaned NYC_condos data. For the Brooklyn borough we are better able to see the spread of the data and how the trend line fits the data because we removed the $30 million outliers. The same is true for the Manhattan borough because the $200 million multi-unit sale was removed.

ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ borough, scales = "free", ncol = 2) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'

Linear Regression Models for each Borough - Coefficient Estimates

Now let’s apply the broom workflow to compare coefficient estimates across the five boroughs. The general workflow using broom and tidyverse tools to generate many models involves 4 steps:

  1. Nest a dataframe by a categorical variable with the nest() function from the tidyr package - we will nest by borough.
  2. Fit models to nested dataframes with the map() function from the purrr package.
  3. Apply the broom functions tidy(), augment(), and/or glance() using each nested model - we’ll work with tidy() first.
  4. Return a tidy dataframe with the unnest() function - this allows us to see the results.
# Step 1: nest by the borough categorical variable
library(broom)
library(tidyr)
library(purrr)
NYC_nested <- NYC_condos %>% 
  group_by(borough) %>% 
  nest()

n the previous step, the NYC_condos dataframe was collapsed from 7,946 observations to only 5. The nesting process isolated the sale records for each borough into separate dataframes.

# Inspect the format
print(NYC_nested)
## # A tibble: 5 × 2
## # Groups:   borough [5]
##   borough       data                 
##   <chr>         <list>               
## 1 Bronx         <tibble [330 × 19]>  
## 2 Brooklyn      <tibble [1,697 × 19]>
## 3 Manhattan     <tibble [4,819 × 19]>
## 4 Queens        <tibble [1,004 × 19]>
## 5 Staten Island <tibble [96 × 19]>

We can extract and inspect the values of any nested dataframe. Below is a look at the first six rows for Manhattan.

# View first few rows for Manhattan
head(NYC_nested$data[[3]])
## # A tibble: 6 × 19
##   neighbor…¹ build…² tax_c…³ block   lot build…⁴ address apart…⁵ zip_c…⁶ resid…⁷
##   <chr>      <chr>   <chr>   <dbl> <dbl> <chr>   <chr>   <chr>     <dbl>   <dbl>
## 1 Alphabet … 13 Con… 2         373  1008 R4      324 Ea… 5B        10009       0
## 2 Alphabet … 13 Con… 2         373  1009 R4      324 Ea… 6A        10009       0
## 3 Alphabet … 13 Con… 2         375  1017 R4      754 Ea… 5A        10009       0
## 4 Alphabet … 13 Con… 2         378  1005 R4      399 Ea… 2C        10009       0
## 5 Alphabet … 13 Con… 2         384  1205 R4      310 Ea… 3C        10002       0
## 6 Alphabet … 13 Con… 2         384  1210 R4      310 Ea… 4B        10002       0
## # … with 9 more variables: commercial_units <dbl>, total_units <dbl>,
## #   land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## #   tax_class_at_time_of_sale <dbl>, building_class_at_time_of_sale <chr>,
## #   sale_price <dbl>, sale_date <dttm>, and abbreviated variable names
## #   ¹​neighborhood, ²​building_class_category, ³​tax_class_at_present,
## #   ⁴​building_class_at_present, ⁵​apartment_number, ⁶​zip_code,
## #   ⁷​residential_units

he next step in the process is to fit a linear model to each individual dataframe. What this means is that we are generating separate linear models for each borough individually.

# Step 2: fit linear models to each borough, individually
NYC_coefficients <- NYC_condos %>% 
  group_by(borough) %>% 
  nest() %>% 
  mutate(linear_model = map(.x = data, 
                            .f = ~lm(sale_price ~ gross_square_feet, 
                                     data = .)))

Taking a look at the data structure we see that we have a new list-column called linear_model that contains a linear model object for each borough.

# Inspect the data structure
print(NYC_coefficients)
## # A tibble: 5 × 3
## # Groups:   borough [5]
##   borough       data                  linear_model
##   <chr>         <list>                <list>      
## 1 Bronx         <tibble [330 × 19]>   <lm>        
## 2 Brooklyn      <tibble [1,697 × 19]> <lm>        
## 3 Manhattan     <tibble [4,819 × 19]> <lm>        
## 4 Queens        <tibble [1,004 × 19]> <lm>        
## 5 Staten Island <tibble [96 × 19]>    <lm>

We can view the linear modeling results for any one of the nested objects using the summary() function. Below are the linear regression statistics for Manhattan.

# Verify model results for Manhattan
summary(NYC_coefficients$linear_model[[3]])
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21128191  -1001860    224235   1103543 134333578 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.254e+06  8.584e+04  -37.91   <2e-16 ***
## gross_square_feet  4.728e+03  5.178e+01   91.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3588000 on 4817 degrees of freedom
## Multiple R-squared:  0.6338, Adjusted R-squared:  0.6337 
## F-statistic:  8337 on 1 and 4817 DF,  p-value: < 2.2e-16

A quick look at the R-squared value for the Manhattan linear model indicates that gross_square_feet looks to be a fairly good single predictor of sale_price. Almost two-thirds of the variability with sale_price is explained by gross_square_feet.

The next step is to transform these linear model summary statistics into a tidy format.

# Step 3: generate a tidy dataframe of coefficient estimates that includes confidence intervals
NYC_coefficients <- NYC_condos %>% 
  group_by(borough) %>% 
  nest() %>% 
  mutate(linear_model = map(.x = data, 
                            .f = ~lm(sale_price ~ gross_square_feet, 
                                     data = .))) %>%
  mutate(tidy_coefficients = map(.x = linear_model, 
                                 .f = tidy, 
                                 conf.int = TRUE))
NYC_coefficients
## # A tibble: 5 × 4
## # Groups:   borough [5]
##   borough       data                  linear_model tidy_coefficients
##   <chr>         <list>                <list>       <list>           
## 1 Bronx         <tibble [330 × 19]>   <lm>         <tibble [2 × 7]> 
## 2 Brooklyn      <tibble [1,697 × 19]> <lm>         <tibble [2 × 7]> 
## 3 Manhattan     <tibble [4,819 × 19]> <lm>         <tibble [2 × 7]> 
## 4 Queens        <tibble [1,004 × 19]> <lm>         <tibble [2 × 7]> 
## 5 Staten Island <tibble [96 × 19]>    <lm>         <tibble [2 × 7]>

Now we have a new variable called tidy_coefficients that contains tidy coefficient estimates for each of the five boroughs. These tidy statistics are currently stored in five separate dataframes. Below are the coefficient estimates for Manhattan.

# Inspect the results for Manhattan
print(NYC_coefficients$tidy_coefficients[[3]])
## # A tibble: 2 × 7
##   term               estimate std.error statistic   p.value  conf.low conf.high
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)       -3253992.   85837.      -37.9 1.94e-275 -3422273. -3085712.
## 2 gross_square_feet     4728.      51.8      91.3 0             4626.     4829.
# Step 4: Unnest to a tidy dataframe of coefficient estimates
NYC_coefficients_tidy <- NYC_coefficients %>% 
  select(borough, tidy_coefficients) %>% 
  unnest(cols = tidy_coefficients)
print(NYC_coefficients_tidy)
## # A tibble: 10 × 8
## # Groups:   borough [5]
##    borough       term          estim…¹ std.e…² stati…³   p.value conf.…⁴ conf.…⁵
##    <chr>         <chr>           <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
##  1 Bronx         (Intercept)   -2.54e5 24668.   -10.3  9.89e- 22 -3.03e5 -2.05e5
##  2 Bronx         gross_square…  6.49e2    29.6   21.9  2.58e- 66  5.91e2  7.07e2
##  3 Brooklyn      (Intercept)   -2.72e5 34487.    -7.88 5.64e- 15 -3.40e5 -2.04e5
##  4 Brooklyn      gross_square…  1.29e3    29.7   43.2  1.06e-275  1.23e3  1.34e3
##  5 Manhattan     (Intercept)   -3.25e6 85837.   -37.9  1.94e-275 -3.42e6 -3.09e6
##  6 Manhattan     gross_square…  4.73e3    51.8   91.3  0          4.63e3  4.83e3
##  7 Queens        (Intercept)    3.99e4 28311.     1.41 1.59e-  1 -1.56e4  9.55e4
##  8 Queens        gross_square…  7.33e2    32.0   22.9  8.14e- 94  6.70e2  7.95e2
##  9 Staten Island (Intercept)    4.64e4 28256.     1.64 1.04e-  1 -9.67e3  1.03e5
## 10 Staten Island gross_square…  2.89e2    30.7    9.41 3.30e- 15  2.28e2  3.49e2
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic,
## #   ⁴​conf.low, ⁵​conf.high

We’re mainly interested in the slope which explains the change in y (sale price) for each unit change in x (square footage). We can filter for the slope estimate only as follows.

# Filter to return the slope estimate only 
NYC_slope <- NYC_coefficients_tidy %>%   
  filter(term == "gross_square_feet") %>% 
  arrange(estimate)
print(NYC_slope)
## # A tibble: 5 × 8
## # Groups:   borough [5]
##   borough       term           estim…¹ std.e…² stati…³   p.value conf.…⁴ conf.…⁵
##   <chr>         <chr>            <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
## 1 Staten Island gross_square_…    289.    30.7    9.41 3.30e- 15    228.    349.
## 2 Bronx         gross_square_…    649.    29.6   21.9  2.58e- 66    591.    707.
## 3 Queens        gross_square_…    733.    32.0   22.9  8.14e- 94    670.    795.
## 4 Brooklyn      gross_square_…   1285.    29.7   43.2  1.06e-275   1227.   1344.
## 5 Manhattan     gross_square_…   4728.    51.8   91.3  0           4626.   4829.
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic,
## #   ⁴​conf.low, ⁵​conf.high

We’ve arranged the results in ascending order by the slope estimate. For each of the five boroughs, the t-statistic and p-value indicate that there is a relationship between sale_price and gross_square_feet. In Staten Island, an increase in square footage by one unit is estimated to increase the sale price by about $288, on average. In contrast, an increase in total square footage by one unit is estimated to result in an increase in sale price of about $4,728, on average.

Linear Regression Models for each Borough - Regression Summary Statistics

Now we will apply the same workflow using broom tools to generate tidy regression summary statistics for each of the five boroughs. Below we follow the same process as we saw previously with the tidy() function, but instead we use the glance() function.

# Generate a tidy dataframe of regression summary statistics
NYC_summary_stats <- NYC_condos %>% 
  group_by(borough) %>% 
  nest() %>% 
  mutate(linear_model = map(.x = data, 
                            .f = ~lm(sale_price ~ gross_square_feet, 
                                     data = .))) %>%
  mutate(tidy_summary_stats = map(.x = linear_model,
                                  .f = glance))
print(NYC_summary_stats)
## # A tibble: 5 × 4
## # Groups:   borough [5]
##   borough       data                  linear_model tidy_summary_stats
##   <chr>         <list>                <list>       <list>            
## 1 Bronx         <tibble [330 × 19]>   <lm>         <tibble [1 × 12]> 
## 2 Brooklyn      <tibble [1,697 × 19]> <lm>         <tibble [1 × 12]> 
## 3 Manhattan     <tibble [4,819 × 19]> <lm>         <tibble [1 × 12]> 
## 4 Queens        <tibble [1,004 × 19]> <lm>         <tibble [1 × 12]> 
## 5 Staten Island <tibble [96 × 19]>    <lm>         <tibble [1 × 12]>
# Unnest to a tidy dataframe of
NYC_summary_stats_tidy <- NYC_summary_stats %>% 
  select(borough, tidy_summary_stats) %>% 
  unnest(cols = tidy_summary_stats) %>% 
  arrange(r.squared)
print(NYC_summary_stats_tidy)
## # A tibble: 5 × 13
## # Groups:   borough [5]
##   borough   r.squ…¹ adj.r…²  sigma stati…³   p.value    df  logLik    AIC    BIC
##   <chr>       <dbl>   <dbl>  <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1 Queens      0.344   0.343 2.80e5   525.  8.14e- 94     1 -14018. 2.80e4 2.81e4
## 2 Staten I…   0.485   0.480 7.35e4    88.6 3.30e- 15     1  -1211. 2.43e3 2.44e3
## 3 Brooklyn    0.524   0.524 5.65e5  1868.  1.06e-275     1 -24883. 4.98e4 4.98e4
## 4 Bronx       0.595   0.594 1.51e5   481.  2.58e- 66     1  -4404. 8.81e3 8.82e3
## 5 Manhattan   0.634   0.634 3.59e6  8337.  0             1 -79570. 1.59e5 1.59e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>, and
## #   abbreviated variable names ¹​r.squared, ²​adj.r.squared, ³​statistic

These results will be summarized in our conclusion paragraph below.

Conclusion

Our analysis showed that, in general, the gross_square_feet variable is useful for explaining, or estimating, sale_price for condominiums in New York City. We observed that removing multi-unit sales from the dataset increased model accuracy. With linear models generated for New York City as a whole, and with linear models generated for each borough individually, we observed in all cases that the t-statistic was high enough, and the p-value was low enough, to declare that there is a relationship between gross_square_feet and sale_price.

For the linear models that we generated for each individual borough, we observed a wide range in slope estimates. The slope estimate for Manhattan was much higher than the estimate for any of the other boroughs. We did not remove the record-setting $240 million property sale from the dataset, but future analysis should investigate the impacts that this single listing has on modeling results.

Finally, regression summary statistics indicate that gross_square_feet is a better single predictor of sale_price in some boroughs versus others. For example, the R-squared value was estimated at approximately 0.63 in Manhattan, and 0.59 in Brooklyn, compared to an estimate of only 0.35 in Queens. These differences in R-squared correspond with the scatterplots generated for each borough; the strength of sale prices versus gross square feet was higher, and the dispersion (spread), was lower for Manhattan and Brooklyn as compared to Queens where the relationship was noticeably weaker because the data was more spread out.