Introduction

This regression model seeks to determine if the sales price of a single-family home can be predicted from the house’s characteristics. My analysis will use over 3 months of sales data from my county’s appraisal website. Although it is publicly available, I removed the owner, address, parcel ID, and date of sale from data before importing into R. The two continuous predictors will be the year built and heated area of the house. A categorical predictor will be whether the house has an attached garage. The other categorical predictor will be based on the number of bedrooms with houses split into groups - 1, 2, 3, and 4 or more bedrooms.

Importing Data

The dataset will be composed of one file that contains the sales price, year built, heated area, number of bedrooms, and if the house has an attached garage.

sales_raw <- read.csv("~/sales.csv") %>%
  as_tibble

sales_raw
## # A tibble: 876 × 12
##    nh_cd   price qualified use_code acreage vacant   year_built heated_area
##    <dbl>   <int> <chr>        <int>   <dbl> <chr>         <int>       <int>
##  1  1240  230000 Qualified        1    0.43 Improved       2006        1819
##  2  3031  325000 Qualified        1    1.63 Improved       1974        1882
##  3  1115  274300 Qualified        1    1    Improved       1984        1436
##  4  1228  487000 Qualified        1    0.79 Improved       1988        3115
##  5  3209 1100000 Qualified        1    0.54 Improved       2019        3979
##  6  3202  285000 Qualified        1    0.37 Improved       1979        1229
##  7  4593   63000 Qualified        1    0.01 Improved       1973         575
##  8  1083  335000 Qualified        1    0.29 Improved       1995        2213
##  9  4491  170000 Qualified        1    0.04 Improved       2006        1278
## 10  4485  225000 Qualified        1    0.25 Improved       1986        1681
## # … with 866 more rows, and 4 more variables: bedrooms <int>,
## #   bathroom_fixtures <int>, has_garage <int>, millage <int>

Number of Bedrooms Histogram

ggplot(sales_raw, aes(bedrooms)) + 
  geom_histogram(binwidth = 1, color = "#000000", fill = "#0099F8") +
  geom_text(aes(label = ..count..), stat = "count", vjust = -0.25) +
  scale_x_continuous(breaks=seq(0, 7, by = 1)) +
  xlab("# of Bedrooms") + 
  theme_classic()

Number of Bedrooms Frequency

bedroom_frequency <- sales_raw %>%
  group_by(bedrooms) %>%
  summarise(cnt = n()) %>%
  mutate(pct = round(cnt / sum(cnt) * 100, 1)) %>% 
  arrange(desc(pct))

bedroom_frequency
## # A tibble: 7 × 3
##   bedrooms   cnt   pct
##      <int> <int> <dbl>
## 1        3   402  45.9
## 2        2   382  43.6
## 3        4    54   6.2
## 4        1    19   2.2
## 5        5    13   1.5
## 6        0     5   0.6
## 7        7     1   0.1
top_2 <- slice_max(bedroom_frequency, cnt, n = 2)

The data show that 3- and 2-bedroom houses accounted for a combined 89.5% of houses sold. We can also see 5 houses sold with no bedrooms.

filter(sales_raw, bedrooms == 0)
## # A tibble: 5 × 12
##   nh_cd  price qualified use_code acreage vacant year_built heated_area bedrooms
##   <dbl>  <int> <chr>        <int>   <dbl> <chr>       <int>       <int>    <int>
## 1   561 100000 Qualified        1    4.79 Impro…       1994         784        0
## 2   561 100000 Qualified        1    4.79 Impro…       1977        1440        0
## 3  1241 380000 Qualified        1    0.5  Impro…       2021         900        0
## 4  1241 380000 Qualified        1    0.5  Impro…       2021        1498        0
## 5  5130 499900 Qualified        1    1.37 Impro…       1957        1914        0
## # … with 3 more variables: bathroom_fixtures <int>, has_garage <int>,
## #   millage <int>

No-Bedroom Houses

Upon inspection, 4 of the “no bedroom” houses are composed of 2 houses with buildings on adjacent parcels. The last “house” appears to be a single building on a large parcel. I will exclude the “no bedroom” houses since the sales price listed for these is actually based on the sales price of the adjacent house.

Creating Categories for Bedrooms

sales <- sales_raw %>% 
  # Remove 0 bedroom houses
  filter(bedrooms != 0) %>%
  # Group together houses with 4 or more bedrooms
  mutate(has_br = if_else(bedrooms >= 4, '4_or_more', paste(bedrooms))) %>%
  # Select outcome and predictor columns
  select(price,
         year_built,
         heated_area,
         bedrooms,
         has_br,
         has_garage) %>% 
  # Create Dummy Columns for Bedroom Groups and Garage
  dummy_cols('has_br')

sales
## # A tibble: 871 × 10
##      price year_built heated_area bedrooms has_br   has_garage has_br_1 has_br_2
##      <int>      <int>       <int>    <int> <chr>         <int>    <int>    <int>
##  1  230000       2006        1819        3 3                 1        0        0
##  2  325000       1974        1882        3 3                 1        0        0
##  3  274300       1984        1436        3 3                 1        0        0
##  4  487000       1988        3115        5 4_or_mo…          1        0        0
##  5 1100000       2019        3979        5 4_or_mo…          1        0        0
##  6  285000       1979        1229        3 3                 0        0        0
##  7   63000       1973         575        1 1                 0        1        0
##  8  335000       1995        2213        3 3                 1        0        0
##  9  170000       2006        1278        2 2                 1        0        1
## 10  225000       1986        1681        3 3                 1        0        0
## # … with 861 more rows, and 2 more variables: has_br_3 <int>,
## #   has_br_4_or_more <int>

Scatter Plot of Continuous Predictors

ggplot(sales, aes(x=year_built, y=price)) + 
  geom_point(size=0.5) + 
  xlab("Year Built") + 
  ylab("Sales Price") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal()

ggplot(sales, aes(x=heated_area, y=price)) + 
  geom_point(size=0.5) + 
  xlab("Heated Area") + 
  ylab("Sales Price") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal()

Data Summary

Year Built

# Histogram of Year Built
ggplot(sales, aes(year_built)) + 
  geom_histogram(binwidth = 5) +
  xlab("Year Built") + 
  theme_minimal()

# Year Built Summary Stats
year_summary <- summarize(sales, mean = mean(year_built, na.rm=TRUE), 
                          max = max(year_built, na.rm=TRUE),  
                          median = median(year_built, na.rm=TRUE), 
                          min = min(year_built, na.rm=TRUE))

The average year built for houses sold is around 1984, and the median is 1987. The most recent home was built in 2021, and the oldest was built in 1914.

Heated Area

# Histogram of Heated Area
ggplot(sales, aes(heated_area)) + 
  geom_histogram(binwidth = 100) +
  xlab("Heated Area") + 
  theme_minimal()

# Year Built Summary Stats
area_summary <- summarize(sales, mean = mean(heated_area, na.rm=TRUE), 
                          sd = sd(heated_area, na.rm=TRUE),
                          min = min(heated_area, na.rm=TRUE),
                          max = max(heated_area, na.rm=TRUE),
                          median = median(heated_area, na.rm=TRUE), 
                          IQR = IQR(heated_area, na.rm=TRUE))

The heated area has a mean of 1,428 sq-ft with a standard deviation of 607 sq-ft. The median is 1,361 sq-ft with an interquartile range of 650 sq-ft.

Attached Garage

# Frequency of houses that have a garage
garage_frequency <- sales %>%
  group_by(has_garage) %>%
  summarise(cnt = n()) %>%
  mutate(pct = round(cnt / sum(cnt) * 100, 1)) %>% 
  arrange(desc(pct))

# Pie Chart 
ggplot(garage_frequency, aes(x = "", y = pct, fill = factor(has_garage))) +
  geom_col() +
  geom_text(aes(label = paste(pct,"%")),
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  coord_polar(theta = "y") + 
  guides(fill = guide_legend(title = "Has Garage (1=Yes)")) +
  theme_void()

About 64% of the houses that sold had attached garages.

Constructing Regression Line

Constructing Model with All Main Effects

I will construct a model with both continuous and both categorical predictors while removing the reference group (houses with 1 bedroom) from the model.

# Constructing the model with dummy columns
model_dummy <- lm(price ~ year_built + heated_area + has_garage + has_br_2 + has_br_3 + has_br_4_or_more,
                 data = sales)

summary(model_dummy)
## 
## Call:
## lm(formula = price ~ year_built + heated_area + has_garage + 
##     has_br_2 + has_br_3 + has_br_4_or_more, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368804  -62904  -20153   35024 1027678 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.706e+06  4.976e+05  -5.438 7.00e-08 ***
## year_built        1.401e+03  2.522e+02   5.556 3.68e-08 ***
## heated_area       1.141e+02  8.340e+00  13.680  < 2e-16 ***
## has_garage        5.536e+03  1.066e+04   0.519  0.60358    
## has_br_2         -1.686e+04  2.893e+04  -0.583  0.56028    
## has_br_3          2.271e+04  2.954e+04   0.769  0.44223    
## has_br_4_or_more  1.045e+05  3.336e+04   3.134  0.00178 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121900 on 864 degrees of freedom
## Multiple R-squared:  0.4051, Adjusted R-squared:  0.401 
## F-statistic: 98.07 on 6 and 864 DF,  p-value: < 2.2e-16
# Constructing the same model without dummy columns (will use this method to avoid typos)
model_1 <- lm(price ~ year_built + heated_area + has_garage + has_br,
                 data = sales)

summary(model_1)
## 
## Call:
## lm(formula = price ~ year_built + heated_area + has_garage + 
##     has_br, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368804  -62904  -20153   35024 1027678 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.706e+06  4.976e+05  -5.438 7.00e-08 ***
## year_built       1.401e+03  2.522e+02   5.556 3.68e-08 ***
## heated_area      1.141e+02  8.340e+00  13.680  < 2e-16 ***
## has_garage       5.536e+03  1.066e+04   0.519  0.60358    
## has_br2         -1.686e+04  2.893e+04  -0.583  0.56028    
## has_br3          2.271e+04  2.954e+04   0.769  0.44223    
## has_br4_or_more  1.045e+05  3.336e+04   3.134  0.00178 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121900 on 864 degrees of freedom
## Multiple R-squared:  0.4051, Adjusted R-squared:  0.401 
## F-statistic: 98.07 on 6 and 864 DF,  p-value: < 2.2e-16

Constructing Model with Three-Way Interactions

Now, I will construct a model with three-way interactions. I will avoid including the four-way interaction since it would be difficult to interpret. It also creates a singularity. See below.

# Example of creating 4-way interaction (will not be used)
model_4way <- lm(price ~ year_built * heated_area * has_garage * has_br,
                 data = sales)

summary(model_4way)
## 
## Call:
## lm(formula = price ~ year_built * heated_area * has_garage * 
##     has_br, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -360825  -57773  -12565   33537  891915 
## 
## Coefficients: (1 not defined because of singularities)
##                                                     Estimate Std. Error t value
## (Intercept)                                       -1.171e+07  1.552e+07  -0.755
## year_built                                         5.948e+03  7.866e+03   0.756
## heated_area                                        1.593e+04  2.083e+04   0.765
## has_garage                                         2.697e+07  7.169e+07   0.376
## has_br2                                            9.127e+06  1.574e+07   0.580
## has_br3                                            4.101e+06  1.577e+07   0.260
## has_br4_or_more                                   -2.512e+06  1.702e+07  -0.148
## year_built:heated_area                            -7.993e+00  1.055e+01  -0.757
## year_built:has_garage                             -1.364e+04  3.591e+04  -0.380
## heated_area:has_garage                            -5.547e+03  4.112e+03  -1.349
## year_built:has_br2                                -4.590e+03  7.978e+03  -0.575
## year_built:has_br3                                -2.007e+03  7.994e+03  -0.251
## year_built:has_br4_or_more                         1.393e+03  8.625e+03   0.162
## heated_area:has_br2                               -1.368e+04  2.097e+04  -0.652
## heated_area:has_br3                               -1.500e+04  2.094e+04  -0.716
## heated_area:has_br4_or_more                       -1.141e+04  2.117e+04  -0.539
## has_garage:has_br2                                -2.984e+07  7.194e+07  -0.415
## has_garage:has_br3                                -1.456e+07  7.182e+07  -0.203
## has_garage:has_br4_or_more                        -1.333e+07  7.164e+07  -0.186
## year_built:heated_area:has_garage                  2.811e+00  2.069e+00   1.359
## year_built:heated_area:has_br2                     6.873e+00  1.062e+01   0.647
## year_built:heated_area:has_br3                     7.554e+00  1.061e+01   0.712
## year_built:heated_area:has_br4_or_more             5.724e+00  1.072e+01   0.534
## year_built:has_garage:has_br2                      1.507e+04  3.604e+04   0.418
## year_built:has_garage:has_br3                      7.331e+03  3.598e+04   0.204
## year_built:has_garage:has_br4_or_more              6.626e+03  3.589e+04   0.185
## heated_area:has_garage:has_br2                     7.170e+03  6.287e+03   1.140
## heated_area:has_garage:has_br3                     1.174e+03  4.971e+03   0.236
## heated_area:has_garage:has_br4_or_more             1.125e+02  5.509e+02   0.204
## year_built:heated_area:has_garage:has_br2         -3.594e+00  3.167e+00  -1.135
## year_built:heated_area:has_garage:has_br3         -5.830e-01  2.502e+00  -0.233
## year_built:heated_area:has_garage:has_br4_or_more         NA         NA      NA
##                                                   Pr(>|t|)
## (Intercept)                                          0.451
## year_built                                           0.450
## heated_area                                          0.445
## has_garage                                           0.707
## has_br2                                              0.562
## has_br3                                              0.795
## has_br4_or_more                                      0.883
## year_built:heated_area                               0.449
## year_built:has_garage                                0.704
## heated_area:has_garage                               0.178
## year_built:has_br2                                   0.565
## year_built:has_br3                                   0.802
## year_built:has_br4_or_more                           0.872
## heated_area:has_br2                                  0.514
## heated_area:has_br3                                  0.474
## heated_area:has_br4_or_more                          0.590
## has_garage:has_br2                                   0.678
## has_garage:has_br3                                   0.839
## has_garage:has_br4_or_more                           0.852
## year_built:heated_area:has_garage                    0.175
## year_built:heated_area:has_br2                       0.518
## year_built:heated_area:has_br3                       0.477
## year_built:heated_area:has_br4_or_more               0.594
## year_built:has_garage:has_br2                        0.676
## year_built:has_garage:has_br3                        0.839
## year_built:has_garage:has_br4_or_more                0.854
## heated_area:has_garage:has_br2                       0.254
## heated_area:has_garage:has_br3                       0.813
## heated_area:has_garage:has_br4_or_more               0.838
## year_built:heated_area:has_garage:has_br2            0.257
## year_built:heated_area:has_garage:has_br3            0.816
## year_built:heated_area:has_garage:has_br4_or_more       NA
## 
## Residual standard error: 118100 on 840 degrees of freedom
## Multiple R-squared:  0.4575, Adjusted R-squared:  0.4381 
## F-statistic: 23.61 on 30 and 840 DF,  p-value: < 2.2e-16
# Three-way interaction model
model_3way <- lm(price ~ (year_built + heated_area + has_garage + has_br)^3,
                 data = sales)

summary(model_3way)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^3, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -361812  -57526  -13227   33497  896580 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                            -1.171e+07  1.552e+07  -0.755   0.4505  
## year_built                              5.948e+03  7.864e+03   0.756   0.4496  
## heated_area                             1.593e+04  2.083e+04   0.765   0.4445  
## has_garage                              2.468e+07  7.155e+07   0.345   0.7303  
## has_br2                                 7.833e+06  1.570e+07   0.499   0.6179  
## has_br3                                 4.690e+06  1.572e+07   0.298   0.7654  
## has_br4_or_more                         4.743e+05  1.618e+07   0.029   0.9766  
## year_built:heated_area                 -7.993e+00  1.055e+01  -0.757   0.4490  
## year_built:has_garage                  -1.247e+04  3.584e+04  -0.348   0.7279  
## year_built:has_br2                     -3.930e+03  7.956e+03  -0.494   0.6215  
## year_built:has_br3                     -2.304e+03  7.964e+03  -0.289   0.7724  
## year_built:has_br4_or_more             -1.168e+02  8.197e+03  -0.014   0.9886  
## heated_area:has_garage                 -3.562e+03  2.134e+03  -1.669   0.0954 .
## heated_area:has_br2                    -1.242e+04  2.093e+04  -0.593   0.5531  
## heated_area:has_br3                    -1.546e+04  2.091e+04  -0.740   0.4598  
## heated_area:has_br4_or_more            -1.307e+04  2.095e+04  -0.624   0.5331  
## has_garage:has_br2                     -2.129e+07  7.153e+07  -0.298   0.7661  
## has_garage:has_br3                     -1.349e+07  7.153e+07  -0.189   0.8504  
## has_garage:has_br4_or_more             -1.469e+07  7.157e+07  -0.205   0.8375  
## year_built:heated_area:has_garage       1.803e+00  1.047e+00   1.722   0.0854 .
## year_built:heated_area:has_br2          6.232e+00  1.061e+01   0.588   0.5569  
## year_built:heated_area:has_br3          7.789e+00  1.059e+01   0.735   0.4622  
## year_built:heated_area:has_br4_or_more  6.562e+00  1.062e+01   0.618   0.5366  
## year_built:has_garage:has_br2           1.076e+04  3.584e+04   0.300   0.7641  
## year_built:has_garage:has_br3           6.783e+03  3.583e+04   0.189   0.8499  
## year_built:has_garage:has_br4_or_more   7.306e+03  3.586e+04   0.204   0.8386  
## heated_area:has_garage:has_br2          5.730e+01  5.516e+02   0.104   0.9173  
## heated_area:has_garage:has_br3          3.168e+01  5.503e+02   0.058   0.9541  
## heated_area:has_garage:has_br4_or_more  1.216e+02  5.504e+02   0.221   0.8251  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118000 on 842 degrees of freedom
## Multiple R-squared:  0.4565, Adjusted R-squared:  0.4384 
## F-statistic: 25.26 on 28 and 842 DF,  p-value: < 2.2e-16

Test for Significant Three-Way Interactions

Before continuing, let us check if any of the three-way interactions are worth keeping in the model.

full <- model_3way

# Reduced model with main effects and two-way interactions
model_2way <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2, 
              data = sales)
reduced <- model_2way

summary(reduced)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362994  -58395  -16323   31733  906634 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.824e+04  3.073e+06   0.006 0.995265    
## year_built                  -1.084e+01  1.557e+03  -0.007 0.994447    
## heated_area                  3.331e+02  9.024e+02   0.369 0.712077    
## has_garage                   3.228e+06  1.071e+06   3.015 0.002647 ** 
## has_br2                     -1.490e+06  3.133e+06  -0.476 0.634519    
## has_br3                     -5.552e+06  3.160e+06  -1.757 0.079297 .  
## has_br4_or_more             -6.658e+06  3.430e+06  -1.941 0.052567 .  
## year_built:heated_area      -7.135e-02  4.569e-01  -0.156 0.875948    
## year_built:has_garage       -1.686e+03  5.366e+02  -3.143 0.001732 ** 
## year_built:has_br2           8.081e+02  1.587e+03   0.509 0.610763    
## year_built:has_br3           2.910e+03  1.600e+03   1.818 0.069344 .  
## year_built:has_br4_or_more   3.484e+03  1.737e+03   2.006 0.045185 *  
## heated_area:has_garage       7.894e+01  2.102e+01   3.755 0.000185 ***
## heated_area:has_br2         -1.542e+02  8.916e+01  -1.730 0.083976 .  
## heated_area:has_br3         -1.394e+02  8.842e+01  -1.576 0.115296    
## heated_area:has_br4_or_more -1.147e+02  8.868e+01  -1.293 0.196258    
## has_garage:has_br2           8.375e+04  9.486e+04   0.883 0.377552    
## has_garage:has_br3          -3.308e+04  9.516e+04  -0.348 0.728177    
## has_garage:has_br4_or_more  -4.208e+04  1.006e+05  -0.418 0.675829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4332 
## F-statistic: 37.94 on 18 and 852 DF,  p-value: < 2.2e-16
test_stat <- anova(reduced, full)

test_stat
## Analysis of Variance Table
## 
## Model 1: price ~ (year_built + heated_area + has_garage + has_br)^2
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^3
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1    852 1.1983e+13                               
## 2    842 1.1733e+13 10 2.4924e+11 1.7885 0.05876 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Test

  • \(H_0:\) all slopes of Three-Way Interactions \(= 0\)
  • \(H_1:\) at least one of slopes of Three-Way Interactions \(\ne 0\)

Test Statistic

  • \(F_0:\) 1.7885

p-value

  • p = 0.0588

Reject Region

  • Reject \(H_0\) if \(p \lt \alpha; \alpha = 0.05\)

Conclusion / Interpretation

  • Fail to Reject \(H_0\). There is not sufficient evidence to suggest that the three-way interactions are significant. Based on this, I can drop the three-way interaction terms from the model. The new model has the same p-value as the model with three-way interactions, but the adjusted \(R^2\) dropped slightly by 0.005. I feel this is a reasonable sacrifice since three-way interactions can be difficult to interpret.
summary(model_2way)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362994  -58395  -16323   31733  906634 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.824e+04  3.073e+06   0.006 0.995265    
## year_built                  -1.084e+01  1.557e+03  -0.007 0.994447    
## heated_area                  3.331e+02  9.024e+02   0.369 0.712077    
## has_garage                   3.228e+06  1.071e+06   3.015 0.002647 ** 
## has_br2                     -1.490e+06  3.133e+06  -0.476 0.634519    
## has_br3                     -5.552e+06  3.160e+06  -1.757 0.079297 .  
## has_br4_or_more             -6.658e+06  3.430e+06  -1.941 0.052567 .  
## year_built:heated_area      -7.135e-02  4.569e-01  -0.156 0.875948    
## year_built:has_garage       -1.686e+03  5.366e+02  -3.143 0.001732 ** 
## year_built:has_br2           8.081e+02  1.587e+03   0.509 0.610763    
## year_built:has_br3           2.910e+03  1.600e+03   1.818 0.069344 .  
## year_built:has_br4_or_more   3.484e+03  1.737e+03   2.006 0.045185 *  
## heated_area:has_garage       7.894e+01  2.102e+01   3.755 0.000185 ***
## heated_area:has_br2         -1.542e+02  8.916e+01  -1.730 0.083976 .  
## heated_area:has_br3         -1.394e+02  8.842e+01  -1.576 0.115296    
## heated_area:has_br4_or_more -1.147e+02  8.868e+01  -1.293 0.196258    
## has_garage:has_br2           8.375e+04  9.486e+04   0.883 0.377552    
## has_garage:has_br3          -3.308e+04  9.516e+04  -0.348 0.728177    
## has_garage:has_br4_or_more  -4.208e+04  1.006e+05  -0.418 0.675829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4332 
## F-statistic: 37.94 on 18 and 852 DF,  p-value: < 2.2e-16

Resulting Regression Line

c <- coefficients(model_2way)
c
##                 (Intercept)                  year_built 
##                1.823913e+04               -1.083981e+01 
##                 heated_area                  has_garage 
##                3.331486e+02                3.228399e+06 
##                     has_br2                     has_br3 
##               -1.489779e+06               -5.551769e+06 
##             has_br4_or_more      year_built:heated_area 
##               -6.657537e+06               -7.135289e-02 
##       year_built:has_garage          year_built:has_br2 
##               -1.686329e+03                8.081133e+02 
##          year_built:has_br3  year_built:has_br4_or_more 
##                2.909934e+03                3.483509e+03 
##      heated_area:has_garage         heated_area:has_br2 
##                7.894349e+01               -1.542497e+02 
##         heated_area:has_br3 heated_area:has_br4_or_more 
##               -1.393855e+02               -1.146905e+02 
##          has_garage:has_br2          has_garage:has_br3 
##                8.375274e+04               -3.308471e+04 
##  has_garage:has_br4_or_more 
##               -4.208227e+04

\[\begin{split} \hat{\text{price}} = 1.824\times 10^{4} - 10.84*\text{year} \\ + 333.1*\text{area} \\ + 3.228\times 10^{6}*\text{garage} \\ - 1.49\times 10^{6}*\text{br}_{2} \\ - 5.552\times 10^{6}*\text{br}_{3} \\ - 6.658\times 10^{6}*\text{br}_{4} \\ - 0.07135*\text{(year:area)} \\ - 1686*\text{(year:garage)} \\ + 808.1*\text{year:br}_{2} \\ + 2910*\text{year:br}_{3} \\ + 3484*\text{year:br}_{4} \\ + 78.94*\text{(area:garage)} \\ - 154.2*\text{area:br}_{2} \\ - 139.4*\text{area:br}_{3} \\ - 114.7*\text{area:br}_{4} \\ + 8.375\times 10^{4}*\text{garage:br}_{2} \\ - 3.308\times 10^{4}*\text{garage:br}_{3} \\ - 4.208\times 10^{4}*\text{garage:br}_{4} \end{split}\]

Test for All Two-Way Significant Interactions

I will check if any of the two-way interactions are significant to determine if I can further simplify the model.

full <- model_2way
reduced <- model_1

test_stat <- anova(reduced, full)

test_stat
## Analysis of Variance Table
## 
## Model 1: price ~ year_built + heated_area + has_garage + has_br
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    864 1.2842e+13                                   
## 2    852 1.1983e+13 12 8.5935e+11 5.0918 2.959e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Test

  • \(H_0:\) all slopes of Two-Way Interactions \(= 0\)
  • \(H_1:\) at least one of slopes of Two-Way Interactions \(\ne 0\)

Test Statistic

  • \(F_0:\) 5.0918

p-value

  • p < 0.0001

Reject Region

  • Reject \(H_0\) if \(p \lt \alpha; \alpha = 0.05\)

Conclusion / Interpretation

  • Reject \(H_0\). There is sufficient evidence to suggest that at least one of the two-way interactions is significant. I will leave all the interactions in the model for now.

Test for Year Built/Heated Area Interaction

Although I cannot remove all interactions, I can remove one interaction (year built and heated area) without affecting the rest of the model. Below is the t-test for the year built and heated area interaction.

s2 <- summary(model_2way)
s2
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362994  -58395  -16323   31733  906634 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.824e+04  3.073e+06   0.006 0.995265    
## year_built                  -1.084e+01  1.557e+03  -0.007 0.994447    
## heated_area                  3.331e+02  9.024e+02   0.369 0.712077    
## has_garage                   3.228e+06  1.071e+06   3.015 0.002647 ** 
## has_br2                     -1.490e+06  3.133e+06  -0.476 0.634519    
## has_br3                     -5.552e+06  3.160e+06  -1.757 0.079297 .  
## has_br4_or_more             -6.658e+06  3.430e+06  -1.941 0.052567 .  
## year_built:heated_area      -7.135e-02  4.569e-01  -0.156 0.875948    
## year_built:has_garage       -1.686e+03  5.366e+02  -3.143 0.001732 ** 
## year_built:has_br2           8.081e+02  1.587e+03   0.509 0.610763    
## year_built:has_br3           2.910e+03  1.600e+03   1.818 0.069344 .  
## year_built:has_br4_or_more   3.484e+03  1.737e+03   2.006 0.045185 *  
## heated_area:has_garage       7.894e+01  2.102e+01   3.755 0.000185 ***
## heated_area:has_br2         -1.542e+02  8.916e+01  -1.730 0.083976 .  
## heated_area:has_br3         -1.394e+02  8.842e+01  -1.576 0.115296    
## heated_area:has_br4_or_more -1.147e+02  8.868e+01  -1.293 0.196258    
## has_garage:has_br2           8.375e+04  9.486e+04   0.883 0.377552    
## has_garage:has_br3          -3.308e+04  9.516e+04  -0.348 0.728177    
## has_garage:has_br4_or_more  -4.208e+04  1.006e+05  -0.418 0.675829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4332 
## F-statistic: 37.94 on 18 and 852 DF,  p-value: < 2.2e-16

Hypothesis Test

  • \(H_0:\beta_{(year:area)} = 0\)
  • \(H_1:\beta_{(year:area)} \ne 0\)

Test Statistic

  • \(t_0:\) -0.1562

p-value

  • p = 0.8759

Reject Region

  • Reject \(H_0\) if \(p \lt \alpha; \alpha = 0.05\)

Conclusion / Interpretation

  • Fail to Reject \(H_0\). There is not sufficient evidence to suggest that the interaction between year built and heated area is a significant predictor of sales price. I can remove the interaction of year built and heated area from the model. Removing the interaction does not affect the p-value of the entire model and very slightly improves the Adjusted \(R^2\) of the new model.
#Constructing new model
model_2 <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2 - year_built:heated_area, 
              data = sales)

summary(model_2)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2 - year_built:heated_area, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362688  -58292  -16329   31438  908343 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.253e+05  2.993e+06   0.042 0.966613    
## year_built                  -6.532e+01  1.517e+03  -0.043 0.965656    
## heated_area                  1.929e+02  8.722e+01   2.212 0.027254 *  
## has_garage                   3.278e+06  1.023e+06   3.204 0.001404 ** 
## has_br2                     -1.454e+06  3.123e+06  -0.466 0.641620    
## has_br3                     -5.475e+06  3.120e+06  -1.755 0.079631 .  
## has_br4_or_more             -6.523e+06  3.317e+06  -1.966 0.049595 *  
## year_built:has_garage       -1.710e+03  5.138e+02  -3.329 0.000909 ***
## year_built:has_br2           7.899e+02  1.582e+03   0.499 0.617691    
## year_built:has_br3           2.871e+03  1.580e+03   1.817 0.069539 .  
## year_built:has_br4_or_more   3.416e+03  1.680e+03   2.033 0.042400 *  
## heated_area:has_garage       7.747e+01  1.879e+01   4.124 4.09e-05 ***
## heated_area:has_br2         -1.542e+02  8.911e+01  -1.731 0.083868 .  
## heated_area:has_br3         -1.399e+02  8.830e+01  -1.585 0.113420    
## heated_area:has_br4_or_more -1.153e+02  8.853e+01  -1.303 0.193035    
## has_garage:has_br2           8.367e+04  9.481e+04   0.883 0.377725    
## has_garage:has_br3          -3.255e+04  9.505e+04  -0.342 0.732087    
## has_garage:has_br4_or_more  -4.118e+04  1.004e+05  -0.410 0.681712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118500 on 853 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4339 
## F-statistic: 40.22 on 17 and 853 DF,  p-value: < 2.2e-16

Test if Bedroom/Garage Interaction is Significant

I reviewed the summary table for the model that includes all two-way interactions. Two-way interactions that consider if the house has a garage are shown to be significant when looking at the p-values. However, the p-values that include interaction between bedroom and garage are greater than our reject region. Let us see if the bedroom-garage interaction helps in the prediction of house sales price.

full <- model_2

# Model with Bedroom-Garage interactions removed
reduced <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2 - year_built:heated_area - (has_garage:has_br),
              data = sales)

summary(reduced)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2 - year_built:heated_area - (has_garage:has_br), 
##     data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -387631  -63674  -15902   31533  959313 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  2.794e+05  3.017e+06   0.093 0.926222    
## year_built                  -1.429e+02  1.530e+03  -0.093 0.925627    
## heated_area                  1.874e+02  7.118e+01   2.633 0.008611 ** 
## has_garage                   3.574e+06  1.032e+06   3.463 0.000561 ***
## has_br2                     -3.263e+06  3.128e+06  -1.043 0.297131    
## has_br3                     -4.890e+06  3.138e+06  -1.558 0.119582    
## has_br4_or_more             -6.291e+06  3.316e+06  -1.897 0.058144 .  
## year_built:has_garage       -1.831e+03  5.202e+02  -3.519 0.000456 ***
## year_built:has_br2           1.696e+03  1.586e+03   1.069 0.285387    
## year_built:has_br3           2.551e+03  1.591e+03   1.604 0.109078    
## year_built:has_br4_or_more   3.273e+03  1.680e+03   1.948 0.051724 .  
## heated_area:has_garage       4.849e+01  1.778e+01   2.727 0.006518 ** 
## heated_area:has_br2         -1.049e+02  7.303e+01  -1.437 0.151173    
## heated_area:has_br3         -1.219e+02  7.262e+01  -1.679 0.093614 .  
## heated_area:has_br4_or_more -9.401e+01  7.302e+01  -1.288 0.198236    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120200 on 856 degrees of freedom
## Multiple R-squared:  0.427,  Adjusted R-squared:  0.4177 
## F-statistic: 45.57 on 14 and 856 DF,  p-value: < 2.2e-16
test_stat <- anova(reduced, full)

test_stat
## Analysis of Variance Table
## 
## Model 1: price ~ (year_built + heated_area + has_garage + has_br)^2 - 
##     year_built:heated_area - (has_garage:has_br)
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2 - 
##     year_built:heated_area
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    856 1.2370e+13                                   
## 2    853 1.1983e+13  3 3.8652e+11 9.1713 5.622e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Test

  • \(H_0:\) all slopes of Bedroom-Garage Two-Way Interaction \(= 0\)
  • \(H_1:\) at least one of the slopes \(\ne 0\)

Test Statistic

  • \(F_0:\) 9.1713

p-value

  • p < 0.0001

Reject Region

  • Reject \(H_0\) if \(p \lt \alpha; \alpha = 0.05\)

Conclusion / Interpretation

  • Reject \(H_0\). There is sufficient evidence to suggest that the bedroom and garage interaction is significant. This interaction will stay in the model.

Test for Dropping Continuous Predictors (Year Built and Heated Area)

Now I will check if the continuous predictors are significant. If the continuous predictors are found to be insignificant, then their corresponding interactions will also have to be removed from the model.

full <- model_2

# Model with Year Built, Heated Area, and their corresponding interactions removed
reduced <- lm(price ~ (has_garage + has_br)^2,
              data = sales)

summary(reduced)
## 
## Call:
## lm(formula = price ~ (has_garage + has_br)^2, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -441893  -73447  -19027   40923  988107 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  129631      33901   3.824 0.000141 ***
## has_garage                   121602      85316   1.425 0.154428    
## has_br2                        6815      35443   0.192 0.847562    
## has_br3                      128177      36645   3.498 0.000493 ***
## has_br4_or_more              206769      42783   4.833 1.59e-06 ***
## has_garage:has_br2           -32012      86448  -0.370 0.711245    
## has_garage:has_br3           -80384      86789  -0.926 0.354601    
## has_garage:has_br4_or_more    23891      91697   0.261 0.794511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135600 on 863 degrees of freedom
## Multiple R-squared:  0.2649, Adjusted R-squared:  0.2589 
## F-statistic: 44.43 on 7 and 863 DF,  p-value: < 2.2e-16
test_stat <- anova(reduced, full)

test_stat
## Analysis of Variance Table
## 
## Model 1: price ~ (has_garage + has_br)^2
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2 - 
##     year_built:heated_area
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    863 1.5869e+13                                   
## 2    853 1.1983e+13 10 3.8864e+12 27.665 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis Test

  • \(H_0:\) all slopes of Year Built/Heated Area Two-Way Interactions \(= \beta_{year} = \beta_{area} = 0\)
  • \(H_1:\) at least one of the slopes \(\ne 0\)

Test Statistic

  • \(F_0:\) 27.665

p-value

  • p < 0.0001

Reject Region

  • Reject \(H_0\) if \(p \lt \alpha; \alpha = 0.05\)

Conclusion / Interpretation

  • Reject \(H_0\). There is sufficient evidence to suggest the year built and heated area are significant predictors of sales price. I will be keep both in the model for now.

Reconstructing Regression Line based on Tests

After performing tests above, the regression line for the model is below.

c <- coefficients(model_2)
c
##                 (Intercept)                  year_built 
##                1.253312e+05               -6.531881e+01 
##                 heated_area                  has_garage 
##                1.928996e+02                3.277593e+06 
##                     has_br2                     has_br3 
##               -1.453852e+06               -5.475093e+06 
##             has_br4_or_more       year_built:has_garage 
##               -6.522773e+06               -1.710359e+03 
##          year_built:has_br2          year_built:has_br3 
##                7.898488e+02                2.871321e+03 
##  year_built:has_br4_or_more      heated_area:has_garage 
##                3.415623e+03                7.747259e+01 
##         heated_area:has_br2         heated_area:has_br3 
##               -1.542150e+02               -1.399237e+02 
## heated_area:has_br4_or_more          has_garage:has_br2 
##               -1.153323e+02                8.367309e+04 
##          has_garage:has_br3  has_garage:has_br4_or_more 
##               -3.254989e+04               -4.118238e+04

\[\begin{split} \hat{\text{price}} = 1.253\times 10^{5} - 65.32*\text{year} \\ + 192.9*\text{area} \\ + 3.278\times 10^{6}*\text{garage} \\ - 1.454\times 10^{6}*\text{br}_{2} \\ - 5.475\times 10^{6}*\text{br}_{3} \\ - 6.523\times 10^{6}*\text{br}_{4} \\ - 1710*\text{(year:garage)} \\ + 789.8*\text{year:br}_{2} \\ + 2871*\text{year:br}_{3} \\ + 3416*\text{year:br}_{4} \\ + 77.47*\text{(area:garage)} \\ - 154.2*\text{area:br}_{2} \\ - 139.9*\text{area:br}_{3} \\ - 115.3*\text{area:br}_{4} \\ + 8.367\times 10^{4}*\text{garage:br}_{2} \\ - 3.255\times 10^{4}*\text{garage:br}_{3} \\ - 4.118\times 10^{4}*\text{garage:br}_{4} \end{split}\]

Test for Significant Predictors

Since the interactions are significant, the remaining predictors are also significant.

summary(model_2)
## 
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage + 
##     has_br)^2 - year_built:heated_area, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362688  -58292  -16329   31438  908343 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.253e+05  2.993e+06   0.042 0.966613    
## year_built                  -6.532e+01  1.517e+03  -0.043 0.965656    
## heated_area                  1.929e+02  8.722e+01   2.212 0.027254 *  
## has_garage                   3.278e+06  1.023e+06   3.204 0.001404 ** 
## has_br2                     -1.454e+06  3.123e+06  -0.466 0.641620    
## has_br3                     -5.475e+06  3.120e+06  -1.755 0.079631 .  
## has_br4_or_more             -6.523e+06  3.317e+06  -1.966 0.049595 *  
## year_built:has_garage       -1.710e+03  5.138e+02  -3.329 0.000909 ***
## year_built:has_br2           7.899e+02  1.582e+03   0.499 0.617691    
## year_built:has_br3           2.871e+03  1.580e+03   1.817 0.069539 .  
## year_built:has_br4_or_more   3.416e+03  1.680e+03   2.033 0.042400 *  
## heated_area:has_garage       7.747e+01  1.879e+01   4.124 4.09e-05 ***
## heated_area:has_br2         -1.542e+02  8.911e+01  -1.731 0.083868 .  
## heated_area:has_br3         -1.399e+02  8.830e+01  -1.585 0.113420    
## heated_area:has_br4_or_more -1.153e+02  8.853e+01  -1.303 0.193035    
## has_garage:has_br2           8.367e+04  9.481e+04   0.883 0.377725    
## has_garage:has_br3          -3.255e+04  9.505e+04  -0.342 0.732087    
## has_garage:has_br4_or_more  -4.118e+04  1.004e+05  -0.410 0.681712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118500 on 853 degrees of freedom
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.4339 
## F-statistic: 40.22 on 17 and 853 DF,  p-value: < 2.2e-16

Visualizations

For the first visualization, I will model houses with attached garages but different amount of bedrooms. Price is the outcome and will be on the Y-axis. The continuous predictor of Heated Area will be on the X-axis. I will use different line for the different amount of bedrooms. I will assume this house is average age and was built in 1984.

Graphing Heated Area vs Price (year built: 1984)

sales <- sales %>%
  mutate(ya_g1_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*0,
         
         ya_g1_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*1
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*1
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*1
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*0,
         
         ya_g1_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*1
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*1
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*1
         + c["has_garage:has_br4_or_more"]*1*0,
         
         ya_g1_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*1
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*1
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*1
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*1
         )

sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = ya_g1_b1), color = "red") +
  geom_line(aes(y = ya_g1_b2), color = "green") +
  geom_line(aes(y = ya_g1_b3), color = "blue") +
  geom_line(aes(y = ya_g1_b4), color = "purple") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for Heated Area with Attached Garage")

sales %>% filter(bedrooms == 1) %>% 
  ggplot(aes(x = heated_area, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = ya_g1_b1), color = "red") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for 1-Bedroom Houses")

The graph shows a 1-bedroom house that is close to 2000 sq-ft that sold for nearly $500,000. This observation falls in line with the other observations as unlikely as it seems. This could indicate a similar issue to the no-bedroom houses discussed earlier in this analysis.

sales <- sales %>%
  mutate(ya_g0_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         ya_g0_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*1
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*1
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*1
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         ya_g0_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*1
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*1
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*1
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*1
         + c["has_garage:has_br4_or_more"]*0*0,
         
         ya_g0_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*1
         + c["year_built:has_garage"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br2"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br3"]*round(year_summary$mean, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*1
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*1
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*1
         )

sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = ya_g1_b1), color = "red") +
  geom_line(aes(y = ya_g1_b2), color = "green") +
  geom_line(aes(y = ya_g1_b3), color = "blue") +
  geom_line(aes(y = ya_g1_b4), color = "purple") +
  geom_line(aes(y = ya_g0_b1), color = "red", alpha = 0.5) +
  geom_line(aes(y = ya_g0_b2), color = "green", alpha = 0.5) +
  geom_line(aes(y = ya_g0_b3), color = "blue", alpha = 0.5) +
  geom_line(aes(y = ya_g0_b4), color = "purple", alpha = 0.5) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for Heated Area (No Attached Garage is lighter color)")

Let us compare all the predicted values at the median heated area of 1,361 sq-ft. I will also display predicted value at the smallest area of 136 and the largest of 5,731.

sales %>% select(heated_area, 
                 ya_g1_b1,
                 ya_g1_b2,
                 ya_g1_b3,
                 ya_g1_b4,
                 ya_g0_b1,
                 ya_g0_b2,
                 ya_g0_b3,
                 ya_g0_b4,
                 ) %>% 
  filter(heated_area == round(area_summary$median, 0) | heated_area == round(area_summary$min, 0) | heated_area == round(area_summary$max, 0)) %>% 
  # There could be multiple actual homes with these areas. We only need one of each
  # since we only want the predicted values
  distinct %>% 
  arrange(heated_area)
## # A tibble: 3 × 9
##   heated_area ya_g1_b1 ya_g1_b2 ya_g1_b3 ya_g1_b4 ya_g0_b1 ya_g0_b2 ya_g0_b3
##         <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1         136  -83249.   92659.   86780.  113707.   21973.  114208.  224551.
## 2        1361  247957.  234952.  246579.  303631.  258275.  161596.  289447.
## 3        5731 1429484.  742559.  816639.  981155. 1101247.  330648.  520952.
## # … with 1 more variable: ya_g0_b4 <dbl>

Graphing Heated Area vs Price (year built: 2021)

Let us plot if the house was built last year using the same X and Y axis.

sales <- sales %>%
  mutate(yb_g1_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*1
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*0,
         
         yb_g1_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*1
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*1
         + c["year_built:has_br2"]*round(year_summary$max, 0)*1
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*1
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*1
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*0,
         
         yb_g1_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*1
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*1
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*1
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*1
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*1
         + c["has_garage:has_br4_or_more"]*1*0,
         
         yb_g1_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*1 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*1
         + c["year_built:has_garage"]*round(year_summary$max, 0)*1
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*1
         + c["heated_area:has_garage"]*heated_area*1
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*1
         + c["has_garage:has_br2"]*1*0
         + c["has_garage:has_br3"]*1*0
         + c["has_garage:has_br4_or_more"]*1*1
         )

sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = yb_g1_b1), color = "red") +
  geom_line(aes(y = yb_g1_b2), color = "green") +
  geom_line(aes(y = yb_g1_b3), color = "blue") +
  geom_line(aes(y = yb_g1_b4), color = "purple") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for Heated Area with Attached Garage")

sales <- sales %>%
  mutate(yb_g0_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*0
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yb_g0_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*1
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*0
         + c["year_built:has_br2"]*round(year_summary$max, 0)*1
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*1
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*1
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yb_g0_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*1
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*round(year_summary$max, 0)*0
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*1
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*1
         + c["heated_area:has_br4_or_more"]*heated_area*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*1
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yb_g0_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
         + c["heated_area"]*heated_area + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*1
         + c["year_built:has_garage"]*round(year_summary$max, 0)*0
         + c["year_built:has_br2"]*round(year_summary$max, 0)*0
         + c["year_built:has_br3"]*round(year_summary$max, 0)*0
         + c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*1
         + c["heated_area:has_garage"]*heated_area*0
         + c["heated_area:has_br2"]*heated_area*0
         + c["heated_area:has_br3"]*heated_area*0
         + c["heated_area:has_br4_or_more"]*heated_area*1
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*1
         )

sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = yb_g1_b1), color = "red") +
  geom_line(aes(y = yb_g1_b2), color = "green") +
  geom_line(aes(y = yb_g1_b3), color = "blue") +
  geom_line(aes(y = yb_g1_b4), color = "purple") +
  geom_line(aes(y = yb_g0_b1), color = "red", alpha = .5) +
  geom_line(aes(y = yb_g0_b2), color = "green", alpha = .5) +
  geom_line(aes(y = yb_g0_b3), color = "blue", alpha = .5) +
  geom_line(aes(y = yb_g0_b4), color = "purple", alpha = .5) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for Heated Area (No Attached Garage is lighter color)")

Again, let us compare the predicted values at the median, minimum, and maximum heated area of 1,361, 136, and 5,731 sq-ft respectively.

sales %>% select(heated_area, 
                 yb_g1_b1,
                 yb_g1_b2,
                 yb_g1_b3,
                 yb_g1_b4,
                 yb_g0_b1,
                 yb_g0_b2,
                 yb_g0_b3,
                 yb_g0_b4,
                 ) %>% 
  filter(heated_area == round(area_summary$median, 0) | heated_area == round(area_summary$min, 0) | heated_area == round(area_summary$max, 0)) %>% 
  # There could be multiple actual homes with these areas. We only need one of each
  # since we only want the predicted values
  distinct %>% 
  arrange(heated_area)
## # A tibble: 3 × 9
##   heated_area yb_g1_b1 yb_g1_b2 yb_g1_b3 yb_g1_b4 yb_g0_b1 yb_g0_b2 yb_g0_b3
##         <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1         136 -148949.   56183.  127318.  174385.   19556.  141015.  328373.
## 2        1361  182257.  198476.  287118.  364309.  255858.  188404.  393269.
## 3        5731 1363784.  706083.  857178. 1041833. 1098830.  357456.  624774.
## # … with 1 more variable: yb_g0_b4 <dbl>

Graphing Year Built vs Price

Now, I will graph the year built on the X-axis. I set heated area constant at 1361 and assume the house does not have a garage.

sales <- sales %>%
  mutate(yc_g0_b1 = c["(Intercept)"] + c["year_built"]*year_built
         + c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*year_built*0
         + c["year_built:has_br2"]*year_built*0
         + c["year_built:has_br3"]*year_built*0
         + c["year_built:has_br4_or_more"]*year_built*0
         + c["heated_area:has_garage"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br2"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br3"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yc_g0_b2 = c["(Intercept)"] + c["year_built"]*year_built
         + c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0 
         + c["has_br2"]*1
         + c["has_br3"]*0
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*year_built*0
         + c["year_built:has_br2"]*year_built*1
         + c["year_built:has_br3"]*year_built*0
         + c["year_built:has_br4_or_more"]*year_built*0
         + c["heated_area:has_garage"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br2"]*round(area_summary$median, 0)*1
         + c["heated_area:has_br3"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
         + c["has_garage:has_br2"]*0*1
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yc_g0_b3 = c["(Intercept)"] + c["year_built"]*year_built
         + c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*1
         + c["has_br4_or_more"]*0
         + c["year_built:has_garage"]*year_built*0
         + c["year_built:has_br2"]*year_built*0
         + c["year_built:has_br3"]*year_built*1
         + c["year_built:has_br4_or_more"]*year_built*0
         + c["heated_area:has_garage"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br2"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br3"]*round(area_summary$median, 0)*1
         + c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*1
         + c["has_garage:has_br4_or_more"]*0*0,
         
         yc_g0_b4 = c["(Intercept)"] + c["year_built"]*year_built
         + c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0 
         + c["has_br2"]*0
         + c["has_br3"]*0
         + c["has_br4_or_more"]*1
         + c["year_built:has_garage"]*year_built*0
         + c["year_built:has_br2"]*year_built*0
         + c["year_built:has_br3"]*year_built*0
         + c["year_built:has_br4_or_more"]*year_built*1
         + c["heated_area:has_garage"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br2"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br3"]*round(area_summary$median, 0)*0
         + c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*1
         + c["has_garage:has_br2"]*0*0
         + c["has_garage:has_br3"]*0*0
         + c["has_garage:has_br4_or_more"]*0*1
         )

sales %>% ggplot(aes(x = year_built, y = price, color = has_br)) +
  geom_point(size=0.5) +
  geom_line(aes(y = yc_g0_b1), color = "red") +
  geom_line(aes(y = yc_g0_b2), color = "green") +
  geom_line(aes(y = yc_g0_b3), color = "blue") +
  geom_line(aes(y = yc_g0_b4), color = "purple") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                       scientific = FALSE)) +
  theme_minimal() +
  ggtitle("Plot for Year Built")

Interpretations of Predictors

When the year built is held constant, the sales price on average increases as heated area increases. One-bedroom house expected sales prices increase the most rapidly as heated area increases. As mentioned earlier, these could be houses next to multi-bedroom houses that sold on the same deal.

For houses built in 1984 and in 2021, houses with garages have a greater increase in expected sale price as heated area increases than houses without an attached garage. However, houses with garages below a certain heated area are expected to sale for less than houses without a garage. Three-bedroom houses without garages are expected to sale for more money than two-bedroom houses without garages. Meanwhile, houses with 4 or more bedrooms are consistently expected to sale for more than both two- and three-bedroom houses as heated area increases.

When year built increases, expected sales prices increases for three- and four or more bedroom houses. Sales price for two-bedroom houses increase at a lower rate, and one-bedroom houses decrease in sales price as the year built increases.

The most surprising result of the model is that having an attached garage only helps the expected sales price if the house is above a particular square footage.

References