This regression model seeks to determine if the sales price of a single-family home can be predicted from the house’s characteristics. My analysis will use over 3 months of sales data from my county’s appraisal website. Although it is publicly available, I removed the owner, address, parcel ID, and date of sale from data before importing into R. The two continuous predictors will be the year built and heated area of the house. A categorical predictor will be whether the house has an attached garage. The other categorical predictor will be based on the number of bedrooms with houses split into groups - 1, 2, 3, and 4 or more bedrooms.
The dataset will be composed of one file that contains the sales price, year built, heated area, number of bedrooms, and if the house has an attached garage.
sales_raw <- read.csv("~/sales.csv") %>%
as_tibble
sales_raw
## # A tibble: 876 × 12
## nh_cd price qualified use_code acreage vacant year_built heated_area
## <dbl> <int> <chr> <int> <dbl> <chr> <int> <int>
## 1 1240 230000 Qualified 1 0.43 Improved 2006 1819
## 2 3031 325000 Qualified 1 1.63 Improved 1974 1882
## 3 1115 274300 Qualified 1 1 Improved 1984 1436
## 4 1228 487000 Qualified 1 0.79 Improved 1988 3115
## 5 3209 1100000 Qualified 1 0.54 Improved 2019 3979
## 6 3202 285000 Qualified 1 0.37 Improved 1979 1229
## 7 4593 63000 Qualified 1 0.01 Improved 1973 575
## 8 1083 335000 Qualified 1 0.29 Improved 1995 2213
## 9 4491 170000 Qualified 1 0.04 Improved 2006 1278
## 10 4485 225000 Qualified 1 0.25 Improved 1986 1681
## # … with 866 more rows, and 4 more variables: bedrooms <int>,
## # bathroom_fixtures <int>, has_garage <int>, millage <int>
ggplot(sales_raw, aes(bedrooms)) +
geom_histogram(binwidth = 1, color = "#000000", fill = "#0099F8") +
geom_text(aes(label = ..count..), stat = "count", vjust = -0.25) +
scale_x_continuous(breaks=seq(0, 7, by = 1)) +
xlab("# of Bedrooms") +
theme_classic()
bedroom_frequency <- sales_raw %>%
group_by(bedrooms) %>%
summarise(cnt = n()) %>%
mutate(pct = round(cnt / sum(cnt) * 100, 1)) %>%
arrange(desc(pct))
bedroom_frequency
## # A tibble: 7 × 3
## bedrooms cnt pct
## <int> <int> <dbl>
## 1 3 402 45.9
## 2 2 382 43.6
## 3 4 54 6.2
## 4 1 19 2.2
## 5 5 13 1.5
## 6 0 5 0.6
## 7 7 1 0.1
top_2 <- slice_max(bedroom_frequency, cnt, n = 2)
The data show that 3- and 2-bedroom houses accounted for a combined 89.5% of houses sold. We can also see 5 houses sold with no bedrooms.
filter(sales_raw, bedrooms == 0)
## # A tibble: 5 × 12
## nh_cd price qualified use_code acreage vacant year_built heated_area bedrooms
## <dbl> <int> <chr> <int> <dbl> <chr> <int> <int> <int>
## 1 561 100000 Qualified 1 4.79 Impro… 1994 784 0
## 2 561 100000 Qualified 1 4.79 Impro… 1977 1440 0
## 3 1241 380000 Qualified 1 0.5 Impro… 2021 900 0
## 4 1241 380000 Qualified 1 0.5 Impro… 2021 1498 0
## 5 5130 499900 Qualified 1 1.37 Impro… 1957 1914 0
## # … with 3 more variables: bathroom_fixtures <int>, has_garage <int>,
## # millage <int>
Upon inspection, 4 of the “no bedroom” houses are composed of 2 houses with buildings on adjacent parcels. The last “house” appears to be a single building on a large parcel. I will exclude the “no bedroom” houses since the sales price listed for these is actually based on the sales price of the adjacent house.
sales <- sales_raw %>%
# Remove 0 bedroom houses
filter(bedrooms != 0) %>%
# Group together houses with 4 or more bedrooms
mutate(has_br = if_else(bedrooms >= 4, '4_or_more', paste(bedrooms))) %>%
# Select outcome and predictor columns
select(price,
year_built,
heated_area,
bedrooms,
has_br,
has_garage) %>%
# Create Dummy Columns for Bedroom Groups and Garage
dummy_cols('has_br')
sales
## # A tibble: 871 × 10
## price year_built heated_area bedrooms has_br has_garage has_br_1 has_br_2
## <int> <int> <int> <int> <chr> <int> <int> <int>
## 1 230000 2006 1819 3 3 1 0 0
## 2 325000 1974 1882 3 3 1 0 0
## 3 274300 1984 1436 3 3 1 0 0
## 4 487000 1988 3115 5 4_or_mo… 1 0 0
## 5 1100000 2019 3979 5 4_or_mo… 1 0 0
## 6 285000 1979 1229 3 3 0 0 0
## 7 63000 1973 575 1 1 0 1 0
## 8 335000 1995 2213 3 3 1 0 0
## 9 170000 2006 1278 2 2 1 0 1
## 10 225000 1986 1681 3 3 1 0 0
## # … with 861 more rows, and 2 more variables: has_br_3 <int>,
## # has_br_4_or_more <int>
ggplot(sales, aes(x=year_built, y=price)) +
geom_point(size=0.5) +
xlab("Year Built") +
ylab("Sales Price") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal()
ggplot(sales, aes(x=heated_area, y=price)) +
geom_point(size=0.5) +
xlab("Heated Area") +
ylab("Sales Price") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal()
# Histogram of Year Built
ggplot(sales, aes(year_built)) +
geom_histogram(binwidth = 5) +
xlab("Year Built") +
theme_minimal()
# Year Built Summary Stats
year_summary <- summarize(sales, mean = mean(year_built, na.rm=TRUE),
max = max(year_built, na.rm=TRUE),
median = median(year_built, na.rm=TRUE),
min = min(year_built, na.rm=TRUE))
The average year built for houses sold is around 1984, and the median is 1987. The most recent home was built in 2021, and the oldest was built in 1914.
# Histogram of Heated Area
ggplot(sales, aes(heated_area)) +
geom_histogram(binwidth = 100) +
xlab("Heated Area") +
theme_minimal()
# Year Built Summary Stats
area_summary <- summarize(sales, mean = mean(heated_area, na.rm=TRUE),
sd = sd(heated_area, na.rm=TRUE),
min = min(heated_area, na.rm=TRUE),
max = max(heated_area, na.rm=TRUE),
median = median(heated_area, na.rm=TRUE),
IQR = IQR(heated_area, na.rm=TRUE))
The heated area has a mean of 1,428 sq-ft with a standard deviation of 607 sq-ft. The median is 1,361 sq-ft with an interquartile range of 650 sq-ft.
# Frequency of houses that have a garage
garage_frequency <- sales %>%
group_by(has_garage) %>%
summarise(cnt = n()) %>%
mutate(pct = round(cnt / sum(cnt) * 100, 1)) %>%
arrange(desc(pct))
# Pie Chart
ggplot(garage_frequency, aes(x = "", y = pct, fill = factor(has_garage))) +
geom_col() +
geom_text(aes(label = paste(pct,"%")),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y") +
guides(fill = guide_legend(title = "Has Garage (1=Yes)")) +
theme_void()
About 64% of the houses that sold had attached garages.
I will construct a model with both continuous and both categorical predictors while removing the reference group (houses with 1 bedroom) from the model.
# Constructing the model with dummy columns
model_dummy <- lm(price ~ year_built + heated_area + has_garage + has_br_2 + has_br_3 + has_br_4_or_more,
data = sales)
summary(model_dummy)
##
## Call:
## lm(formula = price ~ year_built + heated_area + has_garage +
## has_br_2 + has_br_3 + has_br_4_or_more, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368804 -62904 -20153 35024 1027678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.706e+06 4.976e+05 -5.438 7.00e-08 ***
## year_built 1.401e+03 2.522e+02 5.556 3.68e-08 ***
## heated_area 1.141e+02 8.340e+00 13.680 < 2e-16 ***
## has_garage 5.536e+03 1.066e+04 0.519 0.60358
## has_br_2 -1.686e+04 2.893e+04 -0.583 0.56028
## has_br_3 2.271e+04 2.954e+04 0.769 0.44223
## has_br_4_or_more 1.045e+05 3.336e+04 3.134 0.00178 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121900 on 864 degrees of freedom
## Multiple R-squared: 0.4051, Adjusted R-squared: 0.401
## F-statistic: 98.07 on 6 and 864 DF, p-value: < 2.2e-16
# Constructing the same model without dummy columns (will use this method to avoid typos)
model_1 <- lm(price ~ year_built + heated_area + has_garage + has_br,
data = sales)
summary(model_1)
##
## Call:
## lm(formula = price ~ year_built + heated_area + has_garage +
## has_br, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368804 -62904 -20153 35024 1027678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.706e+06 4.976e+05 -5.438 7.00e-08 ***
## year_built 1.401e+03 2.522e+02 5.556 3.68e-08 ***
## heated_area 1.141e+02 8.340e+00 13.680 < 2e-16 ***
## has_garage 5.536e+03 1.066e+04 0.519 0.60358
## has_br2 -1.686e+04 2.893e+04 -0.583 0.56028
## has_br3 2.271e+04 2.954e+04 0.769 0.44223
## has_br4_or_more 1.045e+05 3.336e+04 3.134 0.00178 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121900 on 864 degrees of freedom
## Multiple R-squared: 0.4051, Adjusted R-squared: 0.401
## F-statistic: 98.07 on 6 and 864 DF, p-value: < 2.2e-16
Now, I will construct a model with three-way interactions. I will avoid including the four-way interaction since it would be difficult to interpret. It also creates a singularity. See below.
# Example of creating 4-way interaction (will not be used)
model_4way <- lm(price ~ year_built * heated_area * has_garage * has_br,
data = sales)
summary(model_4way)
##
## Call:
## lm(formula = price ~ year_built * heated_area * has_garage *
## has_br, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -360825 -57773 -12565 33537 891915
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) -1.171e+07 1.552e+07 -0.755
## year_built 5.948e+03 7.866e+03 0.756
## heated_area 1.593e+04 2.083e+04 0.765
## has_garage 2.697e+07 7.169e+07 0.376
## has_br2 9.127e+06 1.574e+07 0.580
## has_br3 4.101e+06 1.577e+07 0.260
## has_br4_or_more -2.512e+06 1.702e+07 -0.148
## year_built:heated_area -7.993e+00 1.055e+01 -0.757
## year_built:has_garage -1.364e+04 3.591e+04 -0.380
## heated_area:has_garage -5.547e+03 4.112e+03 -1.349
## year_built:has_br2 -4.590e+03 7.978e+03 -0.575
## year_built:has_br3 -2.007e+03 7.994e+03 -0.251
## year_built:has_br4_or_more 1.393e+03 8.625e+03 0.162
## heated_area:has_br2 -1.368e+04 2.097e+04 -0.652
## heated_area:has_br3 -1.500e+04 2.094e+04 -0.716
## heated_area:has_br4_or_more -1.141e+04 2.117e+04 -0.539
## has_garage:has_br2 -2.984e+07 7.194e+07 -0.415
## has_garage:has_br3 -1.456e+07 7.182e+07 -0.203
## has_garage:has_br4_or_more -1.333e+07 7.164e+07 -0.186
## year_built:heated_area:has_garage 2.811e+00 2.069e+00 1.359
## year_built:heated_area:has_br2 6.873e+00 1.062e+01 0.647
## year_built:heated_area:has_br3 7.554e+00 1.061e+01 0.712
## year_built:heated_area:has_br4_or_more 5.724e+00 1.072e+01 0.534
## year_built:has_garage:has_br2 1.507e+04 3.604e+04 0.418
## year_built:has_garage:has_br3 7.331e+03 3.598e+04 0.204
## year_built:has_garage:has_br4_or_more 6.626e+03 3.589e+04 0.185
## heated_area:has_garage:has_br2 7.170e+03 6.287e+03 1.140
## heated_area:has_garage:has_br3 1.174e+03 4.971e+03 0.236
## heated_area:has_garage:has_br4_or_more 1.125e+02 5.509e+02 0.204
## year_built:heated_area:has_garage:has_br2 -3.594e+00 3.167e+00 -1.135
## year_built:heated_area:has_garage:has_br3 -5.830e-01 2.502e+00 -0.233
## year_built:heated_area:has_garage:has_br4_or_more NA NA NA
## Pr(>|t|)
## (Intercept) 0.451
## year_built 0.450
## heated_area 0.445
## has_garage 0.707
## has_br2 0.562
## has_br3 0.795
## has_br4_or_more 0.883
## year_built:heated_area 0.449
## year_built:has_garage 0.704
## heated_area:has_garage 0.178
## year_built:has_br2 0.565
## year_built:has_br3 0.802
## year_built:has_br4_or_more 0.872
## heated_area:has_br2 0.514
## heated_area:has_br3 0.474
## heated_area:has_br4_or_more 0.590
## has_garage:has_br2 0.678
## has_garage:has_br3 0.839
## has_garage:has_br4_or_more 0.852
## year_built:heated_area:has_garage 0.175
## year_built:heated_area:has_br2 0.518
## year_built:heated_area:has_br3 0.477
## year_built:heated_area:has_br4_or_more 0.594
## year_built:has_garage:has_br2 0.676
## year_built:has_garage:has_br3 0.839
## year_built:has_garage:has_br4_or_more 0.854
## heated_area:has_garage:has_br2 0.254
## heated_area:has_garage:has_br3 0.813
## heated_area:has_garage:has_br4_or_more 0.838
## year_built:heated_area:has_garage:has_br2 0.257
## year_built:heated_area:has_garage:has_br3 0.816
## year_built:heated_area:has_garage:has_br4_or_more NA
##
## Residual standard error: 118100 on 840 degrees of freedom
## Multiple R-squared: 0.4575, Adjusted R-squared: 0.4381
## F-statistic: 23.61 on 30 and 840 DF, p-value: < 2.2e-16
# Three-way interaction model
model_3way <- lm(price ~ (year_built + heated_area + has_garage + has_br)^3,
data = sales)
summary(model_3way)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^3, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -361812 -57526 -13227 33497 896580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.171e+07 1.552e+07 -0.755 0.4505
## year_built 5.948e+03 7.864e+03 0.756 0.4496
## heated_area 1.593e+04 2.083e+04 0.765 0.4445
## has_garage 2.468e+07 7.155e+07 0.345 0.7303
## has_br2 7.833e+06 1.570e+07 0.499 0.6179
## has_br3 4.690e+06 1.572e+07 0.298 0.7654
## has_br4_or_more 4.743e+05 1.618e+07 0.029 0.9766
## year_built:heated_area -7.993e+00 1.055e+01 -0.757 0.4490
## year_built:has_garage -1.247e+04 3.584e+04 -0.348 0.7279
## year_built:has_br2 -3.930e+03 7.956e+03 -0.494 0.6215
## year_built:has_br3 -2.304e+03 7.964e+03 -0.289 0.7724
## year_built:has_br4_or_more -1.168e+02 8.197e+03 -0.014 0.9886
## heated_area:has_garage -3.562e+03 2.134e+03 -1.669 0.0954 .
## heated_area:has_br2 -1.242e+04 2.093e+04 -0.593 0.5531
## heated_area:has_br3 -1.546e+04 2.091e+04 -0.740 0.4598
## heated_area:has_br4_or_more -1.307e+04 2.095e+04 -0.624 0.5331
## has_garage:has_br2 -2.129e+07 7.153e+07 -0.298 0.7661
## has_garage:has_br3 -1.349e+07 7.153e+07 -0.189 0.8504
## has_garage:has_br4_or_more -1.469e+07 7.157e+07 -0.205 0.8375
## year_built:heated_area:has_garage 1.803e+00 1.047e+00 1.722 0.0854 .
## year_built:heated_area:has_br2 6.232e+00 1.061e+01 0.588 0.5569
## year_built:heated_area:has_br3 7.789e+00 1.059e+01 0.735 0.4622
## year_built:heated_area:has_br4_or_more 6.562e+00 1.062e+01 0.618 0.5366
## year_built:has_garage:has_br2 1.076e+04 3.584e+04 0.300 0.7641
## year_built:has_garage:has_br3 6.783e+03 3.583e+04 0.189 0.8499
## year_built:has_garage:has_br4_or_more 7.306e+03 3.586e+04 0.204 0.8386
## heated_area:has_garage:has_br2 5.730e+01 5.516e+02 0.104 0.9173
## heated_area:has_garage:has_br3 3.168e+01 5.503e+02 0.058 0.9541
## heated_area:has_garage:has_br4_or_more 1.216e+02 5.504e+02 0.221 0.8251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118000 on 842 degrees of freedom
## Multiple R-squared: 0.4565, Adjusted R-squared: 0.4384
## F-statistic: 25.26 on 28 and 842 DF, p-value: < 2.2e-16
Before continuing, let us check if any of the three-way interactions are worth keeping in the model.
full <- model_3way
# Reduced model with main effects and two-way interactions
model_2way <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2,
data = sales)
reduced <- model_2way
summary(reduced)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362994 -58395 -16323 31733 906634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.824e+04 3.073e+06 0.006 0.995265
## year_built -1.084e+01 1.557e+03 -0.007 0.994447
## heated_area 3.331e+02 9.024e+02 0.369 0.712077
## has_garage 3.228e+06 1.071e+06 3.015 0.002647 **
## has_br2 -1.490e+06 3.133e+06 -0.476 0.634519
## has_br3 -5.552e+06 3.160e+06 -1.757 0.079297 .
## has_br4_or_more -6.658e+06 3.430e+06 -1.941 0.052567 .
## year_built:heated_area -7.135e-02 4.569e-01 -0.156 0.875948
## year_built:has_garage -1.686e+03 5.366e+02 -3.143 0.001732 **
## year_built:has_br2 8.081e+02 1.587e+03 0.509 0.610763
## year_built:has_br3 2.910e+03 1.600e+03 1.818 0.069344 .
## year_built:has_br4_or_more 3.484e+03 1.737e+03 2.006 0.045185 *
## heated_area:has_garage 7.894e+01 2.102e+01 3.755 0.000185 ***
## heated_area:has_br2 -1.542e+02 8.916e+01 -1.730 0.083976 .
## heated_area:has_br3 -1.394e+02 8.842e+01 -1.576 0.115296
## heated_area:has_br4_or_more -1.147e+02 8.868e+01 -1.293 0.196258
## has_garage:has_br2 8.375e+04 9.486e+04 0.883 0.377552
## has_garage:has_br3 -3.308e+04 9.516e+04 -0.348 0.728177
## has_garage:has_br4_or_more -4.208e+04 1.006e+05 -0.418 0.675829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4332
## F-statistic: 37.94 on 18 and 852 DF, p-value: < 2.2e-16
test_stat <- anova(reduced, full)
test_stat
## Analysis of Variance Table
##
## Model 1: price ~ (year_built + heated_area + has_garage + has_br)^2
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 852 1.1983e+13
## 2 842 1.1733e+13 10 2.4924e+11 1.7885 0.05876 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_2way)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362994 -58395 -16323 31733 906634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.824e+04 3.073e+06 0.006 0.995265
## year_built -1.084e+01 1.557e+03 -0.007 0.994447
## heated_area 3.331e+02 9.024e+02 0.369 0.712077
## has_garage 3.228e+06 1.071e+06 3.015 0.002647 **
## has_br2 -1.490e+06 3.133e+06 -0.476 0.634519
## has_br3 -5.552e+06 3.160e+06 -1.757 0.079297 .
## has_br4_or_more -6.658e+06 3.430e+06 -1.941 0.052567 .
## year_built:heated_area -7.135e-02 4.569e-01 -0.156 0.875948
## year_built:has_garage -1.686e+03 5.366e+02 -3.143 0.001732 **
## year_built:has_br2 8.081e+02 1.587e+03 0.509 0.610763
## year_built:has_br3 2.910e+03 1.600e+03 1.818 0.069344 .
## year_built:has_br4_or_more 3.484e+03 1.737e+03 2.006 0.045185 *
## heated_area:has_garage 7.894e+01 2.102e+01 3.755 0.000185 ***
## heated_area:has_br2 -1.542e+02 8.916e+01 -1.730 0.083976 .
## heated_area:has_br3 -1.394e+02 8.842e+01 -1.576 0.115296
## heated_area:has_br4_or_more -1.147e+02 8.868e+01 -1.293 0.196258
## has_garage:has_br2 8.375e+04 9.486e+04 0.883 0.377552
## has_garage:has_br3 -3.308e+04 9.516e+04 -0.348 0.728177
## has_garage:has_br4_or_more -4.208e+04 1.006e+05 -0.418 0.675829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4332
## F-statistic: 37.94 on 18 and 852 DF, p-value: < 2.2e-16
c <- coefficients(model_2way)
c
## (Intercept) year_built
## 1.823913e+04 -1.083981e+01
## heated_area has_garage
## 3.331486e+02 3.228399e+06
## has_br2 has_br3
## -1.489779e+06 -5.551769e+06
## has_br4_or_more year_built:heated_area
## -6.657537e+06 -7.135289e-02
## year_built:has_garage year_built:has_br2
## -1.686329e+03 8.081133e+02
## year_built:has_br3 year_built:has_br4_or_more
## 2.909934e+03 3.483509e+03
## heated_area:has_garage heated_area:has_br2
## 7.894349e+01 -1.542497e+02
## heated_area:has_br3 heated_area:has_br4_or_more
## -1.393855e+02 -1.146905e+02
## has_garage:has_br2 has_garage:has_br3
## 8.375274e+04 -3.308471e+04
## has_garage:has_br4_or_more
## -4.208227e+04
\[\begin{split} \hat{\text{price}} = 1.824\times 10^{4} - 10.84*\text{year} \\ + 333.1*\text{area} \\ + 3.228\times 10^{6}*\text{garage} \\ - 1.49\times 10^{6}*\text{br}_{2} \\ - 5.552\times 10^{6}*\text{br}_{3} \\ - 6.658\times 10^{6}*\text{br}_{4} \\ - 0.07135*\text{(year:area)} \\ - 1686*\text{(year:garage)} \\ + 808.1*\text{year:br}_{2} \\ + 2910*\text{year:br}_{3} \\ + 3484*\text{year:br}_{4} \\ + 78.94*\text{(area:garage)} \\ - 154.2*\text{area:br}_{2} \\ - 139.4*\text{area:br}_{3} \\ - 114.7*\text{area:br}_{4} \\ + 8.375\times 10^{4}*\text{garage:br}_{2} \\ - 3.308\times 10^{4}*\text{garage:br}_{3} \\ - 4.208\times 10^{4}*\text{garage:br}_{4} \end{split}\]
I will check if any of the two-way interactions are significant to determine if I can further simplify the model.
full <- model_2way
reduced <- model_1
test_stat <- anova(reduced, full)
test_stat
## Analysis of Variance Table
##
## Model 1: price ~ year_built + heated_area + has_garage + has_br
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 864 1.2842e+13
## 2 852 1.1983e+13 12 8.5935e+11 5.0918 2.959e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although I cannot remove all interactions, I can remove one interaction (year built and heated area) without affecting the rest of the model. Below is the t-test for the year built and heated area interaction.
s2 <- summary(model_2way)
s2
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362994 -58395 -16323 31733 906634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.824e+04 3.073e+06 0.006 0.995265
## year_built -1.084e+01 1.557e+03 -0.007 0.994447
## heated_area 3.331e+02 9.024e+02 0.369 0.712077
## has_garage 3.228e+06 1.071e+06 3.015 0.002647 **
## has_br2 -1.490e+06 3.133e+06 -0.476 0.634519
## has_br3 -5.552e+06 3.160e+06 -1.757 0.079297 .
## has_br4_or_more -6.658e+06 3.430e+06 -1.941 0.052567 .
## year_built:heated_area -7.135e-02 4.569e-01 -0.156 0.875948
## year_built:has_garage -1.686e+03 5.366e+02 -3.143 0.001732 **
## year_built:has_br2 8.081e+02 1.587e+03 0.509 0.610763
## year_built:has_br3 2.910e+03 1.600e+03 1.818 0.069344 .
## year_built:has_br4_or_more 3.484e+03 1.737e+03 2.006 0.045185 *
## heated_area:has_garage 7.894e+01 2.102e+01 3.755 0.000185 ***
## heated_area:has_br2 -1.542e+02 8.916e+01 -1.730 0.083976 .
## heated_area:has_br3 -1.394e+02 8.842e+01 -1.576 0.115296
## heated_area:has_br4_or_more -1.147e+02 8.868e+01 -1.293 0.196258
## has_garage:has_br2 8.375e+04 9.486e+04 0.883 0.377552
## has_garage:has_br3 -3.308e+04 9.516e+04 -0.348 0.728177
## has_garage:has_br4_or_more -4.208e+04 1.006e+05 -0.418 0.675829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118600 on 852 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4332
## F-statistic: 37.94 on 18 and 852 DF, p-value: < 2.2e-16
#Constructing new model
model_2 <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2 - year_built:heated_area,
data = sales)
summary(model_2)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2 - year_built:heated_area, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362688 -58292 -16329 31438 908343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.253e+05 2.993e+06 0.042 0.966613
## year_built -6.532e+01 1.517e+03 -0.043 0.965656
## heated_area 1.929e+02 8.722e+01 2.212 0.027254 *
## has_garage 3.278e+06 1.023e+06 3.204 0.001404 **
## has_br2 -1.454e+06 3.123e+06 -0.466 0.641620
## has_br3 -5.475e+06 3.120e+06 -1.755 0.079631 .
## has_br4_or_more -6.523e+06 3.317e+06 -1.966 0.049595 *
## year_built:has_garage -1.710e+03 5.138e+02 -3.329 0.000909 ***
## year_built:has_br2 7.899e+02 1.582e+03 0.499 0.617691
## year_built:has_br3 2.871e+03 1.580e+03 1.817 0.069539 .
## year_built:has_br4_or_more 3.416e+03 1.680e+03 2.033 0.042400 *
## heated_area:has_garage 7.747e+01 1.879e+01 4.124 4.09e-05 ***
## heated_area:has_br2 -1.542e+02 8.911e+01 -1.731 0.083868 .
## heated_area:has_br3 -1.399e+02 8.830e+01 -1.585 0.113420
## heated_area:has_br4_or_more -1.153e+02 8.853e+01 -1.303 0.193035
## has_garage:has_br2 8.367e+04 9.481e+04 0.883 0.377725
## has_garage:has_br3 -3.255e+04 9.505e+04 -0.342 0.732087
## has_garage:has_br4_or_more -4.118e+04 1.004e+05 -0.410 0.681712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118500 on 853 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4339
## F-statistic: 40.22 on 17 and 853 DF, p-value: < 2.2e-16
I reviewed the summary table for the model that includes all two-way interactions. Two-way interactions that consider if the house has a garage are shown to be significant when looking at the p-values. However, the p-values that include interaction between bedroom and garage are greater than our reject region. Let us see if the bedroom-garage interaction helps in the prediction of house sales price.
full <- model_2
# Model with Bedroom-Garage interactions removed
reduced <- lm(price ~ (year_built + heated_area + has_garage + has_br)^2 - year_built:heated_area - (has_garage:has_br),
data = sales)
summary(reduced)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2 - year_built:heated_area - (has_garage:has_br),
## data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -387631 -63674 -15902 31533 959313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.794e+05 3.017e+06 0.093 0.926222
## year_built -1.429e+02 1.530e+03 -0.093 0.925627
## heated_area 1.874e+02 7.118e+01 2.633 0.008611 **
## has_garage 3.574e+06 1.032e+06 3.463 0.000561 ***
## has_br2 -3.263e+06 3.128e+06 -1.043 0.297131
## has_br3 -4.890e+06 3.138e+06 -1.558 0.119582
## has_br4_or_more -6.291e+06 3.316e+06 -1.897 0.058144 .
## year_built:has_garage -1.831e+03 5.202e+02 -3.519 0.000456 ***
## year_built:has_br2 1.696e+03 1.586e+03 1.069 0.285387
## year_built:has_br3 2.551e+03 1.591e+03 1.604 0.109078
## year_built:has_br4_or_more 3.273e+03 1.680e+03 1.948 0.051724 .
## heated_area:has_garage 4.849e+01 1.778e+01 2.727 0.006518 **
## heated_area:has_br2 -1.049e+02 7.303e+01 -1.437 0.151173
## heated_area:has_br3 -1.219e+02 7.262e+01 -1.679 0.093614 .
## heated_area:has_br4_or_more -9.401e+01 7.302e+01 -1.288 0.198236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 120200 on 856 degrees of freedom
## Multiple R-squared: 0.427, Adjusted R-squared: 0.4177
## F-statistic: 45.57 on 14 and 856 DF, p-value: < 2.2e-16
test_stat <- anova(reduced, full)
test_stat
## Analysis of Variance Table
##
## Model 1: price ~ (year_built + heated_area + has_garage + has_br)^2 -
## year_built:heated_area - (has_garage:has_br)
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2 -
## year_built:heated_area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 856 1.2370e+13
## 2 853 1.1983e+13 3 3.8652e+11 9.1713 5.622e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now I will check if the continuous predictors are significant. If the continuous predictors are found to be insignificant, then their corresponding interactions will also have to be removed from the model.
full <- model_2
# Model with Year Built, Heated Area, and their corresponding interactions removed
reduced <- lm(price ~ (has_garage + has_br)^2,
data = sales)
summary(reduced)
##
## Call:
## lm(formula = price ~ (has_garage + has_br)^2, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -441893 -73447 -19027 40923 988107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129631 33901 3.824 0.000141 ***
## has_garage 121602 85316 1.425 0.154428
## has_br2 6815 35443 0.192 0.847562
## has_br3 128177 36645 3.498 0.000493 ***
## has_br4_or_more 206769 42783 4.833 1.59e-06 ***
## has_garage:has_br2 -32012 86448 -0.370 0.711245
## has_garage:has_br3 -80384 86789 -0.926 0.354601
## has_garage:has_br4_or_more 23891 91697 0.261 0.794511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135600 on 863 degrees of freedom
## Multiple R-squared: 0.2649, Adjusted R-squared: 0.2589
## F-statistic: 44.43 on 7 and 863 DF, p-value: < 2.2e-16
test_stat <- anova(reduced, full)
test_stat
## Analysis of Variance Table
##
## Model 1: price ~ (has_garage + has_br)^2
## Model 2: price ~ (year_built + heated_area + has_garage + has_br)^2 -
## year_built:heated_area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 863 1.5869e+13
## 2 853 1.1983e+13 10 3.8864e+12 27.665 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After performing tests above, the regression line for the model is below.
c <- coefficients(model_2)
c
## (Intercept) year_built
## 1.253312e+05 -6.531881e+01
## heated_area has_garage
## 1.928996e+02 3.277593e+06
## has_br2 has_br3
## -1.453852e+06 -5.475093e+06
## has_br4_or_more year_built:has_garage
## -6.522773e+06 -1.710359e+03
## year_built:has_br2 year_built:has_br3
## 7.898488e+02 2.871321e+03
## year_built:has_br4_or_more heated_area:has_garage
## 3.415623e+03 7.747259e+01
## heated_area:has_br2 heated_area:has_br3
## -1.542150e+02 -1.399237e+02
## heated_area:has_br4_or_more has_garage:has_br2
## -1.153323e+02 8.367309e+04
## has_garage:has_br3 has_garage:has_br4_or_more
## -3.254989e+04 -4.118238e+04
\[\begin{split} \hat{\text{price}} = 1.253\times 10^{5} - 65.32*\text{year} \\ + 192.9*\text{area} \\ + 3.278\times 10^{6}*\text{garage} \\ - 1.454\times 10^{6}*\text{br}_{2} \\ - 5.475\times 10^{6}*\text{br}_{3} \\ - 6.523\times 10^{6}*\text{br}_{4} \\ - 1710*\text{(year:garage)} \\ + 789.8*\text{year:br}_{2} \\ + 2871*\text{year:br}_{3} \\ + 3416*\text{year:br}_{4} \\ + 77.47*\text{(area:garage)} \\ - 154.2*\text{area:br}_{2} \\ - 139.9*\text{area:br}_{3} \\ - 115.3*\text{area:br}_{4} \\ + 8.367\times 10^{4}*\text{garage:br}_{2} \\ - 3.255\times 10^{4}*\text{garage:br}_{3} \\ - 4.118\times 10^{4}*\text{garage:br}_{4} \end{split}\]
Since the interactions are significant, the remaining predictors are also significant.
summary(model_2)
##
## Call:
## lm(formula = price ~ (year_built + heated_area + has_garage +
## has_br)^2 - year_built:heated_area, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362688 -58292 -16329 31438 908343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.253e+05 2.993e+06 0.042 0.966613
## year_built -6.532e+01 1.517e+03 -0.043 0.965656
## heated_area 1.929e+02 8.722e+01 2.212 0.027254 *
## has_garage 3.278e+06 1.023e+06 3.204 0.001404 **
## has_br2 -1.454e+06 3.123e+06 -0.466 0.641620
## has_br3 -5.475e+06 3.120e+06 -1.755 0.079631 .
## has_br4_or_more -6.523e+06 3.317e+06 -1.966 0.049595 *
## year_built:has_garage -1.710e+03 5.138e+02 -3.329 0.000909 ***
## year_built:has_br2 7.899e+02 1.582e+03 0.499 0.617691
## year_built:has_br3 2.871e+03 1.580e+03 1.817 0.069539 .
## year_built:has_br4_or_more 3.416e+03 1.680e+03 2.033 0.042400 *
## heated_area:has_garage 7.747e+01 1.879e+01 4.124 4.09e-05 ***
## heated_area:has_br2 -1.542e+02 8.911e+01 -1.731 0.083868 .
## heated_area:has_br3 -1.399e+02 8.830e+01 -1.585 0.113420
## heated_area:has_br4_or_more -1.153e+02 8.853e+01 -1.303 0.193035
## has_garage:has_br2 8.367e+04 9.481e+04 0.883 0.377725
## has_garage:has_br3 -3.255e+04 9.505e+04 -0.342 0.732087
## has_garage:has_br4_or_more -4.118e+04 1.004e+05 -0.410 0.681712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118500 on 853 degrees of freedom
## Multiple R-squared: 0.4449, Adjusted R-squared: 0.4339
## F-statistic: 40.22 on 17 and 853 DF, p-value: < 2.2e-16
For the first visualization, I will model houses with attached garages but different amount of bedrooms. Price is the outcome and will be on the Y-axis. The continuous predictor of Heated Area will be on the X-axis. I will use different line for the different amount of bedrooms. I will assume this house is average age and was built in 1984.
sales <- sales %>%
mutate(ya_g1_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*0,
ya_g1_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*1
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*1
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*1
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*0,
ya_g1_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*1
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*1
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*1
+ c["has_garage:has_br4_or_more"]*1*0,
ya_g1_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*1
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*1
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*1
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*1
)
sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = ya_g1_b1), color = "red") +
geom_line(aes(y = ya_g1_b2), color = "green") +
geom_line(aes(y = ya_g1_b3), color = "blue") +
geom_line(aes(y = ya_g1_b4), color = "purple") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for Heated Area with Attached Garage")
sales %>% filter(bedrooms == 1) %>%
ggplot(aes(x = heated_area, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = ya_g1_b1), color = "red") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for 1-Bedroom Houses")
The graph shows a 1-bedroom house that is close to 2000 sq-ft that sold for nearly $500,000. This observation falls in line with the other observations as unlikely as it seems. This could indicate a similar issue to the no-bedroom houses discussed earlier in this analysis.
sales <- sales %>%
mutate(ya_g0_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
ya_g0_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*1
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*1
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*1
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
ya_g0_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*1
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*1
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*1
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*1
+ c["has_garage:has_br4_or_more"]*0*0,
ya_g0_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$mean, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*1
+ c["year_built:has_garage"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br2"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br3"]*round(year_summary$mean, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$mean, 0)*1
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*1
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*1
)
sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = ya_g1_b1), color = "red") +
geom_line(aes(y = ya_g1_b2), color = "green") +
geom_line(aes(y = ya_g1_b3), color = "blue") +
geom_line(aes(y = ya_g1_b4), color = "purple") +
geom_line(aes(y = ya_g0_b1), color = "red", alpha = 0.5) +
geom_line(aes(y = ya_g0_b2), color = "green", alpha = 0.5) +
geom_line(aes(y = ya_g0_b3), color = "blue", alpha = 0.5) +
geom_line(aes(y = ya_g0_b4), color = "purple", alpha = 0.5) +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for Heated Area (No Attached Garage is lighter color)")
Let us compare all the predicted values at the median heated area of 1,361 sq-ft. I will also display predicted value at the smallest area of 136 and the largest of 5,731.
sales %>% select(heated_area,
ya_g1_b1,
ya_g1_b2,
ya_g1_b3,
ya_g1_b4,
ya_g0_b1,
ya_g0_b2,
ya_g0_b3,
ya_g0_b4,
) %>%
filter(heated_area == round(area_summary$median, 0) | heated_area == round(area_summary$min, 0) | heated_area == round(area_summary$max, 0)) %>%
# There could be multiple actual homes with these areas. We only need one of each
# since we only want the predicted values
distinct %>%
arrange(heated_area)
## # A tibble: 3 × 9
## heated_area ya_g1_b1 ya_g1_b2 ya_g1_b3 ya_g1_b4 ya_g0_b1 ya_g0_b2 ya_g0_b3
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 136 -83249. 92659. 86780. 113707. 21973. 114208. 224551.
## 2 1361 247957. 234952. 246579. 303631. 258275. 161596. 289447.
## 3 5731 1429484. 742559. 816639. 981155. 1101247. 330648. 520952.
## # … with 1 more variable: ya_g0_b4 <dbl>
Let us plot if the house was built last year using the same X and Y axis.
sales <- sales %>%
mutate(yb_g1_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*1
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*0,
yb_g1_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*1
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*1
+ c["year_built:has_br2"]*round(year_summary$max, 0)*1
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*1
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*1
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*0,
yb_g1_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*1
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*1
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*1
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*1
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*1
+ c["has_garage:has_br4_or_more"]*1*0,
yb_g1_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*1
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*1
+ c["year_built:has_garage"]*round(year_summary$max, 0)*1
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*1
+ c["heated_area:has_garage"]*heated_area*1
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*1
+ c["has_garage:has_br2"]*1*0
+ c["has_garage:has_br3"]*1*0
+ c["has_garage:has_br4_or_more"]*1*1
)
sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = yb_g1_b1), color = "red") +
geom_line(aes(y = yb_g1_b2), color = "green") +
geom_line(aes(y = yb_g1_b3), color = "blue") +
geom_line(aes(y = yb_g1_b4), color = "purple") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for Heated Area with Attached Garage")
sales <- sales %>%
mutate(yb_g0_b1 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*0
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
yb_g0_b2 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*1
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*0
+ c["year_built:has_br2"]*round(year_summary$max, 0)*1
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*1
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*1
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
yb_g0_b3 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*1
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*round(year_summary$max, 0)*0
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*1
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*0
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*1
+ c["heated_area:has_br4_or_more"]*heated_area*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*1
+ c["has_garage:has_br4_or_more"]*0*0,
yb_g0_b4 = c["(Intercept)"] + c["year_built"]*round(year_summary$max, 0)
+ c["heated_area"]*heated_area + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*1
+ c["year_built:has_garage"]*round(year_summary$max, 0)*0
+ c["year_built:has_br2"]*round(year_summary$max, 0)*0
+ c["year_built:has_br3"]*round(year_summary$max, 0)*0
+ c["year_built:has_br4_or_more"]*round(year_summary$max, 0)*1
+ c["heated_area:has_garage"]*heated_area*0
+ c["heated_area:has_br2"]*heated_area*0
+ c["heated_area:has_br3"]*heated_area*0
+ c["heated_area:has_br4_or_more"]*heated_area*1
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*1
)
sales %>% ggplot(aes(x = heated_area, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = yb_g1_b1), color = "red") +
geom_line(aes(y = yb_g1_b2), color = "green") +
geom_line(aes(y = yb_g1_b3), color = "blue") +
geom_line(aes(y = yb_g1_b4), color = "purple") +
geom_line(aes(y = yb_g0_b1), color = "red", alpha = .5) +
geom_line(aes(y = yb_g0_b2), color = "green", alpha = .5) +
geom_line(aes(y = yb_g0_b3), color = "blue", alpha = .5) +
geom_line(aes(y = yb_g0_b4), color = "purple", alpha = .5) +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for Heated Area (No Attached Garage is lighter color)")
Again, let us compare the predicted values at the median, minimum, and maximum heated area of 1,361, 136, and 5,731 sq-ft respectively.
sales %>% select(heated_area,
yb_g1_b1,
yb_g1_b2,
yb_g1_b3,
yb_g1_b4,
yb_g0_b1,
yb_g0_b2,
yb_g0_b3,
yb_g0_b4,
) %>%
filter(heated_area == round(area_summary$median, 0) | heated_area == round(area_summary$min, 0) | heated_area == round(area_summary$max, 0)) %>%
# There could be multiple actual homes with these areas. We only need one of each
# since we only want the predicted values
distinct %>%
arrange(heated_area)
## # A tibble: 3 × 9
## heated_area yb_g1_b1 yb_g1_b2 yb_g1_b3 yb_g1_b4 yb_g0_b1 yb_g0_b2 yb_g0_b3
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 136 -148949. 56183. 127318. 174385. 19556. 141015. 328373.
## 2 1361 182257. 198476. 287118. 364309. 255858. 188404. 393269.
## 3 5731 1363784. 706083. 857178. 1041833. 1098830. 357456. 624774.
## # … with 1 more variable: yb_g0_b4 <dbl>
Now, I will graph the year built on the X-axis. I set heated area constant at 1361 and assume the house does not have a garage.
sales <- sales %>%
mutate(yc_g0_b1 = c["(Intercept)"] + c["year_built"]*year_built
+ c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*year_built*0
+ c["year_built:has_br2"]*year_built*0
+ c["year_built:has_br3"]*year_built*0
+ c["year_built:has_br4_or_more"]*year_built*0
+ c["heated_area:has_garage"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br2"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br3"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
yc_g0_b2 = c["(Intercept)"] + c["year_built"]*year_built
+ c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0
+ c["has_br2"]*1
+ c["has_br3"]*0
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*year_built*0
+ c["year_built:has_br2"]*year_built*1
+ c["year_built:has_br3"]*year_built*0
+ c["year_built:has_br4_or_more"]*year_built*0
+ c["heated_area:has_garage"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br2"]*round(area_summary$median, 0)*1
+ c["heated_area:has_br3"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
+ c["has_garage:has_br2"]*0*1
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*0,
yc_g0_b3 = c["(Intercept)"] + c["year_built"]*year_built
+ c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*1
+ c["has_br4_or_more"]*0
+ c["year_built:has_garage"]*year_built*0
+ c["year_built:has_br2"]*year_built*0
+ c["year_built:has_br3"]*year_built*1
+ c["year_built:has_br4_or_more"]*year_built*0
+ c["heated_area:has_garage"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br2"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br3"]*round(area_summary$median, 0)*1
+ c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*0
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*1
+ c["has_garage:has_br4_or_more"]*0*0,
yc_g0_b4 = c["(Intercept)"] + c["year_built"]*year_built
+ c["heated_area"]*round(area_summary$median, 0) + c["has_garage"]*0
+ c["has_br2"]*0
+ c["has_br3"]*0
+ c["has_br4_or_more"]*1
+ c["year_built:has_garage"]*year_built*0
+ c["year_built:has_br2"]*year_built*0
+ c["year_built:has_br3"]*year_built*0
+ c["year_built:has_br4_or_more"]*year_built*1
+ c["heated_area:has_garage"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br2"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br3"]*round(area_summary$median, 0)*0
+ c["heated_area:has_br4_or_more"]*round(area_summary$median, 0)*1
+ c["has_garage:has_br2"]*0*0
+ c["has_garage:has_br3"]*0*0
+ c["has_garage:has_br4_or_more"]*0*1
)
sales %>% ggplot(aes(x = year_built, y = price, color = has_br)) +
geom_point(size=0.5) +
geom_line(aes(y = yc_g0_b1), color = "red") +
geom_line(aes(y = yc_g0_b2), color = "green") +
geom_line(aes(y = yc_g0_b3), color = "blue") +
geom_line(aes(y = yc_g0_b4), color = "purple") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",",
scientific = FALSE)) +
theme_minimal() +
ggtitle("Plot for Year Built")
When the year built is held constant, the sales price on average increases as heated area increases. One-bedroom house expected sales prices increase the most rapidly as heated area increases. As mentioned earlier, these could be houses next to multi-bedroom houses that sold on the same deal.
For houses built in 1984 and in 2021, houses with garages have a greater increase in expected sale price as heated area increases than houses without an attached garage. However, houses with garages below a certain heated area are expected to sale for less than houses without a garage. Three-bedroom houses without garages are expected to sale for more money than two-bedroom houses without garages. Meanwhile, houses with 4 or more bedrooms are consistently expected to sale for more than both two- and three-bedroom houses as heated area increases.
When year built increases, expected sales prices increases for three- and four or more bedroom houses. Sales price for two-bedroom houses increase at a lower rate, and one-bedroom houses decrease in sales price as the year built increases.
The most surprising result of the model is that having an attached garage only helps the expected sales price if the house is above a particular square footage.