How will Covid-19 impact the “new normality” of the housing market? Do you think it will be the right moment to buy a house?
Given the extraordinary characteristics of Covid-19, it is difficult to make precise predictions of macroeconomic conditions in a post-pandemic world.
However, we are able to analyze the data we have and learn from the past to detect patterns.
[My name is Ricardo Santana, and I hope you enjoy this project, developed to understand the housing market in London. I am passionate about data and found this dataset challenging and motivating, so I decided to explore it and share the results. I will explore the data, visualize it through a geospatial approach, and identify the trends in London housing prices. Then I will focus on the Kensington and Chelsea borough and propose a model for price prediction. Note that this dataset includes information up to January 2020.]
First of all, I prepare the environment and load libraries to start the study:
{
# Install any missing packages, then load everything
packages = c("knitr","tidyverse","scales","stringr","DataExplorer","caret",
"nnet","rpart","rpart.plot","e1071",
"randomForest","xgboost","ada","MASS","questionr",
"psych","car","Hmisc","jsonlite","ggplot2","dplyr",
"gridExtra","corrplot","lubridate","plotly","maps","maptools","broom",
"data.table","httr","leaflet","widgetframe","here",
"rgdal","raster","Metrics","gbm","grid","viridis","dygraphs","xts","igraph","ggraph",
"ggfortify","forecast","gapminder","gganimate")
newpack = packages[!(packages %in% installed.packages()[,"Package"])]
if(length(newpack)) install.packages(newpack)
a = lapply(packages, library, character.only = TRUE)
}
We load the data we will be working with: a publicly available Housing in London dataset with monthly records by area.
We start with around 13,000 observations. Every observation includes variables such as the average price and the London area it refers to. For deeper studies of this matter, the House Price Index would be a more suitable metric; however, the average price is our best indicator of the housing market given this dataset. I emphasize that this is a practice project that shows what can be learned from past data.
So, we summarize the data to get a preliminary idea of the dataset:
## date area average_price code
## Min. :1995-01-01 Length:13549 Min. : 40722 Length:13549
## 1st Qu.:2001-04-01 Class :character 1st Qu.: 132380 Class :character
## Median :2007-07-01 Mode :character Median : 222919 Mode :character
## Mean :2007-06-30 Mean : 263520
## 3rd Qu.:2013-10-01 3rd Qu.: 336843
## Max. :2020-01-01 Max. :1463378
##
## houses_sold no_of_crimes borough_flag
## Min. : 2 Min. : 0 Min. :0.0000
## 1st Qu.: 247 1st Qu.:1623 1st Qu.:0.0000
## Median : 371 Median :2132 Median :1.0000
## Mean : 3894 Mean :2158 Mean :0.7333
## 3rd Qu.: 3146 3rd Qu.:2582 3rd Qu.:1.0000
## Max. :132163 Max. :7461 Max. :1.0000
## NA's :94 NA's :6110
## Observations: 13,549
## Variables: 7
## $ date <date> 1995-01-01, 1995-02-01, 1995-03-01, 1995-04-01, 1995...
## $ area <chr> "city of london", "city of london", "city of london",...
## $ average_price <dbl> 91449, 82203, 79121, 77101, 84409, 94901, 110128, 112...
## $ code <chr> "E09000001", "E09000001", "E09000001", "E09000001", "...
## $ houses_sold <dbl> 17, 7, 14, 7, 10, 17, 13, 14, 17, 14, 11, 18, 17, 10,...
## $ no_of_crimes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ borough_flag <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
We can observe that there are missing values (NAs) in houses_sold (94) and no_of_crimes (6,110), which we will handle with na.rm or filtering. The observations are monthly records from January 1995 to January 2020. Furthermore, the borough_flag column takes the values 1 or 0: London boroughs are flagged with 1, which will be relevant for filtering the data and inferring knowledge.
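A quick way to audit missing values per column is `colSums(is.na(...))`. A minimal sketch on a synthetic stand-in frame with the same columns (the values below are illustrative, not the real dataset):

```r
# Synthetic stand-in mirroring the dataset's columns (illustrative values only)
data_demo <- data.frame(
  date          = as.Date(c("1995-01-01", "1995-02-01", "1995-03-01")),
  area          = "city of london",
  average_price = c(91449, 82203, 79121),
  code          = "E09000001",
  houses_sold   = c(17, NA, 14),
  no_of_crimes  = c(NA, NA, 120),
  borough_flag  = 1
)
# Count missing values per column
na_counts <- colSums(is.na(data_demo))
na_counts
```

Running the same call on the real data frame reproduces the NA counts shown in the summary above.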
data1 <- data %>%
mutate(code = factor(code),
date = ymd(date),
area = factor(area),
average_price = as.numeric(average_price),
houses_sold = as.numeric(houses_sold),
no_of_crimes = as.numeric(no_of_crimes),
borough_flag = as.numeric(borough_flag))
Let’s visualize the dataset.
price <- data1 %>%
filter(area %in% c("england","london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))
ken <- price %>%
filter(area %in% c("england"))%>%
mutate(mean = mean/1000)
cam <- price %>%
filter(area %in% c("london"))%>%
mutate(mean = mean/1000)
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(3,6)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Mean Price Housing Market: London vs England")%>%
dyOptions(fillGraph=TRUE) %>%
dyAxis("y", label = "Price (Thousands of £)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dySeries("mean", label = "England")%>%
dySeries("mean1", label = "London")%>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
prueba2 <- data1 %>%
mutate(borough_flag = factor(borough_flag, levels = c("0", "1"), labels = c("Rest of Great Britain","London")))%>%
filter(year(date) > 2003)%>%
group_by(Territory = borough_flag,year = year(date)) %>%
summarise(vol = sum(houses_sold))
cam <- prueba2 %>%
filter(Territory %in% c("London"))%>%
mutate(vol = vol/1000)
don <- xts(x=cam[,c(3)], order.by= as.Date(ISOdate(cam$year, 1, 1)))
dygraph(don, main = "Volume of Transactions in London")%>%
dyOptions( fillGraph=TRUE) %>%
dyAxis("y", label = "Number of transactions (thousands)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dySeries("vol", label = "Transactions")%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2018-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
per <- data1 %>%
filter(area %in% c("england","london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))%>%
mutate(pct_change = (mean/lag(mean) - 1) * 100)%>%
mutate(pct_change = replace_na(pct_change, 0))
ken <- per %>%
filter(area %in% c("england"))
cam <- per %>%
filter(area %in% c("london"))
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(4,8)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Price Housing Market Variation (%)")%>%
dyOptions( fillGraph=TRUE) %>%
dyAxis("y", label = "Price variation (%)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dySeries("pct_change", label = "England")%>%
dySeries("pct_change1", label = "London")%>%
dyLegend(width = 650)%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
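The pct_change column computed above is a simple lag ratio; a base-R sketch with hypothetical yearly means shows the arithmetic:

```r
# Year-over-year percentage change as a lag ratio (hypothetical yearly means)
prices <- c(100, 110, 121, 133.1)
pct_change <- (prices / c(NA, head(prices, -1)) - 1) * 100
pct_change[is.na(pct_change)] <- 0   # same role as replace_na(pct_change, 0)
pct_change
```

In the pipeline, dplyr's `lag()` plays the role of the shifted vector, and `replace_na()` zeroes the first year, which has no predecessor.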
First of all, we compare the housing prices in London with the rest of Great Britain. These two graphs (mean price and its percentage variation) highlight several points worth noting.
Next, we need to see the distribution of average prices across London, so I take a geospatial approach to identify the boroughs. I map each borough's average price over the last 5 years:
states <- readOGR("London_Borough_Excluding_MHW.shp")
states@data$NAME <- gsub('"', '', states@data$NAME)   # strip stray quotes (result must be assigned)
states@data$NAME <- str_to_lower(states@data$NAME, locale = "en")
data3 <- data1 %>%
filter(year(date) %in% c(2015:2020))%>%
group_by(NAME = area) %>%
summarise(mean = mean(average_price, na.rm = TRUE))
fusion <- merge(states, data3, by = "NAME")
geoData_latlon <- spTransform(fusion, CRS("+proj=longlat +datum=WGS84"))
factpal2 <- colorFactor("Reds", geoData_latlon$mean)
leaflet() %>%
addTiles() %>%
addPolygons(label = ~(paste("Borough:",str_to_title(NAME),":","£",round(mean, digits = 2))),data = geoData_latlon,
stroke = FALSE, smoothFactor = 0.2,fillOpacity = 0.8, color = ~factpal2(mean),
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",direction = "auto"), group = "NYC Boroughs")
We see that most of the expensive boroughs are located in the centre: the farther we move from it, the cheaper the houses become, although there are important exceptions. The most expensive boroughs are Kensington and Chelsea, Westminster and Camden, in that order (hover over the map to see the amounts). The boroughs around them, such as Hammersmith and Fulham, City of London, Wandsworth and Islington, also show remarkable prices, as do the boroughs around Camden such as Barnet or Haringey. One borough lies away from this expensive core yet still shows high prices: Richmond upon Thames. We also see that peripheral west boroughs such as Brent, Harrow and Ealing are more expensive than their peripheral east counterparts.
Now that we have visualized the boroughs, let's see how prices evolve across them to identify patterns. First, we check the three most expensive boroughs: Camden, Kensington and Chelsea, and Westminster (Group 1). Then we check some of the boroughs around them (Group 2). Finally, we also check the cheapest boroughs (Group 3).
In this case, we visualize the trend of Inner London and Outer London prices to try to infer a pattern. Inner London includes the following boroughs: City of London, Camden, Greenwich, Hackney, Hammersmith and Fulham, Islington, Kensington and Chelsea, Lambeth, Lewisham, Newham, Southwark, Tower Hamlets, Wandsworth, and Westminster. Outer London includes these boroughs: Barking and Dagenham, Barnet, Bexley, Brent, Bromley, Croydon, Ealing, Enfield, Haringey, Harrow, Havering, Hillingdon, Hounslow, Kingston upon Thames, Merton, Redbridge, Richmond upon Thames, Sutton, and Waltham Forest.
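The grouping above can also be written down as a plain vector and used to filter borough-level rows directly (a sketch; the chunks below instead use the precomputed "inner london" / "outer london" rows of the dataset):

```r
# Inner London boroughs, lower-cased to match the `area` column
inner_london <- c("city of london", "camden", "greenwich", "hackney",
                  "hammersmith and fulham", "islington", "kensington and chelsea",
                  "lambeth", "lewisham", "newham", "southwark", "tower hamlets",
                  "wandsworth", "westminster")
length(inner_london)  # 14 boroughs
# hypothetical usage: data1 %>% filter(area %in% inner_london)
```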
price <- data1 %>%
filter(area %in% c("inner london","outer london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))%>%
mutate(pct_change = (mean/lag(mean) - 1) * 100)%>%
mutate(pct_change = replace_na(pct_change, 0))
ken <- price %>%
filter(area %in% c("inner london"))
cam <- price %>%
filter(area %in% c("outer london"))
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(4,8)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Price Housing Market Variation (%): Inner London vs Outer London")%>%
dyOptions(fillGraph=TRUE) %>%
dyAxis("y", label = "Variation of price (%)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dySeries("pct_change", label = "Inner London")%>%
dySeries("pct_change1", label = "Outer London")%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
Let’s visualize different groups of boroughs:
1) Group 1: the 3 most expensive zones in the time series: Camden, Westminster, and Kensington and Chelsea.
2) Group 2: boroughs around Group 1 with high prices.
3) Group 3: peripheral boroughs from Outer London with the cheapest average prices.
west <- data1 %>%
filter(area %in% c("westminster"))%>%
mutate(average_price = average_price/1000)
ken <- data1 %>%
filter(area %in% c("kensington and chelsea"))%>%
mutate(average_price = average_price/1000)
cam <- data1 %>%
filter(area %in% c("camden"))%>%
mutate(average_price = average_price/1000)
data2 <- merge(ken,west, by = "date")
data2 <- merge(data2,cam, by = "date")
don <- xts(x=data2[,c(3,9,15)], order.by=data2$date)
dyUnzoom <-function(dygraph) {
dyPlugin(
dygraph = dygraph,
name = "Unzoom",
path = system.file("plugins/unzoom.js", package = "dygraphs")
)
}
p <- dygraph(don, main = "Most expensive boroughs in London")%>%
dyOptions( fillGraph=TRUE ) %>%
dyAxis("y", label = "Price (thousands of £)") %>%
dySeries("average_price.y", label = "Westminster")%>%
dySeries("average_price.x", label = "Kensin. & Chelsea")%>%
dySeries("average_price", label = "Camden")%>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dyShading(from = "2008-1-1", to = "2009-3-1", color = "#FFE6E6") %>%
dyShading(from = "2016-6-23", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2008-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-6-23", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
p
We can observe that 2008/2009 brought a recession in terms of prices, and for Kensington and Chelsea the fall was more intense. From March 2009 to December 2014 the prices of this borough experienced significant growth. After 2016, the ups and downs reveal a volatile price, more so than in Westminster and Camden.
In this section we forecast the average price of the Kensington and Chelsea borough. It is the most expensive borough, and we have seen a negative shift from 2016 with volatile behaviour. Furthermore, we cannot discard the possibility of this behaviour radiating from Inner London to Outer London. I will first apply a decomposition, and then produce predictions with the Holt-Winters and ARIMA approaches.
prueba <- data1 %>%
dplyr::filter(area %in% c("kensington and chelsea"))
west_ts <- ts(prueba, start = c(1995,1), frequency = 12)
west_l <- log(west_ts[,3])
flour.stl <- stl(west_l, "period")
autoplot(flour.stl,ts.colour = "red")+ ggtitle("Kensington and Chelsea borough") +
xlab("Time (Year)") + ylab("Average Price") +
theme(plot.title = element_text(hjust = 0.5))
west_ts <- ts(prueba[,3], start = c(1995,1), frequency = 12)
ggseasonplot(west_ts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("Average Price") +
ggtitle("Seasonal plot: Kensington and Chelsea")
We detect a seasonal curve, which captures changes that repeat across the series. The remainder is the random noise in the series. If we sum these components together with the trend, we recover the original, undecomposed curve. It is useful to see that the seasonal and noise components span significant intervals: indeed, the difference between the trend and the data is remarkable.
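That additivity (data = seasonal + trend + remainder) can be verified directly. A minimal check using the built-in AirPassengers series as a stand-in, since the borough data is not reproduced here:

```r
# Additive STL decomposition: the three components reconstruct the series
x   <- log(AirPassengers)            # log-transform, as in the analysis
dec <- stl(x, s.window = "periodic") # seasonal-trend decomposition by loess
recon <- rowSums(dec$time.series)    # seasonal + trend + remainder
max(abs(recon - as.numeric(x)))      # ~0: the decomposition is exact
```

The same identity holds for the Kensington and Chelsea series decomposed above, since the remainder is defined as whatever the seasonal and trend components do not explain.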
But how do the prices evolve in terms of variation compared with the previous period? Let's look at a candlestick chart.
kc <- data1 %>%
filter(area == "kensington and chelsea") %>%
group_by(Year = year(date))%>%
summarise(Open = first(average_price)/1000,   # yearly OHLC, in thousands of £
Maximum = max(average_price)/1000,
Minimum = min(average_price)/1000,
Close = last(average_price)/1000)
dygraph(kc) %>%
dyCandlestick()%>%
dyLegend(width = 650)%>%
dyAxis("y", label = "Price (Thousands £)") %>%
dyAxis("x", label = "Time (years)", drawGrid = FALSE)%>%
dyShading(from = "2008", to = "2011", color = "#FFE6E6") %>%
dyShading(from = "2016", to = "2020", color = "#CCEBD6") %>%
dyEvent("2008", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
We see more clearly that the trend curve flattens from 2016. Furthermore, 2015 shows a negative value, which marks the beginning of the price reduction. If we check other boroughs, we see a fall from 2016, but not in 2015.
HW <-HoltWinters(west_l)
myhw <- forecast(HW, h = 24, findfrequency = TRUE)
autoplot(myhw) + ggtitle("Kensington and Chelsea Housing Market: Holt-Winters Prediction") +
xlab("Time (year)") + ylab("Log Average Price") + theme_minimal()+
theme(plot.title = element_text(hjust = 0.5)) +
xlim(c(2010,2022))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
Here we present a prediction based on Holt-Winters, which creates a corridor of 80% and 95% confidence intervals (the forecast defaults). This method suggests that the trend will continue growing with a low slope.
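For reference, the same kind of corridor can be produced with base R alone (no forecast package), again using AirPassengers as a stand-in series:

```r
# Holt-Winters fit and 24-month prediction interval with base R only
x  <- log(AirPassengers)                 # stand-in for the borough series
hw <- HoltWinters(x)                     # additive Holt-Winters fit
fc <- predict(hw, n.ahead = 24,
              prediction.interval = TRUE, level = 0.95)
head(fc)                                 # columns: fit, upr, lwr
```

`forecast(HW, h = 24)` as used above wraps this fit and adds the 80% band and plotting conveniences.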
arim.ts <- ts(log(prueba[,3]), start = c(1995,1), frequency = 12)
arim <- auto.arima(arim.ts)
checkresiduals(arim)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(1,0,1)[12] with drift
## Q* = 32.26, df = 18, p-value = 0.02048
##
## Model df: 6. Total lags used: 24
The residual graphs give us an idea of the model's performance. The mean of the residuals is close to zero, and the time plot suggests we can treat the variance as roughly constant. However, the histogram suggests the distribution may not be normal, given the right tail, and the Ljung-Box p-value (0.02) indicates that some autocorrelation remains in the residuals, so the fit is reasonable but not perfect.
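The Ljung-Box statistic reported by checkresiduals() is the test available in base R as stats::Box.test; a self-contained sketch with simulated residuals shows how the degrees of freedom in the output above (18 = 24 lags − 6 model df) arise:

```r
# Ljung-Box test on simulated white-noise residuals (stand-in for ARIMA residuals)
set.seed(42)
res <- rnorm(200)
bt  <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 6)
unname(bt$parameter)   # df = lag - fitdf = 18, matching the output above
```

A small p-value, as in our output, means the residuals still carry some autocorrelation that the model has not captured.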
autoplot(forecast(arim, h = 24), fcol = "red") +
ggtitle("Kensington and Chelsea Housing Market: ARIMA Prediction")+
xlim(c(2010,2022))+
ylab('Log Average Price')+
xlab('Time (Year)')+
theme_minimal()
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
The predictions produced by the ARIMA model match those of the Holt-Winters method. So, barring extraordinary circumstances, the average price of the housing market in the Kensington and Chelsea borough should continue to grow slightly.
We can conclude with several relevant points discovered through the study:
The housing market trends of both London and England show two pivotal moments: the 2008 downturn and the Brexit referendum (2016). In both events the average price fell. During the downturn, the fall was sharp but short; since 2016 prices have stalled.
The average price decreased when the volume of transactions fell. In the years with the most growth, such as 2013 and 2014, the volume of transactions grew. So the volume of transactions has an important influence on the adjustment of demand and supply.
In 2015 the average price of Inner London decreased whereas Outer London's increased; this could be a radiation effect from Inner London to Outer London. However, other boroughs have not presented this behaviour, given the incentives to purchase houses worth less than £600,000.
Exploring Kensington and Chelsea, the decomposition showed that the noise and seasonal components are strong, although there is a clear positive trend.
The Holt-Winters method predicted moderate growth for the following 2 years, under normal circumstances.
The ARIMA model gave a similar result: growing prices, although its forecast curve is less volatile.
The housing market is, in general, a very stable market. In the case of London, it is affected when problems appear at the macroeconomic level, as expected. From 2016 there is a change of trend, and we will need more data to predict it with more precision. This is basically due to the Covid-19 and Brexit conditions, which create a context in which it is difficult to maintain the volume of transactions and the high average prices reached in 2016.
Given the extraordinary situation, government decisions about tax policies and incentives could also affect this market.
So, given the time-series analysis, it is difficult to think that the housing market will not continue to grow, considering the trend of the last 25 years. However, a period of uncertainty will likely affect this market negatively; how long the recovery takes is a matter for another study.