How will Covid-19 impact the “new normality” of the housing market? Do you think it will be the right moment to buy a house?
Given the extraordinary characteristics of Covid-19, it is difficult to make precise predictions of macroeconomic conditions in a post-pandemic world.
However, we are able to analyze the data we have and learn from the past to detect patterns.
[My name is Ricardo Santana, and I hope you enjoy this project, developed to understand the housing market in London. I am passionate about data and found this dataset challenging and motivating, so I decided to explore it and share the results. I will explore the data, visualize it through a geospatial approach, and identify the trends in London housing prices. Then I will focus on the Kensington and Chelsea borough and propose a model for price prediction. Note that this dataset includes information up to January 2020.]
First of all, I prepare the environment and load libraries to start the study:
{
# Install any missing packages, then load everything
packages = c("knitr","tidyverse","scales","stringr","DataExplorer","caret",
"nnet","rpart","rpart.plot","e1071",
"randomForest","xgboost","ada","MASS","questionr",
"psych","car","Hmisc","jsonlite","ggplot2","dplyr",
"gridExtra","corrplot","lubridate","plotly","maps","maptools","broom",
"data.table","httr","leaflet","widgetframe","here",
"rgdal","raster","Metrics","gbm","grid","viridis","dygraphs","xts","igraph","ggraph",
"ggfortify","forecast","gapminder","gganimate")
newpack = packages[!(packages %in% installed.packages()[,"Package"])]
if(length(newpack)) install.packages(newpack)
a = lapply(packages, library, character.only = TRUE)
}
We load the data we will be working with: a publicly available Housing in London dataset with monthly records by area.
We start with around 13,000 observations. Every observation includes variables such as the average price and the London area it refers to. For deeper studies of this matter, the House Price Index would be a more suitable metric; however, the average price is our best indicator of the housing market given this dataset. I emphasize that this is a practice project that shows what can be learned from past data.
So, we summarize the data to get a preliminary idea of the dataset:
## date area average_price code
## Min. :1995-01-01 Length:13549 Min. : 40722 Length:13549
## 1st Qu.:2001-04-01 Class :character 1st Qu.: 132380 Class :character
## Median :2007-07-01 Mode :character Median : 222919 Mode :character
## Mean :2007-06-30 Mean : 263520
## 3rd Qu.:2013-10-01 3rd Qu.: 336843
## Max. :2020-01-01 Max. :1463378
##
## houses_sold no_of_crimes borough_flag
## Min. : 2 Min. : 0 Min. :0.0000
## 1st Qu.: 247 1st Qu.:1623 1st Qu.:0.0000
## Median : 371 Median :2132 Median :1.0000
## Mean : 3894 Mean :2158 Mean :0.7333
## 3rd Qu.: 3146 3rd Qu.:2582 3rd Qu.:1.0000
## Max. :132163 Max. :7461 Max. :1.0000
## NA's :94 NA's :6110
## Observations: 13,549
## Variables: 7
## $ date <date> 1995-01-01, 1995-02-01, 1995-03-01, 1995-04-01, 1995...
## $ area <chr> "city of london", "city of london", "city of london",...
## $ average_price <dbl> 91449, 82203, 79121, 77101, 84409, 94901, 110128, 112...
## $ code <chr> "E09000001", "E09000001", "E09000001", "E09000001", "...
## $ houses_sold <dbl> 17, 7, 14, 7, 10, 17, 13, 14, 17, 14, 11, 18, 17, 10,...
## $ no_of_crimes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ borough_flag <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
We can observe that there are missing values (NAs) in houses_sold (94) and no_of_crimes (6,110), which we will handle with na.rm or filtering. The observations are monthly records from January 1995 to January 2020. Furthermore, the borough_flag column takes the values 1 or 0: London boroughs are flagged with 1, which will be relevant for filtering the data and inferring knowledge.
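A quick way to audit missing values per column is `colSums(is.na(...))`. A minimal sketch on a synthetic stand-in frame with the same columns (the values below are illustrative, not the real dataset):

```r
# Synthetic stand-in mirroring the dataset's columns (illustrative values only)
data_demo <- data.frame(
  date          = as.Date(c("1995-01-01", "1995-02-01", "1995-03-01")),
  area          = "city of london",
  average_price = c(91449, 82203, 79121),
  code          = "E09000001",
  houses_sold   = c(17, NA, 14),
  no_of_crimes  = c(NA, NA, 120),
  borough_flag  = 1
)
# Count missing values per column
na_counts <- colSums(is.na(data_demo))
na_counts
```

Running the same call on the real data frame reproduces the NA counts shown in the summary above.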
data1 <- data %>%
mutate(code = factor(code),
date = ymd(date),
area = factor(area),
average_price = as.numeric(average_price),
houses_sold = as.numeric(houses_sold),
no_of_crimes = as.numeric(no_of_crimes),
borough_flag = as.numeric(borough_flag))
Let’s visualize the dataset.
price <- data1 %>%
filter(area %in% c("england","london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))
ken <- price %>%
filter(area %in% c("england"))%>%
mutate(mean = mean/1000)
cam <- price %>%
filter(area %in% c("london"))%>%
mutate(mean = mean/1000)
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(3,6)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Mean Price Housing Market: London vs England")%>%
dyOptions(fillGraph=TRUE) %>%
dyAxis("y", label = "Price (Thousands of £)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dySeries("mean", label = "England")%>%
dySeries("mean1", label = "London")%>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
prueba2 <- data1 %>%
mutate(borough_flag = factor(borough_flag, levels = c("0", "1"), labels = c("Rest of Great Britain","London")))%>%
filter(year(date) > 2003)%>%
group_by(Territory = borough_flag,year = year(date)) %>%
summarise(vol = sum(houses_sold))
cam <- prueba2 %>%
filter(Territory %in% c("London"))%>%
mutate(vol = vol/1000)
don <- xts(x=cam[,c(3)], order.by= as.Date(ISOdate(cam$year, 1, 1)))
dygraph(don, main = "Volume of Transactions in London")%>%
dyOptions( fillGraph=TRUE) %>%
dyAxis("y", label = "Number of transactions (thousands)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dySeries("vol", label = "Transactions")%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2018-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
per <- data1 %>%
filter(area %in% c("england","london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))%>%
mutate(pct_change = (mean/lag(mean) - 1) * 100)%>%
mutate(pct_change = replace_na(pct_change, 0))
ken <- per %>%
filter(area %in% c("england"))
cam <- per %>%
filter(area %in% c("london"))
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(4,8)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Price Housing Market Variation (%)")%>%
dyOptions( fillGraph=TRUE) %>%
dyAxis("y", label = "Price variation (%)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dySeries("pct_change", label = "England")%>%
dySeries("pct_change1", label = "London")%>%
dyLegend(width = 650)%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
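The pct_change column computed above is a simple lag ratio; a base-R sketch with hypothetical yearly means shows the arithmetic:

```r
# Year-over-year percentage change as a lag ratio (hypothetical yearly means)
prices <- c(100, 110, 121, 133.1)
pct_change <- (prices / c(NA, head(prices, -1)) - 1) * 100
pct_change[is.na(pct_change)] <- 0   # same role as replace_na(pct_change, 0)
pct_change
```

In the pipeline, dplyr's `lag()` plays the role of the shifted vector, and `replace_na()` zeroes the first year, which has no predecessor.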
First of all, we compare the housing prices in London with the rest of Great Britain. These two graphs (mean price and its percentage variation) highlight several points worth noting.
Next, we need to see the distribution of average prices across London, so I take a geospatial approach to identify the boroughs. I map each borough's average price over the last 5 years:
states <- readOGR("London_Borough_Excluding_MHW.shp")
states@data$NAME <- gsub('"', '', states@data$NAME)   # strip stray quotes (result must be assigned)
states@data$NAME <- str_to_lower(states@data$NAME, locale = "en")
data3 <- data1 %>%
filter(year(date) %in% c(2015:2020))%>%
group_by(NAME = area) %>%
summarise(mean = mean(average_price, na.rm = TRUE))
fusion <- merge(states, data3, by = "NAME")
geoData_latlon <- spTransform(fusion, CRS("+proj=longlat +datum=WGS84"))
factpal2 <- colorFactor("Reds", geoData_latlon$mean)
leaflet() %>%
addTiles() %>%
addPolygons(label = ~(paste("Borough:",str_to_title(NAME),":","£",round(mean, digits = 2))),data = geoData_latlon,
stroke = FALSE, smoothFactor = 0.2,fillOpacity = 0.8, color = ~factpal2(mean),
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",direction = "auto"), group = "NYC Boroughs")
We see that most of the expensive boroughs are located in the centre: the farther we move from it, the cheaper the houses become, although there are important exceptions. The most expensive boroughs are Kensington and Chelsea, Westminster and Camden, in that order (hover over the map to see the amounts). The boroughs around them, such as Hammersmith and Fulham, City of London, Wandsworth and Islington, also show remarkable prices, as do the boroughs around Camden such as Barnet or Haringey. One borough lies away from this expensive core yet still shows high prices: Richmond upon Thames. We also see that peripheral west boroughs such as Brent, Harrow and Ealing are more expensive than their peripheral east counterparts.
Now that we have visualized the boroughs, let's see how prices evolve across them to identify patterns. First, we check the three most expensive boroughs: Camden, Kensington and Chelsea, and Westminster (Group 1). Then we check some of the boroughs around them (Group 2). Finally, we also check the cheapest boroughs (Group 3).
In this case, we visualize the trend of Inner London and Outer London prices to try to infer a pattern. Inner London includes the following boroughs: City of London, Camden, Greenwich, Hackney, Hammersmith and Fulham, Islington, Kensington and Chelsea, Lambeth, Lewisham, Newham, Southwark, Tower Hamlets, Wandsworth, and Westminster. Outer London includes these boroughs: Barking and Dagenham, Barnet, Bexley, Brent, Bromley, Croydon, Ealing, Enfield, Haringey, Harrow, Havering, Hillingdon, Hounslow, Kingston upon Thames, Merton, Redbridge, Richmond upon Thames, Sutton, and Waltham Forest.
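The grouping above can also be written down as a plain vector and used to filter borough-level rows directly (a sketch; the chunks below instead use the precomputed "inner london" / "outer london" rows of the dataset):

```r
# Inner London boroughs, lower-cased to match the `area` column
inner_london <- c("city of london", "camden", "greenwich", "hackney",
                  "hammersmith and fulham", "islington", "kensington and chelsea",
                  "lambeth", "lewisham", "newham", "southwark", "tower hamlets",
                  "wandsworth", "westminster")
length(inner_london)  # 14 boroughs
# hypothetical usage: data1 %>% filter(area %in% inner_london)
```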
price <- data1 %>%
filter(area %in% c("inner london","outer london")) %>%
group_by(area, year = year(date)) %>%
summarise(mean = mean(average_price, na.rm = TRUE))%>%
mutate(pct_change = (mean/lag(mean) - 1) * 100)%>%
mutate(pct_change = replace_na(pct_change, 0))
ken <- price %>%
filter(area %in% c("inner london"))
cam <- price %>%
filter(area %in% c("outer london"))
data2 <- cbind(ken,cam)
don <- xts(x=data2[,c(4,8)], order.by= as.Date(ISOdate(data2$year, 1, 1)))
dygraph(don, main = "Price Housing Market Variation (%): Inner London vs Outer London")%>%
dyOptions(fillGraph=TRUE) %>%
dyAxis("y", label = "Variation of price (%)") %>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dySeries("pct_change", label = "Inner London")%>%
dySeries("pct_change1", label = "Outer London")%>%
dyShading(from = "2007-1-1", to = "2009-1-1", color = "#FFE6E6") %>%
dyShading(from = "2016-1-1", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2007-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-1-1", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
Let’s visualize different groups of boroughs:
1) Group 1: the 3 most expensive zones in the time series: Camden, Westminster, and Kensington and Chelsea.
2) Group 2: boroughs around Group 1 with high prices.
3) Group 3: peripheral boroughs from Outer London with the cheapest average prices.
west <- data1 %>%
filter(area %in% c("westminster"))%>%
mutate(average_price = average_price/1000)
ken <- data1 %>%
filter(area %in% c("kensington and chelsea"))%>%
mutate(average_price = average_price/1000)
cam <- data1 %>%
filter(area %in% c("camden"))%>%
mutate(average_price = average_price/1000)
data2 <- merge(ken,west, by = "date")
data2 <- merge(data2,cam, by = "date")
don <- xts(x=data2[,c(3,9,15)], order.by=data2$date)
dyUnzoom <-function(dygraph) {
dyPlugin(
dygraph = dygraph,
name = "Unzoom",
path = system.file("plugins/unzoom.js", package = "dygraphs")
)
}
p <- dygraph(don, main = "Most expensive boroughs in London")%>%
dyOptions( fillGraph=TRUE ) %>%
dyAxis("y", label = "Price (thousands of £)") %>%
dySeries("average_price.y", label = "Westminster")%>%
dySeries("average_price.x", label = "Kensin. & Chelsea")%>%
dySeries("average_price", label = "Camden")%>%
dyAxis("x", drawGrid = FALSE, label = "Time (Year)") %>%
dyRangeSelector(height = 20)%>%
dyLegend(width = 650)%>%
dyShading(from = "2008-1-1", to = "2009-3-1", color = "#FFE6E6") %>%
dyShading(from = "2016-6-23", to = "2020-1-1", color = "#CCEBD6") %>%
dyEvent("2008-1-1", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016-6-23", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
p
We can observe that 2008/2009 brought a recession in terms of prices, and for Kensington and Chelsea the fall was more intense. From March 2009 to December 2014 the prices of this borough experienced significant growth. After 2016, the ups and downs reveal a volatile price, more so than in Westminster and Camden.
In this section we forecast the average price of the Kensington and Chelsea borough. It is the most expensive borough, and we have seen a negative shift from 2016 with volatile behaviour. Furthermore, we cannot discard the possibility of this behaviour radiating from Inner London to Outer London. I will first apply a decomposition, and then produce predictions with the Holt-Winters and ARIMA approaches.
prueba <- data1 %>%
dplyr::filter(area %in% c("kensington and chelsea"))
west_ts <- ts(prueba, start = c(1995,1), frequency = 12)
west_l <- log(west_ts[,3])
flour.stl <- stl(west_l, "period")
autoplot(flour.stl,ts.colour = "red")+ ggtitle("Kensington and Chelsea borough") +
xlab("Time (Year)") + ylab("Average Price") +
theme(plot.title = element_text(hjust = 0.5))
west_ts <- ts(prueba[,3], start = c(1995,1), frequency = 12)
ggseasonplot(west_ts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("Average Price") +
ggtitle("Seasonal plot: Kensington and Chelsea")
We detect a seasonal curve, which captures changes that repeat across the series. The remainder is the random noise in the series. If we sum these components together with the trend, we recover the original, undecomposed curve. It is useful to see that the seasonal and noise components span significant intervals: indeed, the difference between the trend and the data is remarkable.
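That additivity (data = seasonal + trend + remainder) can be verified directly. A minimal check using the built-in AirPassengers series as a stand-in, since the borough data is not reproduced here:

```r
# Additive STL decomposition: the three components reconstruct the series
x   <- log(AirPassengers)            # log-transform, as in the analysis
dec <- stl(x, s.window = "periodic") # seasonal-trend decomposition by loess
recon <- rowSums(dec$time.series)    # seasonal + trend + remainder
max(abs(recon - as.numeric(x)))      # ~0: the decomposition is exact
```

The same identity holds for the Kensington and Chelsea series decomposed above, since the remainder is defined as whatever the seasonal and trend components do not explain.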
But how do the prices evolve in terms of variation compared with the previous period? Let's look at a candlestick chart.
kc <- data1 %>%
filter(area == "kensington and chelsea") %>%
group_by(Year = year(date))%>%
summarise(Open = first(average_price)/1000,   # yearly OHLC, in thousands of £
Maximum = max(average_price)/1000,
Minimum = min(average_price)/1000,
Close = last(average_price)/1000)
dygraph(kc) %>%
dyCandlestick()%>%
dyLegend(width = 650)%>%
dyAxis("y", label = "Price (Thousands £)") %>%
dyAxis("x", label = "Time (years)", drawGrid = FALSE)%>%
dyShading(from = "2008", to = "2011", color = "#FFE6E6") %>%
dyShading(from = "2016", to = "2020", color = "#CCEBD6") %>%
dyEvent("2008", "Downturn", labelLoc = "bottom") %>%
dyEvent("2016", "Brexit", labelLoc = "bottom")%>%
dyUnzoom()
We see more clearly that the trend curve flattens from 2016. Furthermore, 2015 shows a negative value, which marks the beginning of the price reduction. If we check other boroughs, we see a fall from 2016, but not in 2015.
HW <-HoltWinters(west_l)
myhw <- forecast(HW, h = 24, findfrequency = TRUE)
autoplot(myhw) + ggtitle("Kensington and Chelsea Housing Market: Holt-Winters Prediction") +
xlab("Time (year)") + ylab("Log Average Price") + theme_minimal()+
theme(plot.title = element_text(hjust = 0.5)) +
xlim(c(2010,2022))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
Here we present a prediction based on Holt-Winters, which creates a corridor of 80% and 95% confidence intervals (the forecast defaults). This method suggests that the trend will continue growing with a low slope.
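For reference, the same kind of corridor can be produced with base R alone (no forecast package), again using AirPassengers as a stand-in series:

```r
# Holt-Winters fit and 24-month prediction interval with base R only
x  <- log(AirPassengers)                 # stand-in for the borough series
hw <- HoltWinters(x)                     # additive Holt-Winters fit
fc <- predict(hw, n.ahead = 24,
              prediction.interval = TRUE, level = 0.95)
head(fc)                                 # columns: fit, upr, lwr
```

`forecast(HW, h = 24)` as used above wraps this fit and adds the 80% band and plotting conveniences.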
arim.ts <- ts(log(prueba[,3]), start = c(1995,1), frequency = 12)
arim <- auto.arima(arim.ts)
checkresiduals(arim)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(1,0,1)[12] with drift
## Q* = 32.26, df = 18, p-value = 0.02048
##
## Model df: 6. Total lags used: 24
The residual graphs give us an idea of the model's performance. The mean of the residuals is close to zero, and the time plot suggests we can treat the variance as roughly constant. However, the histogram suggests the distribution may not be normal, given the right tail, and the Ljung-Box p-value (0.02) indicates that some autocorrelation remains in the residuals, so the fit is reasonable but not perfect.
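The Ljung-Box statistic reported by checkresiduals() is the test available in base R as stats::Box.test; a self-contained sketch with simulated residuals shows how the degrees of freedom in the output above (18 = 24 lags − 6 model df) arise:

```r
# Ljung-Box test on simulated white-noise residuals (stand-in for ARIMA residuals)
set.seed(42)
res <- rnorm(200)
bt  <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 6)
unname(bt$parameter)   # df = lag - fitdf = 18, matching the output above
```

A small p-value, as in our output, means the residuals still carry some autocorrelation that the model has not captured.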
autoplot(forecast(arim, h = 24), fcol = "red") +
ggtitle("Kensington and Chelsea Housing Market: ARIMA Prediction")+
xlim(c(2010,2022))+
ylab('Log Average Price')+
xlab('Time (Year)')+
theme_minimal()
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
The predictions produced by the ARIMA model match those of the Holt-Winters method. So, barring extraordinary circumstances, the average price of the housing market in the Kensington and Chelsea borough should continue to grow slightly.
We can conclude with several relevant points discovered through the study:
The housing market trends of both London and England show two pivotal moments: the 2008 downturn and the Brexit referendum (2016). In both events the average price fell. During the downturn, the fall was sharp but short; since 2016 prices have stalled.
The average price decreased when the volume of transactions fell. In the years with the most growth, such as 2013 and 2014, the volume of transactions grew. So the volume of transactions has an important influence on the adjustment of demand and supply.
In 2015 the average price of Inner London decreased whereas Outer London's increased; this could be a radiation effect from Inner London to Outer London. However, other boroughs have not presented this behaviour, given the incentives to purchase houses worth less than £600,000.
Exploring Kensington and Chelsea, the decomposition showed that the noise and seasonal components are strong, although there is a clear positive trend.
The Holt-Winters method predicted moderate growth for the following 2 years, under normal circumstances.
The ARIMA model gave a similar result: growing prices, although its forecast curve is less volatile.
The housing market is, in general, a very stable market. In the case of London, it is affected when problems appear at the macroeconomic level, as expected. From 2016 there is a change of trend, and we will need more data to predict it with more precision. This is basically due to the Covid-19 and Brexit conditions, which create a context in which it is difficult to maintain the volume of transactions and the high average prices reached in 2016.
Given the extraordinary situation, government decisions about tax policies and incentives could also affect this market.
So, given the time-series analysis, it is difficult to think that the housing market will not continue to grow, considering the trend of the last 25 years. However, a period of uncertainty will likely affect this market negatively; how long the recovery takes is a matter for another study.