Icons and tips: The case of NYC yellow taxis

Manhattan, New York City.

INTRODUCTION

When was the last time you felt unsure when calculating a tip? How often do you feel you should have given more (or less)?

Individually, the variables that influence the tip amount might be infinite.

However, thanks to the development of artificial intelligence, and especially of machine learning, we are able to infer patterns in how we make decisions.

This study takes place in New York City (NYC). The tip amounts of the iconic yellow taxis can be predicted in order to adjust the strategies for existing public transport companies or even for new projects. Let’s explore this situation. Make yourself comfortable. It will be a nice trip.

[My name is Ricardo Santana, and I hope you enjoy this project, developed to understand the dynamics of yellow taxis in New York City (NYC). I am passionate about data and found this dataset challenging and motivating, so I decided to explore it and share it. I enjoy public transport topics and it was fun to practice with them. Concretely, I will explore the data, visualize it through a geospatial approach, and identify the relevant variables for the tip amount. Next, I will create models to predict these amounts. The total amount is possibly correlated with the tip amount, but we will verify this and analyze other variables that may contribute to higher accuracy. The resulting model will be suitable for business purposes; I will even provide the code to create a REST API to share it, for instance, with a client interested in public transport.]

1. EXPLORING

First of all, I prepare the environment and load libraries to start the study:

Environment

{
  rm(list = ls())
  cat("\014")
  graphics.off()
}

Libraries

{
  packages = c("knitr","tidyverse","scales","stringr","DataExplorer","caret",
               "nnet","rpart", "rpart.plot","e1071",
               "randomForest","xgboost","ada","MASS","questionr",
               "psych","car","Hmisc", "jsonlite","ggplot2","dplyr","scales",
               "gridExtra","corrplot","lubridate","plotly","maps","maptools","broom",
               "data.table","maptools","httr","leaflet","widgetframe","here",
               "rgdal","raster","Metrics","gbm","grid","viridis")
  newpack  = packages[!(packages %in% installed.packages()[,"Package"])]
  if(length(newpack)) install.packages(newpack)
  a=lapply(packages, library, character.only=TRUE)
}

Functions

source("FunctionsRicardo.R")

1.1. We load the data

We load the data we will be working with. This dataset was extracted from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

data <- read_csv("yellow_tripdata_2017-03.csv")
data2 <- read_csv("yellow_tripdata_2017-06.csv")
data3 <- read_csv("yellow_tripdata_2017-11.csv")

fusion <- bind_rows(data,data2,data3)

1.2. Preliminary exploration

We start with around 29 million observations (around 10 million per month, for March, June and November 2017). Every observation corresponds to one NYC yellow taxi trip. We could create models directly with the raw data, but the result (if possible at all) would not be realistic. We need to check the data we are dealing with:

Summary

summary(fusion)

##     VendorID     tpep_pickup_datetime          tpep_dropoff_datetime        
##  Min.   :1.000   Min.   :2001-01-01 00:04:13   Min.   :2001-01-01 00:04:51  
##  1st Qu.:1.000   1st Qu.:2017-03-23 06:52:56   1st Qu.:2017-03-23 07:04:51  
##  Median :2.000   Median :2017-06-14 07:51:59   Median :2017-06-14 08:05:43  
##  Mean   :1.546   Mean   :2017-07-02 02:55:25   Mean   :2017-07-02 03:12:20  
##  3rd Qu.:2.000   3rd Qu.:2017-11-07 05:30:25   3rd Qu.:2017-11-07 05:47:49  
##  Max.   :2.000   Max.   :2041-11-15 02:57:16   Max.   :2041-11-15 03:12:19  
##  passenger_count   trip_distance        RatecodeID     store_and_fwd_flag
##  Min.   :  0.000   Min.   :   0.000   Min.   : 1.000   Length:29236424   
##  1st Qu.:  1.000   1st Qu.:   0.970   1st Qu.: 1.000   Class :character  
##  Median :  1.000   Median :   1.600   Median : 1.000   Mode  :character  
##  Mean   :  1.618   Mean   :   2.919   Mean   : 1.043                     
##  3rd Qu.:  2.000   3rd Qu.:   3.010   3rd Qu.: 1.000                     
##  Max.   :192.000   Max.   :9496.980   Max.   :99.000                     
##   PULocationID    DOLocationID    payment_type    fare_amount      
##  Min.   :  1.0   Min.   :  1.0   Min.   :1.000   Min.   :  -550.0  
##  1st Qu.:114.0   1st Qu.:107.0   1st Qu.:1.000   1st Qu.:     6.5  
##  Median :162.0   Median :162.0   Median :1.000   Median :     9.5  
##  Mean   :163.2   Mean   :161.2   Mean   :1.329   Mean   :    13.1  
##  3rd Qu.:233.0   3rd Qu.:233.0   3rd Qu.:2.000   3rd Qu.:    14.5  
##  Max.   :265.0   Max.   :265.0   Max.   :5.000   Max.   :630461.8  
##      extra             mta_tax           tip_amount        tolls_amount     
##  Min.   :-53.7100   Min.   : -0.5000   Min.   :-112.000   Min.   : -17.500  
##  1st Qu.:  0.0000   1st Qu.:  0.5000   1st Qu.:   0.000   1st Qu.:   0.000  
##  Median :  0.0000   Median :  0.5000   Median :   1.360   Median :   0.000  
##  Mean   :  0.3339   Mean   :  0.4973   Mean   :   1.874   Mean   :   0.329  
##  3rd Qu.:  0.5000   3rd Qu.:  0.5000   3rd Qu.:   2.460   3rd Qu.:   0.000  
##  Max.   : 69.8000   Max.   :140.0000   Max.   : 450.000   Max.   :1018.950  
##  improvement_surcharge  total_amount     
##  Min.   :-0.3000       Min.   :  -550.3  
##  1st Qu.: 0.3000       1st Qu.:     8.8  
##  Median : 0.3000       Median :    11.8  
##  Mean   : 0.2996       Mean   :    16.4  
##  3rd Qu.: 0.3000       3rd Qu.:    17.8  
##  Max.   : 1.0000       Max.   :630463.1

We can observe that there are no missing values (NAs). However, some values are not realistic, such as negative total amounts. The same applies to the fare amount. A trip distance of 0 is not coherent either: it means that either the taxi did not move or the system failed. Negative values are also present in other variables, such as the MTA tax. Furthermore, the pickup and dropoff date columns contain impossible values (years 2001 and 2041). Finally, most variables are stored as numeric even when they are really categorical codes (VendorID, RatecodeID, payment_type and so on), so we will need to change their class to work with the models and optimize them.
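A quick way to quantify these problems is to count, column by column, how many records are negative or zero. The sketch below uses a tiny made-up data frame; the values and the helper name `count_invalid` are assumptions for illustration and not part of the original analysis, but the same idea could be applied to `fusion`.

```r
# Toy data standing in for a few columns of `fusion` (values are invented)
toy <- data.frame(
  total_amount  = c(18.36, -550.3, 12.80, 0),
  trip_distance = c(4.06, 0, 2.73, 1.10),
  mta_tax       = c(0.5, -0.5, 0.5, 0.5)
)

# Hypothetical helper: number of negative values and zeros in each column
count_invalid <- function(df) {
  data.frame(
    variable = names(df),
    negative = vapply(df, function(x) sum(x < 0, na.rm = TRUE), integer(1)),
    zero     = vapply(df, function(x) sum(x == 0, na.rm = TRUE), integer(1)),
    row.names = NULL
  )
}

invalid <- count_invalid(toy)
print(invalid)
```

A table like this makes it easier to decide which thresholds to use when filtering later on.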

Structure

glimpse(fusion)

## Rows: 29,236,424
## Columns: 17
## $ VendorID              <dbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,…
## $ tpep_pickup_datetime  <dttm> 2017-03-09 21:30:11, 2017-03-09 21:47:00, 2017…
## $ tpep_dropoff_datetime <dttm> 2017-03-09 21:44:20, 2017-03-09 21:58:01, 2017…
## $ passenger_count       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 5, 5,…
## $ trip_distance         <dbl> 4.06, 2.73, 2.27, 3.86, 3.45, 2.80, 6.00, 8.70,…
## $ RatecodeID            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ store_and_fwd_flag    <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ PULocationID          <dbl> 148, 48, 79, 237, 41, 261, 87, 142, 68, 261, 16…
## $ DOLocationID          <dbl> 48, 107, 162, 41, 162, 79, 142, 181, 141, 163, …
## $ payment_type          <dbl> 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fare_amount           <dbl> 14.0, 11.5, 10.0, 12.0, 12.0, 12.5, 19.5, 30.0,…
## $ extra                 <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ mta_tax               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount            <dbl> 3.06, 0.00, 2.82, 3.99, 0.00, 1.00, 3.50, 7.80,…
## $ tolls_amount          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.…
## $ total_amount          <dbl> 18.36, 12.80, 14.12, 17.29, 13.30, 14.80, 24.30…

Summarizing, we need to clean and prepare the data. Let’s visualize the dataset.

2. VISUALIZATION

First, we take a geospatial approach to see the distribution of boroughs and taxi zones in NYC.

To do so, we merge the information we already have with a geospatial dataset of NYC and generate a map with zones and boroughs (see the control in the upper-right corner).

Note that this map shows only the zones we will be working with. The colors help to distinguish the 263 taxi zones in NYC. Hover over the map to check the names of the zones and boroughs.

states <- readOGR("taxi_zones.shp")

geoData_latlon <- spTransform(states, CRS("+proj=longlat +datum=WGS84"))

factpal <- colorFactor(rainbow(5), geoData_latlon$zone)
factpal2 <- colorFactor(rainbow(10), geoData_latlon$borough)

leaflet() %>%
  addTiles() %>%
  addPolygons(label = ~(paste("(Borough:", borough,")")),data = geoData_latlon, 
              stroke = FALSE, smoothFactor = 0.2,fillOpacity = 1, color = ~factpal2(borough),
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
              textsize = "15px",direction = "auto"), group = "NYC Boroughs")%>%
  
  addPolygons(label = ~(paste("(Zone:", zone,")")),data = geoData_latlon, stroke = FALSE, 
              smoothFactor = 0.2, fillOpacity = 0.8, color = ~factpal(zone), 
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",direction = "auto"), group = "NYC Taxi Zones - NYC Boroughs") %>%
  
  addLayersControl(overlayGroups = c("NYC Taxi Zones - NYC Boroughs"),
  options = layersControlOptions(collapsed = FALSE))

2.1. Pickup and dropoff zones

Once we have identified the taxi zones, what can we see from the air?

To observe the trips according to their location, we first need to apply filters based on the information we already have from the data dictionary and the summary: the tip amount only records credit card tips. So we filter by payment type and make sure that there are no negative tip amounts.

fusion_mp <- fusion %>% 
  filter(payment_type == 1) %>%
  filter(tip_amount >= 0)

Our second map shows the most (and least) popular zones for trips, taking the mean tip amount into consideration. The green polygons represent the mean tip amount by pickup zone. The blue polygons (check the upper-right corner) represent dropoff zones. To build it, we add extra information about the boroughs and yellow taxi zones in NYC (included in the DOlocationtaxi and Pulocationtaxi datasets).

DOlocationtaxi <- read_csv("DOLocation.csv")
Pulocationtaxi <- read_csv("PULocation.csv")

# Add borough and zone names for dropoff and pickup locations
fusion_zones <- fusion_mp %>%
  left_join(DOlocationtaxi, by = "DOLocationID") %>%
  left_join(Pulocationtaxi, by = "PULocationID") %>%
  filter(!is.na(DOBorough) & !is.na(PUBorough))

# Mean tip by pickup zone
IDPU <- fusion_zones %>% 
  group_by(LocationID = PULocationID) %>% 
  summarise(Mean_tipPU = mean(tip_amount)) 
geoData1 <- merge(states, IDPU, by = "LocationID")

# Mean tip by dropoff zone
IDDO <- fusion_zones %>% 
  group_by(LocationID = DOLocationID) %>% 
  summarise(Mean_tipDO = mean(tip_amount)) 
geoData2 <- merge(geoData1, IDDO, by = "LocationID")

geoData_latlon <- spTransform(geoData2, CRS("+proj=longlat +datum=WGS84"))
factpal <-colorNumeric(palette = "Blues", domain = geoData_latlon$Mean_tipDO)
factpal2 <-colorNumeric(palette = "Greens", domain = geoData_latlon$Mean_tipPU)

geoData_latlon%>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(label = ~(paste(zone,"(Mean Tip:", Mean_tipDO,")")), stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.9, color = ~factpal(Mean_tipDO),
              highlightOptions = highlightOptions(color = "red",
                                                  weight = 2,
                                                  bringToFront = TRUE),
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
              textsize = "15px",direction = "auto"), group = "Dropoff") %>%
  
  addPolygons(label = ~(paste(zone,"(Mean Tip:", Mean_tipPU,")")), stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.9, color = ~factpal2(Mean_tipPU),
              highlightOptions = highlightOptions(color = "red",
                                                  weight = 2,
                                                  bringToFront = TRUE),
              labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px", direction = "auto"), group = "Pickup (Greens) - Dropoff (Blues)") %>%

  addLayersControl(overlayGroups = c("Pickup (Greens) - Dropoff (Blues)"),
              options = layersControlOptions(collapsed = FALSE))

We can observe that a considerable proportion of the zones with a high mean tip are peripheral, taking Manhattan as the reference. The reason might be the weaker presence of public transport and the longer trips in these zones. We need more information, and we will get it from the following maps.

2.2. Short trips

We need to confirm the distribution of short trips. The following map shows the most usual pickup zones for trips that start and finish in the same borough:

IDST <- fusion_zones %>% 
  filter(DOBorough == PUBorough)%>% 
  count(LocationID = PULocationID)

geoData1 <- merge(states, IDST, by = "LocationID")

geoData_latlon <- spTransform(geoData1, CRS("+proj=longlat +datum=WGS84"))
factpal <-colorNumeric(palette = "YlOrRd", domain = geoData_latlon$n)

geoData_latlon%>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(label = ~(paste(zone,"(Number of trips:", n,")")), stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.9, color = ~factpal(n),
              highlightOptions = highlightOptions(color = "red",
                                                  weight = 2,
                                                  bringToFront = TRUE),
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
              textsize = "15px",direction = "auto"), group = "Dropoff")

We note that Manhattan has the highest number of trips inside the same borough. We could consider them “short trips”. The east side of Central Park, along with Fifth and Seventh avenues and Midtown, are the zones with the most pickups for this kind of trip.

Furthermore, there is a remarkable number at JFK and LaGuardia airports for trips to Queens.

This map, along with the other maps, gives us a clue about the important role of the airports, especially JFK and Newark. We will analyze them in the following section.

2.3. Best locations for airport trips

First, we observe the most usual zones to pick up a taxi in order to go to Newark airport:

IDNEWARK <- fusion_zones %>% 
  filter(DOLocationID == 1) %>% 
  count(LocationID = PULocationID)

geoData1 <- merge(states, IDNEWARK, by = "LocationID")

geoData_latlon <- spTransform(geoData1, CRS("+proj=longlat +datum=WGS84"))
factpal <-colorNumeric(palette = "Reds", domain = geoData_latlon$n)
factpal2 <-colorNumeric(palette = "YlGnBu", domain = geoData_latlon$n)

geoData_latlon%>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(label = ~(paste(zone,"(Number of trips:", n,")")), stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.9, color = ~factpal(n),
              highlightOptions = highlightOptions(color = "red",
                                                  weight = 2,
                                                  bringToFront = TRUE),
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
              textsize = "15px",direction = "auto"), group = "Dropoff")

In this map, we see that for trips to Newark airport, the zone with the most pickups is the heart of Manhattan: Midtown and Times Square. This is foreseeable: apart from its tourist appeal, Times Square has one of the best-connected subway stops.

Regarding the JFK airport, this is the map of pickup zones:

IDJFK <- fusion_zones %>% 
  filter(DOLocationID == 132) %>% 
  count(LocationID = PULocationID)

geoData1 <- merge(states, IDJFK, by = "LocationID")

geoData_latlon <- spTransform(geoData1, CRS("+proj=longlat +datum=WGS84"))
factpal <-colorNumeric(palette = "Purples", domain = geoData_latlon$n)

geoData_latlon%>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(label = ~(paste(zone,"(Number of trips:", n,")")), stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.9, color = ~factpal(n),
              highlightOptions = highlightOptions(color = "red",
                                                  weight = 2,
                                                  bringToFront = TRUE),
              labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
              textsize = "15px",direction = "auto"), group = "Dropoff")

For JFK airport, the pickup zones are very similar to those for Newark. We must highlight the high number of trips inside these airports.

We can summarize what we have so far.

There are different significant zones in terms of dropoff and pickup trips:

The borough of Manhattan has more short trips. Hence, the mean tip is lower.
The trips that involve the airports, especially Newark and JFK, present a considerably high mean tip.
The good connections and the 24-hour subway seem to be what passengers prefer for reaching Manhattan from other boroughs. Manhattan and Brooklyn are the boroughs with fewer pickup zones.
The south of Queens and Brooklyn, along with the north of the Bronx, present a high mean tip.
Staten Island is the borough with the highest mean tip.
The dropoff and pickup maps show us that the further the dropoff is from Manhattan, the higher the tip. This applies to the Bronx, Queens, Staten Island and Brooklyn, and especially to the south of Brooklyn and Staten Island. It seems that the majority of passengers use the taxi service for zones where public transport is not that efficient.

A plausible strategy for a new company would be to focus on possible trips in pickup and dropoff zones with high tip and total amounts.

But what about the other variables? How can they affect the tip amount? Let’s analyze them.

2.4. Total amount vs. tip amount

Given the possible correlation between our objective variable (tip amount) and the total amount paid by the passengers, we need to visualize them. First, we need to filter, because the summary revealed unrealistic values (negative values or even trips of more than 600,000 dollars). It seems reasonable not to consider trips costing less than 2 dollars or more than 400 dollars in NYC. We only consider non-negative tip amounts of up to 150 dollars. Regarding the trip distance, we keep a range from 0.2 miles, which is possible for short trips, to 40 miles, which means going through the city:

fusion_fi <- fusion %>% 
  filter(total_amount >= 2 & total_amount <= 400) %>% 
  filter(fare_amount >= 2 & fare_amount <= 400) %>%
  filter(tip_amount >= 0 & tip_amount <= 150) %>%
  filter(trip_distance >= 0.2 & trip_distance <= 40) %>%
  filter(payment_type == 1)
  

f1 <- fusion_fi %>%
  ggplot(aes(tip_amount)) +
  geom_histogram(fill = "#20639B", bins = 100) +
  scale_x_log10() +
  scale_y_sqrt()+
  labs(y= "Number of trips", x = "Tip amount [$]")+ 
  theme_classic()


f2 <- fusion_fi %>%
  ggplot(aes(total_amount)) +
  geom_histogram(fill = "#F6D55C", bins = 100) +
  scale_x_log10() +
  scale_y_sqrt()+
  labs(y= "Number of trips", x = "Total amount [$]")+ 
  theme_classic()

layout <- matrix(c(1,2),1,2,byrow=FALSE)
multiplot(f1,f2,layout=layout)

Note that the x and y axes have logarithmic and square-root scales, respectively. We observe that, on this logarithmic scale, the majority of the observations follow an approximately normal (bell-shaped) distribution for both variables, and the two distributions are very similar. Regarding the tip amount, there is a peak at 1 dollar. Tips of 0.5 dollars or more are the most usual; however, we also need to include lower tips, mainly trips without a tip (or with a cash tip).
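One consequence of the logarithmic x axis is worth keeping in mind: a tip of exactly 0 cannot be drawn, because log10(0) is -Inf and non-finite values are left out of the plot. A minimal sketch, with invented tip amounts, to check how many observations a log axis would leave out:

```r
tips <- c(0, 0, 1, 1.36, 2.46, 0.5)   # invented tip amounts in dollars

log_tips   <- log10(tips)             # zeros become -Inf
n_dropped  <- sum(!is.finite(log_tips))
share_kept <- mean(is.finite(log_tips))

cat("Left out by a log axis:", n_dropped, "of", length(tips), "\n")
```

This is why the trips without a tip need to be examined separately from the histogram.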

We can see the correlation between these two variables for the majority of observations:

myColor <- rev(RColorBrewer::brewer.pal(11, "Spectral"))  # Spectral has at most 11 colors
myColor_scale_fill <- scale_fill_gradientn(colours = myColor)

fusion_fi %>%
  ggplot(aes(total_amount, tip_amount)) +
  stat_binhex(bins = 45)+
  myColor_scale_fill+
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Total amount [$]", y = "Tip amount [$]")+ 
  theme_classic()

As expected, there are many trips for which passengers do not pay a tip, no matter what the total amount is. However, a high number of observations follow a positive correlation. We will check the Pearson coefficient with the correlation matrix, but we can already expect it to be positive and considerably high.
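As a reminder of what the Pearson coefficient measures, here is a minimal sketch with invented amounts (they are not taken from the dataset). It computes the coefficient both on the raw scale and on the log10 scale used in the plot; with real data the two values can differ noticeably.

```r
# Invented example values in dollars (not taken from the dataset)
total_amount <- c(8.8, 11.8, 14.8, 17.8, 24.3, 30.0)
tip_amount   <- c(1.00, 1.36, 2.00, 2.46, 3.50, 5.00)

# Pearson correlation on the raw scale and on the log10 scale
r_raw <- cor(total_amount, tip_amount, method = "pearson")
r_log <- cor(log10(total_amount), log10(tip_amount), method = "pearson")

cat("Pearson r (raw):  ", round(r_raw, 3), "\n")
cat("Pearson r (log10):", round(r_log, 3), "\n")
```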

2.5. Fare amount vs. tip amount

Given that the total amount includes information from other variables, such as taxes or other charges that are not included in the fare amount, we must look at the correlation between fare amount and tip amount:

myColor <- rev(RColorBrewer::brewer.pal(11, "RdYlGn"))  # RdYlGn has at most 11 colors
myColor_scale_fill <- scale_fill_gradientn(colours = myColor)

p1 <- fusion_fi %>%
  ggplot(aes(fare_amount, tip_amount)) +
  stat_binhex(bins = 50)+
  myColor_scale_fill+
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Fare amount [$]", y = "Tip amount [$]")+ 
  theme_classic()

p2 <- fusion_fi %>%
  ggplot(aes(fare_amount)) +
  geom_histogram(fill = "#104908", bins = 100) +
  scale_y_sqrt() +
  theme(legend.position = "none")+
  labs(x = "Fare amount [$]", y="Number of trips")+ 
  theme_classic()

layout <- matrix(c(1,2),1,2,byrow=FALSE)
multiplot(p1,p2,layout=layout)

We can see the positive correlation between these variables. A large number of cases is concentrated in the range of 5-15 dollars of fare amount, which corresponds to 1-5 dollars of tip amount. Some observations show a high tip amount for a very low fare amount, which is suspicious.

2.6. Trip distance vs. tip amount

It seems logical that, apart from total amount and fare amount, trip distance will be a relevant variable for tip amount:

myColor <- rev(RColorBrewer::brewer.pal(11, "RdBu"))  # RdBu has at most 11 colors
myColor_scale_fill <- scale_fill_gradientn(colours = myColor)

p1 <- fusion_fi %>%
  ggplot(aes(trip_distance, tip_amount)) +
  stat_binhex(bins = 30)+
  myColor_scale_fill+
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Trip distance [mi]", y = "Tip amount [$]")+ 
  theme_classic()

p2 <- fusion_fi %>%
  ggplot(aes(trip_distance)) +
  geom_histogram(fill = "#051e3e", bins = 50) +
  scale_y_sqrt() +
  theme(legend.position = "none")+
  labs(x = "Distance [mi]", y="Number of trips")+ 
  theme_classic()

layout <- matrix(c(1,2),1,2,byrow=FALSE)
multiplot(p1,p2,layout=layout)

We can see that distance is also positively correlated with tip amount. We will select cases with a minimum distance. With this selection we will also discard cases that seem suspicious because of a high tip for a trip without distance. As we can infer, the distribution of the distance is right-skewed, given that the number of short trips is higher. We will take only cases with at least 0.2 miles of distance.

2.7. The rest of features

Before plotting these features, we must consider the information included in the data dictionary and extracted from the summary. Passenger count must be at most 8, given the cars currently authorized to circulate in NYC as yellow taxis. The RatecodeID variable only includes rates from 1 to 6. The variable “extra” has only 3 possible values, which are 0, 0.5 and 1 (according to rush hour and overnight charges). The same happens with the “improvement_surcharge” variable (0 or 0.3) and “mta_tax” (0 or 0.5). We also change the class of the features, to display the graphics properly.

fusion_gr <- fusion %>% 
  # %in% tests membership; == would recycle the vector position by position
  filter(passenger_count %in% 1:8) %>%
  filter(RatecodeID %in% 1:6) %>%
  filter(extra %in% c(0, 0.5, 1)) %>%
  filter(mta_tax %in% c(0, 0.5)) %>%
  filter(improvement_surcharge %in% c(0, 0.3)) %>%
  filter(payment_type == 1)
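A note on filtering a column against a set of allowed values: `==` and `%in%` behave very differently in R. With `==`, the shorter vector is recycled and compared position by position, so rows are kept or dropped depending on where they happen to fall. With `%in%`, each value is tested for membership in the set. A minimal base R sketch with invented passenger counts:

```r
passengers <- c(1, 2, 2, 9, 1, 3)   # invented passenger counts
allowed    <- c(1, 2, 3)

# `==` recycles `allowed` to 1,2,3,1,2,3 and compares position by position
by_equal <- passengers[passengers == allowed]

# `%in%` asks, for each value, whether it belongs to the allowed set
by_in <- passengers[passengers %in% allowed]

print(by_equal)
print(by_in)
```

Only the `%in%` version keeps every valid record and drops the invalid 9; the `==` version silently loses two valid trips.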

fusion_gr <- fusion_gr %>%
  mutate(VendorID = factor(VendorID),
         passenger_count = factor(passenger_count),
         RatecodeID = factor(RatecodeID),
         payment_type = factor(payment_type),
         extra = factor(extra),
         mta_tax = factor(mta_tax),
         improvement_surcharge = factor(improvement_surcharge))

p3 <- fusion_gr %>%
  group_by(RatecodeID) %>%
  count() %>%
  ggplot(aes(RatecodeID, n, fill = RatecodeID)) +
  theme(legend.position = "none")+
  geom_col() +
  scale_y_sqrt() +
  labs(x = "Rate code ID", y = "Number of trips")

p4 <- fusion_gr %>%
  ggplot(aes(passenger_count, tip_amount, color = passenger_count, 
             group = passenger_count)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none") +
  facet_wrap(~ VendorID) +
  labs(y = "Tip amount [$]", x = "Number of passengers")

p5 <- fusion_gr %>%
  ggplot(aes(store_and_fwd_flag, tip_amount, 
             color = store_and_fwd_flag, group = store_and_fwd_flag)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none") +
  facet_wrap(~ VendorID) +
  labs(y = "Tip amount [$]", x = "Store and fwd flag")
  
layout <- matrix(c(1,2,3),1,3,byrow=FALSE)
multiplot(p3,p4,p5,layout=layout)

The standard rate (RatecodeID = 1) presents the highest number of trips. We also note that the JFK airport rate (RatecodeID = 2) shows a notable number compared with the rest of the rates. It could give us information to improve the predictions of our models, or we could even use it to generate other models for airport rates in future work.

Furthermore, the number of trips with one passenger is very high, followed by trips with 2 passengers. However, the difference in tip amount between the groups does not seem significant, even if we compare between vendors. We will have more information with the correlation matrix, but initially they do not seem to be good predictors. The same applies to the store and forward flag, which probably gives us information about the connectivity of the zone; however, its correlation with tip amounts seems poor.

2.8. Time analysis

Once we have information of the other variables, we analyze now the relation of service time (hour, day and month) to detect patterns:

fusion <- fusion %>%
mutate(VendorID = as.factor(VendorID))


m1 <- fusion %>%
  mutate(hpick = month(tpep_pickup_datetime)) %>%
  group_by(hpick, VendorID) %>%
  count() %>%
  ggplot(aes(hpick, n)) +
  geom_point(aes(color = VendorID, size = 5)) +
  scale_color_manual(values=c('#FEC925','#5AB190'))+
  labs(x = "Time [Month]", y = "Number of trips") +
  theme(legend.position = "none")

m2 <- fusion %>%
  mutate(week = wday(tpep_pickup_datetime)) %>%
  group_by(week, VendorID) %>%
  count() %>%
  ggplot(aes(week, n, color = VendorID, size = 5)) +
  geom_point(aes(color = VendorID)) +
  scale_color_manual(values=c('#FEC925','#5AB190'))+
  labs(x = "Time [Day]", y = "Number of trips") +
  theme(legend.position = "none")

m3 <- fusion %>%
  mutate(hpick = hour(tpep_pickup_datetime)) %>%
  group_by(hpick, VendorID) %>%
  count() %>%
  ggplot(aes(hpick, n, color = VendorID)) +
  geom_point(aes(color = VendorID), size = 5) +
  scale_color_manual(values=c('#FEC925','#5AB190'))+
  labs(x = "Time [Hour]", y = "Number of trips") +
  theme(legend.position = "bottom")

layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(m1,m2,m3,layout=layout)

We observe in these pictures interesting points:

The number of trips is slightly lower in November compared with March and June, which is possibly related to the weather and to the higher cultural offer and public events. There are a few trips recorded with dates before and after the months we are taking as reference.
The number of trips increases during the week, and Sundays present the lowest number of trips. Note that the week starts on Sunday (day = 1).
The number of trips reaches its minimum at 5 am. From there, it increases until 6 pm, which is the maximum, possibly related to rush hour. Then it decreases again.
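The day numbering matters when reading these plots. As a minimal sketch, here is a base R equivalent of the day and hour extraction (an alternative I am assuming for illustration; the analysis itself uses `wday()` and `hour()` from lubridate). The two timestamps are examples: 9 March 2017 was a Thursday and 12 March 2017 a Sunday.

```r
pickup <- as.POSIXct(c("2017-03-09 21:30:11", "2017-03-12 05:10:00"), tz = "UTC")
lt <- as.POSIXlt(pickup, tz = "UTC")

# Base R counts Sunday as 0, so adding 1 gives the Sunday = 1 convention
day_index   <- lt$wday + 1
hour_of_day <- lt$hour

print(data.frame(pickup, day_index, hour_of_day))
```

With this convention, day 1 is Sunday and day 7 is Saturday.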

This gives us an idea of how the distribution of trips is explained by time. However, how is it related to the tip amount? Let’s visualize it:

m4 <- fusion %>%
  mutate(hpick = hour(tpep_dropoff_datetime),
         wday = factor(wday(tpep_dropoff_datetime))) %>%
  group_by(hpick, wday) %>%
  count() %>%
  ggplot(aes(hpick, n, color = wday)) +
  geom_line(size = 1.5) +
  labs(x = "Time [Hour]", y = "count")+ 
  theme_classic()

m5 <- fusion %>%
  group_by(wday = wday(tpep_dropoff_datetime), 
           hour = hour(tpep_dropoff_datetime), label = TRUE) %>%
  summarise(median_tip = median(tip_amount)) %>%
  ggplot(aes(hour, wday, fill = median_tip)) +
  geom_tile() +
  labs(x = "Time [Hour]", y = "Day of the week") +
  scale_fill_distiller(palette = "Spectral")+ 
  theme_classic()

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(m4,m5,layout=layout)

We observe that there are differences between the days of the week. The days present a similar distribution, with decreased activity on Saturday and Sunday mornings. On Thursday, the number of trips is significantly higher from 6:00 pm to 10:00 pm.

We could summarize the tip amounts during the week as follows:

Tip amounts change significantly from weekend to midweek.
In the midweek, the range 8:00 am - 10:00 am presents high tip amounts. The same happens from 4:00 pm to midnight.
On the weekend, the range 8:00 pm - 4:00 am is the best for tips. During the rest of Saturday and Sunday, the amounts decrease substantially.
The lowest tip amounts in general are given between 3:00 am and 5:00 am in the midweek.

These could be relevant points for a new public transport company, given that the number of cars circulating must be higher at peak hours and lower on weekends until 8:00 pm. The incentive for the drivers would be a higher tip for working at night on the weekend or from 4:00 pm in the midweek.

3. MODELLING

3.1. Preparing the data

Once we have done the exploration of the data, we will build a model to predict tip amounts.

With the following script, we transform the variables considering the information included in the data dictionary and extracted from the figures and the summary. We set all invalid values to NA in order to process them later:

Passenger count must be at most 8, given the cars currently authorized to circulate in NYC as yellow taxis.
The RatecodeID variable only includes rates from 1 to 6.
The variable extra has only 3 possible values, which are 0, 0.5 and 1, according to rush hour and overnight charges.
The same happens with the improvement_surcharge variable (0 or 0.3) and mta_tax (0 or 0.5).
fare_amount and total_amount include negative and implausibly low records. It seems reasonable that a trip costs at least 2 dollars and at most 400 dollars, so values outside this range are set to NA. tip_amount must not be negative.
We keep only records paid by credit card.
We also filter by duration, with a minimum of 2 minutes and a maximum of 3 hours, to discard the implausibly short and long trips included in the dataset.
Finally, we change the class of the variables in a separate step for better treatment of NAs.

fusion <- fusion %>% mutate(passenger_count = ifelse(passenger_count %in% c("1","2","3","4","5","6","7","8"),passenger_count,NA)) %>%
  mutate(RatecodeID = ifelse(RatecodeID %in% c("1","2","3","4","5","6"),RatecodeID,NA)) %>%
  mutate(extra = ifelse(extra %in% c("0","0.5","1"),extra,NA)) %>%
  mutate(mta_tax = ifelse(mta_tax %in% c("0","0.5"),mta_tax,NA)) %>%
  mutate(improvement_surcharge = ifelse(improvement_surcharge %in% c("0","0.3"),
                                        improvement_surcharge,NA)) %>%
  mutate(fare_amount = ifelse(fare_amount >= 2 & fare_amount <= 400,fare_amount,NA)) %>%
  mutate(trip_distance = ifelse(trip_distance >= 0.2 & trip_distance <= 40,trip_distance,NA)) %>%
  mutate(total_amount = ifelse(total_amount >= 2 & total_amount <= 400,total_amount,NA)) %>%
  filter(tip_amount >= 0 & tip_amount <= 150) %>%
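  # Explicit units: a plain subtraction lets R choose secs, mins, hours or days
  mutate(duration = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "secs"))) %>%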
  filter(duration >= 120 & duration <= 10800) %>%
  filter(payment_type == 1)
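A word of caution about durations: when two date-times are subtracted, R chooses the units of the result automatically (seconds, minutes, hours or days), so comparing the result with thresholds such as 120 and 10800 is only safe if the units are known to be seconds. A minimal sketch using the first trip shown in the structure output above:

```r
pickup  <- as.POSIXct("2017-03-09 21:30:11", tz = "UTC")
dropoff <- as.POSIXct("2017-03-09 21:44:20", tz = "UTC")

auto_diff <- dropoff - pickup   # units chosen automatically by R
secs <- as.numeric(difftime(dropoff, pickup, units = "secs"))

print(units(auto_diff))   # the automatic choice for this trip
print(secs)               # duration in seconds, independent of that choice

# Keep trips between 2 minutes and 3 hours
keep <- secs >= 120 & secs <= 10800
```

Requesting `units = "secs"` explicitly makes the filter independent of whatever units R would have picked.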

# Set the class of each variable: codes become factors, timestamps become date-times.
fusion <- fusion %>%
  mutate(VendorID = factor(VendorID),
         tpep_pickup_datetime = ymd_hms(tpep_pickup_datetime),
         tpep_dropoff_datetime = ymd_hms(tpep_dropoff_datetime),
         passenger_count = factor(passenger_count),
         RatecodeID = factor(RatecodeID),
         PULocationID = factor(PULocationID),
         DOLocationID = factor(DOLocationID),
         payment_type = factor(payment_type),
         extra = factor(extra),
         mta_tax = factor(mta_tax),
         improvement_surcharge = factor(improvement_surcharge),
         # Duration in seconds, with explicit units.
         duration = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime,
                                        units = "secs")))
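To make the cleaning rules concrete, here is a minimal sketch on a made-up data frame. It uses base R only, and the `toy` values are invented for illustration; it shows the two patterns used in the script (a whitelist check with `%in%` and a range check), not the real dataset.

```r
# Toy data with a few invalid records (values are made up for illustration).
toy <- data.frame(
  passenger_count = c("1", "0", "9", "4"),
  fare_amount     = c(12.5, -3, 7, 950),
  stringsAsFactors = FALSE
)

# Whitelist check: keep only the allowed codes, set anything else to NA.
valid_passengers <- as.character(1:8)
toy$passenger_count <- ifelse(toy$passenger_count %in% valid_passengers,
                              toy$passenger_count, NA)

# Range check: keep values inside [2, 400], set anything else to NA.
toy$fare_amount <- ifelse(toy$fare_amount >= 2 & toy$fare_amount <= 400,
                          toy$fare_amount, NA)

print(toy)
```

Rows 2 and 3 lose their passenger count and rows 2 and 4 lose their fare, while the rows themselves are kept, so the imputation step can fill them in later.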

We separate the target variable (tip_amount) from the input variables in order to process the NAs:

varObjCont<-fusion$tip_amount
input<-fusion %>% 
           dplyr::select(-tip_amount)

We need to make sure that no observation and no variable has 50% or more missing values. Those that do are dropped:

input$prop_missings<-rowMeans(is.na(input))
summary(input$prop_missings)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000000 0.0004176 0.0000000 0.3529412

prop_missingsVars<-colMeans(is.na(input)) 

varObjCont<-varObjCont[input$prop_missings < 0.5]
input <- subset(input, prop_missings < 0.5, select=names(prop_missingsVars)[prop_missingsVars<0.5])
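The same filtering logic can be checked on a tiny example. This is a sketch with an invented data frame (`toy`), assuming the same 0.5 threshold: `rowMeans(is.na(...))` gives the share of missing values per observation and `colMeans(is.na(...))` the share per variable.

```r
# Toy data frame with missing values (made up for illustration).
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 7),
                  c = c(1, NA, 3, 4))

row_missing <- rowMeans(is.na(toy))  # share of NAs per observation
col_missing <- colMeans(is.na(toy))  # share of NAs per variable

# Keep observations and variables with less than 50% missing values.
keep_rows <- row_missing < 0.5
keep_cols <- names(col_missing)[col_missing < 0.5]
filtered  <- toy[keep_rows, keep_cols, drop = FALSE]

print(row_missing)
print(col_missing)
print(filtered)
```

Here the second observation is entirely missing and is dropped, and variables `a` and `b` sit exactly at 50%, so they are dropped too because the comparison is strict.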

We impute the missing values: the mean for numeric variables and a random draw from the observed values for categorical variables:

input[,as.vector(which(sapply(input, class)=="numeric"))]<-sapply(Filter(is.numeric, input),function(x) impute(x,mean))
input[,as.vector(which(sapply(input, class)=="factor"))]<-sapply(Filter(is.factor, input),function(x) impute(x,"random"))
input[,as.vector(which(sapply(input, class)=="character"))] <- lapply(input[,as.vector(which(sapply(input, class)=="character"))] , factor)
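As an illustration of the two imputation strategies, here is a base R sketch. The helpers `impute_mean` and `impute_random` are hypothetical stand-ins written for this example; they are not the `impute` function used in the script, although they follow the same idea.

```r
set.seed(1234)

# Toy data with one missing value per column (made up for illustration).
toy <- data.frame(
  fare   = c(10, NA, 14, 12),
  vendor = factor(c("1", "2", NA, "2"))
)

# Mean imputation: replace NAs with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Random imputation: replace NAs with values drawn from the observed ones.
impute_random <- function(x) {
  missing <- is.na(x)
  x[missing] <- sample(x[!missing], sum(missing), replace = TRUE)
  x
}

toy$fare   <- impute_mean(toy$fare)
toy$vendor <- impute_random(toy$vendor)

print(toy)
```

Mean imputation keeps the average of the column unchanged, while random imputation roughly preserves the observed frequencies of each category.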

Finally, with the information we already have, we can select the relevant variables to generate different models. In addition, we build a correlation matrix to identify possible good predictors. For this, the values must be converted to numeric:

fuscor <- data.frame(varObjCont,input)

fuscor <- fuscor %>%
  # as.integer() on a factor returns level codes, so we go through as.character()
  # to recover the original values.
  mutate(hpick = as.integer(hour(tpep_pickup_datetime)),
         hdrop = as.integer(hour(tpep_dropoff_datetime)),
         mpick = as.integer(month(tpep_pickup_datetime)),
         mdrop = as.integer(month(tpep_dropoff_datetime)),
         passenger_count = as.integer(as.character(passenger_count)),
         DOLocationID = as.integer(as.character(DOLocationID)),
         PULocationID = as.integer(as.character(PULocationID)),
         VendorID = as.integer(as.character(VendorID)),
         RatecodeID = as.integer(as.character(RatecodeID)),
         duration = as.integer(duration)) %>%
  dplyr::select(hpick,
                hdrop,
                mpick,
                mdrop,
                passenger_count,
                DOLocationID,
                PULocationID,
                VendorID,
                RatecodeID,
                total_amount,
                fare_amount,
                trip_distance,
                varObjCont)

fuscor %>%
  cor(use="complete.obs", method = "spearman") %>%
  corrplot(method="pie", diag=FALSE)
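Two details of this step are easy to get wrong, so here is a short base R sketch with invented values. First, `as.integer()` applied directly to a factor returns the internal level codes, not the labels. Second, Spearman correlation works on ranks, so it captures any monotonic relationship, not only linear ones.

```r
# 1) Converting a factor of IDs to numbers (IDs are made up for illustration).
ids <- factor(c("236", "48", "236", "161"))
codes  <- as.integer(ids)                # level codes, not the IDs
values <- as.integer(as.character(ids))  # the original IDs
print(codes)
print(values)

# 2) Spearman vs Pearson on a monotonic but non-linear relationship.
x <- c(1, 2, 3, 4, 5)
y <- x^3
spearman <- cor(x, y, method = "spearman")
pearson  <- cor(x, y, method = "pearson")
print(c(spearman = spearman, pearson = pearson))
```

Note that zone and vendor identifiers are nominal codes, so their correlation coefficients should be read with caution whichever conversion is used.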

As we want to build a predictive model, it is relevant that fare amount, total amount and trip distance are positively correlated with tip amount. This can be seen in the correlation plot above. It is also interesting that RatecodeID is positively correlated with passenger count, which is reasonable.

Furthermore, when selecting features we need to take into account that pickup hour and dropoff hour (and likewise pickup month and dropoff month) are correlated with each other. This is perfectly understandable, given that they normally coincide and therefore carry the same information. VendorID might be a good predictor for other variables, but it does not seem to have much capacity to predict tip amount. We will keep this in mind when choosing combinations of features for modelling:

input <- input %>%
  mutate(hpick = factor(hour(tpep_pickup_datetime))) %>%
  mutate(hdrop = factor(hour(tpep_dropoff_datetime))) %>%
  mutate(mpick = factor(month(tpep_pickup_datetime))) %>%
  mutate(mdrop = factor(month(tpep_dropoff_datetime))) %>%
  # Assign the result back to passenger_count; converting through as.character()
  # keeps the original values instead of the factor level codes.
  mutate(passenger_count = as.integer(as.character(passenger_count))) %>%
  dplyr::select(DOLocationID,
                PULocationID,
                passenger_count,
                trip_distance,
                RatecodeID,
                duration,
                total_amount, fare_amount, hpick, hdrop, mpick, mdrop)

fusion<- data.frame(varObjCont,input)

3.2. Building the models

First, we use the linear regression algorithm. We generate 4 models and compare the results. The models include the following variables:

Model 1: Total amount, fare amount and trip distance

Model 2: Total amount and fare amount

Model 3: Total amount, fare amount and passenger count

Model 4: Total amount, fare amount and RatecodeID

rm(input)

set.seed(1234)
trainIndex <- createDataPartition(fusion$varObjCont, p=0.75, list=FALSE)
data_train <- fusion[trainIndex,]
data_test <- fusion[-trainIndex,]



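# Predictors are joined with '+' inside the formula. Written after a comma, a
# variable is matched to lm()'s `subset` argument instead of entering the model
# (a symptom of that is identical cross-validation scores for models 2 to 4).
model1 <- lm(varObjCont ~ total_amount + fare_amount + trip_distance,
             data = data_train)

model2 <- lm(varObjCont ~ total_amount + fare_amount,
             data = data_train)

model3 <- lm(varObjCont ~ total_amount + fare_amount + passenger_count,
             data = data_train)

model4 <- lm(varObjCont ~ total_amount + fare_amount + RatecodeID,
             data = data_train)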
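A note on how `lm()` reads its arguments, since it affects how several predictors must be written. The sketch below uses synthetic data (the names `toy`, `x1`, `x2` and `y` are invented): every predictor has to be joined with `+` inside the formula, because anything placed after a comma is treated as a separate argument of `lm()`.

```r
set.seed(1234)

# Synthetic data: y depends on both x1 and x2.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 0.1)

# Intended model: both predictors inside the formula.
fit_plus <- lm(y ~ x1 + x2, data = toy)

# Comma instead of '+': the formula is only y ~ x1. Because `data` is named,
# the unnamed second argument is matched to `subset`, so it filters rows.
fit_comma <- lm(y ~ x1, x2 > 0, data = toy)

print(names(coef(fit_plus)))
print(names(coef(fit_comma)))
print(c(plus = nobs(fit_plus), comma = nobs(fit_comma)))
```

The second fit has no coefficient for `x2` and is estimated on fewer rows, which is rarely what is intended when adding a predictor.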

We use a loop to run repeated cross-validation on each model and compare the results:

total <- c()
models <- list(model1, model2, model3, model4)
formulaModels <- sapply(models, formula)
for (i in seq_along(models)) {
  set.seed(1234)  # same resampling folds for every model
  # Repeated cross-validation: 3 folds, repeated 2 times
  vcr <- train(as.formula(formulaModels[[i]]), data = data_train,
               method = "lm",
               trControl = trainControl(method = "repeatedcv", number = 3, repeats = 2,
                                        returnResamp = "all"))

  # Collect the R-squared of every resample, labelled with the model name
  total <- rbind(total, data.frame(Rsquared = vcr$resample$Rsquared,
                                   model = rep(paste("Model", i), nrow(vcr$resample))))
}

ggplot(total, aes(x=model, y=Rsquared, fill=model)) + 
    geom_boxplot(alpha=0.5) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="Dark2")
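To show what the cross-validation loop is doing under the hood, here is a minimal base R sketch of a single round of 3-fold cross-validation on synthetic data. The data and variable names are invented, and R² is computed as the squared correlation between predictions and observations, which I assume matches what caret reports by default.

```r
set.seed(1234)

# Synthetic data: a linear relationship plus noise.
n <- 300
toy <- data.frame(x = runif(n, 0, 50))
toy$y <- 0.2 * toy$x + rnorm(n)

k <- 3
fold <- sample(rep(seq_len(k), length.out = n))  # random fold for each row

rsq <- numeric(k)
for (f in seq_len(k)) {
  train_set <- toy[fold != f, ]   # fit on the other folds
  test_set  <- toy[fold == f, ]   # evaluate on the held-out fold
  fit  <- lm(y ~ x, data = train_set)
  pred <- predict(fit, newdata = test_set)
  rsq[f] <- cor(pred, test_set$y)^2
}

print(rsq)
print(c(mean = mean(rsq), sd = sd(rsq)))
```

Repeating this procedure with different random fold assignments gives the "repeated" part of repeated cross-validation; the mean and standard deviation over all resamples are what we compare between models.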

3.3. Comparing models

CV Mean

    aggregate(Rsquared~model, data = total, mean)

##     model  Rsquared
## 1 Model 1 0.8627817
## 2 Model 2 0.8621051
## 3 Model 3 0.8621051
## 4 Model 4 0.8621051

CV Standard deviation

    aggregate(Rsquared~model, data = total, sd)

##     model    Rsquared
## 1 Model 1 0.001613091
## 2 Model 2 0.001626418
## 3 Model 3 0.001626418
## 4 Model 4 0.001626418

Model 1

predictions <- predict(model1,data_test[,-1])
print(paste0('R2 for test subset: ',caret::postResample(predictions, data_test$varObjCont)['Rsquared'] ))

## [1] "R2 for test subset: 0.855840709662028"

print(paste0('MAE for test subset: ' , mae(data_test$varObjCont,predictions) ))

## [1] "MAE for test subset: 0.488030884597607"

print(paste0('MSE for test subset: ' ,caret::postResample(predictions, data_test$varObjCont)['RMSE']^2 ))

## [1] "MSE for test subset: 0.999106996541211"

Model 2

predictions <- predict(model2,data_test[,-1])
print(paste0('R2 for test subset: ',caret::postResample(predictions, data_test$varObjCont)['Rsquared'] ))

## [1] "R2 for test subset: 0.85490271922214"

print(paste0('MAE for test subset: ' , mae(data_test$varObjCont,predictions) ))

## [1] "MAE for test subset: 0.492283424066824"

print(paste0('MSE for test subset: ' ,caret::postResample(predictions, data_test$varObjCont)['RMSE']^2 ))

## [1] "MSE for test subset: 1.0056208535806"

Model 3

predictions <- predict(model3,data_test[,-1])
print(paste0('R2 for test subset: ',caret::postResample(predictions, data_test$varObjCont)['Rsquared'] ))

## [1] "R2 for test subset: 0.845242489456251"

print(paste0('MAE for test subset: ' , mae(data_test$varObjCont,predictions) ))

## [1] "MAE for test subset: 0.716539151732763"

print(paste0('MSE for test subset: ' ,caret::postResample(predictions, data_test$varObjCont)['RMSE']^2 ))

## [1] "MSE for test subset: 3.48131448649929"

Model 4

predictions <- predict(model4,data_test[,-1])
print(paste0('R2 for test subset: ',caret::postResample(predictions, data_test$varObjCont)['Rsquared'] ))

## [1] "R2 for test subset: 0.840978843100347"

print(paste0('MAE for test subset: ' , mae(data_test$varObjCont,predictions) ))

## [1] "MAE for test subset: 0.692547340307454"

print(paste0('MSE for test subset: ' ,caret::postResample(predictions, data_test$varObjCont)['RMSE']^2 ))

## [1] "MSE for test subset: 3.36667966361824"
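For reference, the three metrics reported for each model can be computed by hand. The sketch below uses four invented observation and prediction pairs; the `_value` names are only there to avoid clashing with any `mae` function already loaded, and R² is again taken as the squared correlation (an assumption about how the reported value is defined).

```r
# Invented observed and predicted tips, for illustration only.
obs  <- c(2.0, 3.5, 1.0, 4.0)
pred <- c(2.5, 3.0, 1.5, 3.0)

err <- obs - pred

mae_value  <- mean(abs(err))     # mean absolute error
mse_value  <- mean(err^2)        # mean squared error
rmse_value <- sqrt(mse_value)    # root mean squared error
r2_value   <- cor(pred, obs)^2   # squared correlation

print(c(MAE = mae_value, MSE = mse_value, RMSE = rmse_value, R2 = r2_value))
```

MAE and RMSE are in the same units as the tip, while MSE penalizes large errors more heavily because the errors are squared.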

3.4. Interpretation

We select Model 1 on the basis of three metrics on the test subset: R² (the proportion of the variance of the target variable that is explained by the independent variables), which is the highest of the four models; MSE (the average of the squared errors), which is the lowest; and MAE (the average of the absolute errors), which is also the lowest. Model 1 takes into consideration total amount, fare amount and trip distance. These are the coefficients:

coef(model1)

##   (Intercept)  total_amount   fare_amount trip_distance 
##   -0.11882271    0.54187857   -0.48880403   -0.05153382
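As a worked example of how these coefficients turn into a prediction, the sketch below applies them to a hypothetical trip. The trip values (a total of 20, a fare of 15 and a distance of 3) are invented for illustration; only the coefficients come from the output above.

```r
# Coefficients of Model 1, copied from the output above.
coefs <- c(intercept     = -0.11882271,
           total_amount  =  0.54187857,
           fare_amount   = -0.48880403,
           trip_distance = -0.05153382)

# Hypothetical trip (values invented for illustration).
trip <- c(total_amount = 20, fare_amount = 15, trip_distance = 3)

# Linear prediction: intercept plus the sum of coefficient * value.
predicted_tip <- coefs[["intercept"]] + sum(coefs[names(trip)] * trip)

print(round(predicted_tip, 2))  # 3.23
```

The same result is what `predict(model1, newdata = ...)` would return for a one-row data frame with these three values.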

importanceVariables(model1)

The total amount is the most important feature, according to the ANOVA method. The trip distance gives less information but helps the general performance of the model.

We also applied other machine learning algorithms, such as neural networks, random forest and support vector machines. They add complexity and computational cost, but the improvement in prediction results is only in the range of 1-3 points. This is due to the characteristics of the data: only three variables are correlated with the tip amount, and these variables are also correlated with each other. So we select Model 1 as the model to pass to production.

4. CONCLUSIONS

In conclusion, Model 1 presents the best results. The following figure shows the notable accuracy of our model on the predictions for 1000 observations:

We can summarize the following points:

Tip amount is related to dropoff zones far from Manhattan, possibly where public transport is less effective.
Tip amount presents a strong positive correlation with fare amount, total amount and trip distance.
The hours with higher tip amounts differ between weekends (from 8:00 pm) and weekdays (mornings and from 6:00 pm to midnight).
Generally, the worst time range for tips is 3:00 am - 5:00 am.
Fare amount and total amount have a large capacity to predict the tip. These amounts include information on the MTA tax, extra, improvement surcharge and tolls amount.
Of all the models we created, Model 1 shows the best results in terms of R2, MSE and MAE. This model includes the variables fare amount, total amount and trip distance.

Like every model, the one presented here has some limitations and room for improvement. First, it only applies to card payments, so cash is not included and could be an important variable. A solution is to collect information on cash payments. Furthermore, for better predictions, I would suggest obtaining the exact coordinates of the pickup and dropoff places, in order to carry out more geospatial studies. For instance, we could apply clustering algorithms.

Other variables, such as the weather, could also be important. Demand for taxi services is possibly strongly correlated with weather changes, and this could affect the tip amount. The same happens with the NYC subway: given that it is a 24-hour service, schedule changes and cancellations are common. Important cancellations of key subway lines could affect the number of trips and the tip amount.

However, the model presented here is simple, accurate and not computationally demanding when predicting the tip amount in NYC taxi trips.

Finally, it is common to feel the need to express our gratitude, in this case to the taxi driver. Whatever their private circumstances, the vast majority of people take into consideration, consciously or not, the three variables we identified in this study. To answer the question with which we started this study: passengers are very complex in many senses, but at the end of the day we can learn from our behavior and know ourselves better. We all have different motivations, contradictions and perceptions, but patterns also define us.