US flight.

INTRODUCTION


When was the last time your flight was delayed? How did you feel at that moment?

Airlines should take care of their passengers. A good reputation can have a real effect on sales.

Sentiment analysis is a useful tool to measure how passengers feel and how marketing and customer service should adjust their processes.

In this case, I explore the US flights dataset, which contains tweets from passengers of US airlines in 2015. I first explore and visualize the data to gain better insight. Then I analyze sentiment and propose a model to predict positive and negative tweets.

1.1. We load the data

We load the data we will be working with. This dataset was extracted from: https://www.kaggle.com/crowdflower/twitter-airline-sentiment.

## [1] "What @dhepburn said."                                                                                                     
## [2] "plus you've added commercials to the experience... tacky."                                                                
## [3] "I didn't today... Must mean I need to take another trip!"                                                                 
## [4] "it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse"        
## [5] "and it's a really big bad thing about it"                                                                                 
## [6] "seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA"
##   airline_sentiment
## 2          positive
## 4          negative
## 5          negative
## 6          negative
## 7          positive
## 9          positive
##                                                                                                                        text
## 2                                                                 plus you've added commercials to the experience... tacky.
## 4           it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
## 5                                                                                  and it's a really big bad thing about it
## 6 seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA
## 7                                                          yes, nearly every time I fly VX this “ear worm” won’t go away :)
## 9                                                                                          Well, I didn't…but NOW I DO! :-D
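
The loading and filtering steps behind the two outputs above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code that produced the output (which looks like R console output). The inline sample, the `@VirginAmerica` prefix, and the helper names `load_tweets` and `keep_polar` are assumptions made for the example; with the real file you would open the CSV downloaded from the Kaggle page instead of the in-memory sample.

```python
import csv
import io
import re

# Hypothetical sample in a two-column layout (the real CSV has 15 columns).
# The "@VirginAmerica" prefix is an assumption about the raw tweet text.
SAMPLE_CSV = """airline_sentiment,text
neutral,@VirginAmerica What @dhepburn said.
positive,@VirginAmerica plus you've added commercials to the experience... tacky.
negative,@VirginAmerica and it's a really big bad thing about it
"""


def load_tweets(handle):
    """Read rows as dicts and strip the leading @handle from each tweet text."""
    rows = []
    for row in csv.DictReader(handle):
        row["text"] = re.sub(r"^@\w+\s+", "", row["text"])
        rows.append(row)
    return rows


def keep_polar(rows):
    """Keep only tweets labelled positive or negative (drop neutral ones)."""
    return [r for r in rows if r["airline_sentiment"] in ("positive", "negative")]


tweets = load_tweets(io.StringIO(SAMPLE_CSV))
polar = keep_polar(tweets)

print([t["text"] for t in tweets])
print([(t["airline_sentiment"], t["text"]) for t in polar])
```

Only the leading handle is removed, so a mention inside the sentence (such as `@dhepburn`) is kept, as in the first output above.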

1.2. Summary and structure

We first summarize the data to get a preliminary idea of the dataset:

Summary

##     tweet_id         airline_sentiment airline_sentiment_confidence
##  Min.   :5.676e+17   negative:9178     Min.   :0.3350              
##  1st Qu.:5.686e+17   neutral :3099     1st Qu.:0.6923              
##  Median :5.695e+17   positive:2363     Median :1.0000              
##  Mean   :5.692e+17                     Mean   :0.9002              
##  3rd Qu.:5.699e+17                     3rd Qu.:1.0000              
##  Max.   :5.703e+17                     Max.   :1.0000              
##                                                                    
##                 negativereason negativereason_confidence           airline    
##                        :5462   Min.   :0.000             American      :2759  
##  Customer Service Issue:2910   1st Qu.:0.361             Delta         :2222  
##  Late Flight           :1665   Median :0.671             Southwest     :2420  
##  Can't Tell            :1190   Mean   :0.638             United        :3822  
##  Cancelled Flight      : 847   3rd Qu.:1.000             US Airways    :2913  
##  Lost Luggage          : 724   Max.   :1.000             Virgin America: 504  
##  (Other)               :1842   NA's   :4118                                   
##  airline_sentiment_gold          name      
##          :14600         JetBlueNews:   63  
##  negative:   32         kbosspotter:   32  
##  neutral :    3         _mhertz    :   29  
##  positive:    5         otisday    :   28  
##                         throthra   :   27  
##                         rossj987   :   23  
##                         (Other)    :14438  
##                                negativereason_gold retweet_count     
##                                          :14608    Min.   : 0.00000  
##  Customer Service Issue                  :   12    1st Qu.: 0.00000  
##  Late Flight                             :    4    Median : 0.00000  
##  Can't Tell                              :    3    Mean   : 0.08265  
##  Cancelled Flight                        :    3    3rd Qu.: 0.00000  
##  Cancelled Flight\nCustomer Service Issue:    2    Max.   :44.00000  
##  (Other)                                 :    8                      
##      text                                tweet_coord   
##  Length:14640                                  :13621  
##  Class :character   [0.0, 0.0]                 :  164  
##  Mode  :character   [40.64656067, -73.78334045]:    6  
##                     [32.91792297, -97.00367737]:    3  
##                     [40.64646912, -73.79133606]:    3  
##                     [18.22245647, -63.00369733]:    2  
##                     (Other)                    :  841  
##                    tweet_created          tweet_location
##  2015-02-24 09:54:34 -0800:    5                 :4733  
##  2015-02-24 11:43:05 -0800:    4   Boston, MA    : 157  
##  2015-02-23 06:57:24 -0800:    3   New York, NY  : 156  
##  2015-02-23 10:58:58 -0800:    3   Washington, DC: 150  
##  2015-02-23 14:18:58 -0800:    3   New York      : 127  
##  2015-02-23 15:25:46 -0800:    3   USA           : 126  
##  (Other)                  :14619   (Other)       :9191  
##                     user_timezone 
##                            :4820  
##  Eastern Time (US & Canada):3744  
##  Central Time (US & Canada):1931  
##  Pacific Time (US & Canada):1208  
##  Quito                     : 738  
##  Atlantic Time (Canada)    : 497  
##  (Other)                   :1702

Structure

## 'data.frame':    14640 obs. of  15 variables:
##  $ tweet_id                    : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment           : Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...
##  $ airline_sentiment_confidence: num  1 0.349 0.684 1 1 ...
##  $ negativereason              : Factor w/ 11 levels "","Bad Flight",..: 1 1 1 2 3 3 1 1 1 1 ...
##  $ negativereason_confidence   : num  NA 0 NA 0.703 1 ...
##  $ airline                     : Factor w/ 6 levels "American","Delta",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ airline_sentiment_gold      : Factor w/ 4 levels "","negative",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ name                        : Factor w/ 7701 levels "___the___","__betrayal",..: 1073 3477 7666 3477 3477 3477 1392 5658 1874 7665 ...
##  $ negativereason_gold         : Factor w/ 14 levels "","Bad Flight",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ retweet_count               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ text                        : chr  "What @dhepburn said." "plus you've added commercials to the experience... tacky." "I didn't today... Must mean I need to take another trip!" "it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recourse" ...
##  $ tweet_coord                 : Factor w/ 833 levels "","[-33.87144962, 151.20821275]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tweet_created               : Factor w/ 14247 levels "2015-02-16 23:36:05 -0800",..: 14212 14170 14169 14168 14166 14165 14164 14160 14158 14106 ...
##  $ tweet_location              : Factor w/ 3082 levels "","'Greatness has no limits'",..: 1 1 1465 1 1 1 2407 1529 2389 1529 ...
##  $ user_timezone               : Factor w/ 86 levels "","Abu Dhabi",..: 32 64 29 64 64 64 64 64 64 32 ...

2.1. Sentiment Analysis

Now that we know more about the structure of the dataset, let’s apply sentiment analysis to the text of the tweets.

## Joining, by = "word"

As we observe, “delayed” is the most common negative word. “Lost”, “miss”, “bad”, “worst”, “issue” and “problems” are other negative words with remarkable frequency. Other words related to customer service, such as “rude”, “frustrated”, “disappointed” or “unacceptable”, also have an important presence. Among the positive words, “thank”, “like”, “love” and “good” are the most frequent.
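
The word counts described above come from matching each word of each tweet against a positive/negative lexicon (the "Joining, by = "word"" message). A minimal Python sketch of that idea follows. The lexicon is a tiny hand-made stand-in, not the real lexicon, and the tweets and the function names `tokenize` and `sentiment_word_counts` are hypothetical.

```python
import re
from collections import Counter

# Toy stand-in for a positive/negative lexicon; NOT the real lexicon contents.
LEXICON = {
    "delayed": "negative",
    "lost": "negative",
    "worst": "negative",
    "rude": "negative",
    "thank": "positive",
    "love": "positive",
    "good": "positive",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_word_counts(texts, lexicon):
    """Count (word, sentiment) pairs; words missing from the lexicon are dropped."""
    counts = Counter()
    for text in texts:
        for word in tokenize(text):
            if word in lexicon:  # behaves like an inner join on the word
                counts[(word, lexicon[word])] += 1
    return counts


# Hypothetical tweets, for illustration only.
tweets = [
    "Flight delayed again, worst airline",
    "Delayed two hours and they lost my bag",
    "Thank you, love the crew",
]

counts = sentiment_word_counts(tweets, LEXICON)
for (word, sentiment), n in counts.most_common(3):
    print(word, sentiment, n)
```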

However, sentiments are complex, and a purely positive/negative classification cannot identify other emotions. For this reason we also apply an NRC sentiment analysis. This lexicon tags each word as belonging, or not (“yes/no”), to each of several sentiment categories:

## Joining, by = "word"

With this analysis, we see that positive is clearly the most common sentiment, with negative in second place. The rest follow in descending order: trust, anticipation, sadness, joy, fear, anger, surprise and disgust. Each airline must decide which sentiment it wants to promote, aligned with the image it wants to project in the market. There is, however, a common goal: negative comments should be reduced, and comments expressing trust or joy should gain weight in relative terms.
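
What distinguishes this approach from a plain positive/negative lookup is that one word can count toward several categories at once. A minimal Python sketch of the counting is given below. The word-to-category entries are invented for illustration and are not taken from the real NRC lexicon, and the tweets and the name `category_counts` are hypothetical.

```python
import re
from collections import Counter

# Toy NRC-style lexicon: each word maps to a SET of categories.
# These entries are illustrative assumptions, not the real NRC contents.
NRC_LIKE = {
    "delayed": {"negative", "sadness"},
    "lost": {"negative", "sadness"},
    "love": {"positive", "joy"},
    "thank": {"positive", "trust"},
}


def category_counts(texts, lexicon):
    """Add one to every category a word belongs to, for every word occurrence."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            counts.update(lexicon.get(word, ()))
    return counts


# Hypothetical tweets, for illustration only.
tweets = [
    "Delayed again and they lost my bag",
    "Love the crew, thank you",
]

counts = category_counts(tweets, NRC_LIKE)
for category, n in counts.most_common():
    print(category, n)
```

Because categories overlap, the category totals can add up to more than the number of matched words.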

That was the NRC approach to sentiment analysis. To gather more information, we apply a third approach, the AFINN lexicon:

## Joining, by = "word"

The AFINN lexicon scores words on a scale from -5 (most negative) to 5 (most positive). Positive words are concentrated at scores 1 and especially 2, whereas for negative words the share at -1 is larger. This suggests that very positive words are more common than very negative words.
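
The distribution described above is a tally of how many matched words fall at each score. A minimal Python sketch follows. The scores in `AFINN_LIKE` are illustrative values that may not match the real AFINN lexicon, and the tweets and the name `score_distribution` are hypothetical.

```python
import re
from collections import Counter

# Toy AFINN-style lexicon: word -> integer score in [-5, 5].
# Scores are illustrative assumptions and may differ from the real lexicon.
AFINN_LIKE = {
    "delayed": -1,
    "miss": -2,
    "bad": -3,
    "worst": -3,
    "thank": 2,
    "like": 2,
    "love": 3,
    "good": 3,
}


def score_distribution(texts, lexicon):
    """Count how many word occurrences fall at each lexicon score."""
    dist = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in lexicon:
                dist[lexicon[word]] += 1
    return dist


# Hypothetical tweets, for illustration only.
tweets = [
    "Delayed again, bad service",
    "Thank you, I like the new seats",
    "Delayed but good crew",
]

dist = score_distribution(tweets, AFINN_LIKE)
for score in sorted(dist):
    print(f"{score:+d}: {dist[score]}")
```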

Now that we have an idea of the sentiment expressed in the dataset, we can build a model for prediction. The model classifies tweets as positive or negative based on the words they contain.
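
One common way to let a classifier use the words of a tweet is a document-term matrix, with one row per tweet and one column per word. The source does not say which features the model used, so the Python sketch below is an assumption about the general technique rather than a description of this study. The tweets, the `min_count` threshold, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts, min_count=2):
    """Keep words that appear in at least `min_count` tweets, sorted alphabetically."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))  # count each word once per tweet
    return sorted(w for w, n in doc_freq.items() if n >= min_count)


def to_document_term_matrix(texts, vocab):
    """One row per tweet, one column per vocabulary word; cells are word counts."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for text in texts:
        row = [0] * len(vocab)
        for word in tokenize(text):
            if word in index:
                row[index[word]] += 1
        matrix.append(row)
    return matrix


# Hypothetical tweets, for illustration only.
tweets = [
    "flight delayed again",
    "delayed flight and rude staff",
    "great flight thank you",
]

vocab = build_vocabulary(tweets)
matrix = to_document_term_matrix(tweets, vocab)
print(vocab)
print(matrix)
```

Each row of `matrix`, paired with the tweet's sentiment label, could then be passed to a classifier.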

3. BUILDING A MODEL


4. CONCLUSIONS


We can conclude different relevant points discovered through the study:

  • We have a representative dataset of tweets about different airlines. The airlines with the highest number of tweets are United, US Airways and American Airlines.

  • The vast majority of the tweets are not retweeted.

  • From the wordcloud, the most frequent negative word is “delayed”. This seems to be one of the most important drivers of negative sentiment. Other words that may be related to customer service are also relevant, such as “rude”, “disappointed” or “frustrating”.

  • Under the NRC lexicon, positive words are the most numerous. However, negative words form the second-largest group, which should be taken into consideration. Sadness is also a common sentiment. Airlines should be aware of this and try to reduce it.

  • The positive comments are very positive, whereas the negative comments are not that negative. One interpretation is that, in most cases, negative comments are expressed with more neutral words.

  • The models built with Random Forest and Support Vector Machine show similarly high specificity and low sensitivity. The Support Vector Machine gives a slightly better result.

  • Given the confusion matrix (and the costs that could be derived from it), the model I present predicts negative tweets well, since few negative tweets are misclassified. It has difficulty detecting positive tweets, because positive tweets are far less numerous in the data.
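
The sensitivity and specificity mentioned in the conclusions can be computed from a confusion matrix as sketched below in Python. The confusion-matrix counts are hypothetical, chosen only to show a high-specificity, low-sensitivity pattern; they are not the results of the Random Forest or Support Vector Machine models. The sketch also assumes that positive tweets are treated as the positive class. The class counts 9178 and 2363 are taken from the summary output in section 1.2.

```python
def sensitivity(tp, fn):
    """Share of truly positive tweets that the model labels positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of truly negative tweets that the model labels negative."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts (NOT the results reported in this study).
tp, fn = 40, 60    # positive tweets: correctly detected / missed
tn, fp = 380, 20   # negative tweets: correctly detected / labelled positive

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

# Class balance among non-neutral tweets, using counts from the summary output.
n_negative, n_positive = 9178, 2363
positive_share = n_positive / (n_negative + n_positive)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} positive share={positive_share:.2f}")
```

With roughly one positive tweet for every four negative ones, a model can reach high specificity while still missing many positive tweets, which is the pattern described in the last two points.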