Building A Hotel Booking Cancellation Prediction App

How to build a Streamlit app to predict booking cancellation

Posted by Mohd Shah on Tuesday, August 16, 2022

Impact of COVID-19 on the hotel industry

In the last 2 years, the tourism industry was dealt a heavy economic blow with the onset of COVID-19. In the span of 3 months, the virus was able to bring the global hotel occupancy rate down to 13.3% since May 2020, which is an 82.3% drop from the same time the previous year.

However, with global vaccination rates inching closer to 70%, most governments around the world has loosen COVID-19 restrictions in an effort to boost the country’s economic recovery. As tourist begin to flock into the various countries around the world, the hotel industry is once again inundated with online bookings from all over the world.

Most of them are done through online travel agencies (OTA) which makes up over 90% of all the bookings that a hotel receives. However, a study in 2019 by Esther Hertzfield reported that on average 40% of these bookings are cancelled.

Prior to COVID-19, the impact of these cancellations were damaging, and they are usually brought on by aggressive marketing strategies from competing hoteliers who wish to poach customers away from their prior bookings by offering cheaper prices through online marketing. Given that most online travel agencies (OTA) offer free cancellations to incentivise sales, customers can make cancellations for any prior bookings at any time with virtually no consequences. The impact of risk free booking are;

  1. It makes it much more difficult for hotels to create a demand forecast
  2. A cancelled booking is ultimately lost revenue

Using machine learning algorithms hoteliers can minimize the loss of revenue by being able to predict them beforehand. Using the information from an online booking as the input variable, the hotelier will be able to assess the probability of a booking being cancelled. Such insight would allow the hotelier to better allocate their marketing resources towards these high risk bookings.

Incentives such as providing discounted rates, cashback, or extra amenities can be targeted towards those customers to lower the cancellation rate. In addition, hoteliers will also be able to conduct a much more accurate demand forecasting by taking the high risk bookings into account.

The dataset used for this project was obtained from a real hotel located in Portugal. It was made available by Antonio et al. as a way to provide more real world data of hotels to improve research and education in the industry.

Feature Selection

The following 8 attributes were chosen to be the independent variables for the machine learning model. They were chosen based on their feature importance score and the generalized nature of the attribute.

Feature importance of hotel booking dataset attributes
Feature importance of hotel booking dataset attributes

Model Selection

The model selection phase involved comparing 4 different types of models (as shown in Table 3) based on the unique advantages that each of them provides.

No. Chosen Algorithm Reason for Choosing
1 Logistic Regression Logistic regression was chosen due to the high explainability of the algorithm and its simplicity which makes it easy to understand.
2 Support Vector Machine (SVM) SVM was chosen as it is able to achieve better performance than logistic regression and is much less computationally heavy than Random Forest or XGB.
3 Random Forest Classifier Random Forest was chosen as it is able to provide a very good performance and is capable of addressing the issue of overfitting.
4 Extreme Gradient Boosting (XGB) Extreme gradient boosting was chosen due to its high performance capabilities.
Model evaluation results for hotel booking cancellation prediction
Model evaluation results for hotel booking cancellation prediction

In spite of the weak performance of the Logistic Regression model, it was still chosen based on the requirements of the application. Unlike the other algorithms which is only capable of producing a binary output for the dependent variable ‘is_canceled’, the Logistic Regression algorithm is capable of outputting the classification probability. Being able to provide the probability rather than a discrete binary output is much more meaningful for the hotelier as it provides them with more information.

Given the example that if two bookings were both classified as cancelled, the hotelier would not be able to compare much between the two when using SVM, Random Forest or XGB, but a Logistic Regression model might indicate that the first booking was labelled with a confidence of 55% and the second at 89%, providing the hotelier with much better insight.

Another advantage of the Logistic Regression algorithm is that it is a much less computationally heavy algorithm. Given the large size of the dataset, it would take a significantly longer amount of time to train a Random Forest or XGB model. Additionally, while SVM and XGB offer better performance, they are prone to overfitting. While Random Forest is more resistant to overfitting, it unfortunately does not meet the established criteria for this application.

Hotel Booking Cancellation Streamlit App
Hotel Booking Cancellation Streamlit App

The app was built using Streamlit and the source code for the application can be found in GitHub link here while the actual Streamlit application can be viewed here.

comments powered by Disqus