A flight delay occurs when an airline flight takes off and/or lands later than its scheduled time. A flight is usually considered delayed when it runs at least 15 minutes behind schedule.
Imagine this situation. You’re ready to fly off to your destination when the flight schedule board at the airport breaks the news to you:
Your flight is delayed.
Without a doubt, you’re going to be frustrated at first, especially if you are pressed for time.
Then, you’re going to wonder: why is my flight delayed?
Whether it's due to bad weather or other factors that are beyond airline control, flight delays not only irritate air passengers and disrupt their schedules but also cause a decrease in efficiency and increase in capital costs, reallocation of flight crews and aircraft, and additional crew expenses.
These are some reasons why a flight would be delayed.
In this post, I detail a solution that builds a flight delay prediction model using machine learning techniques. Accurate prediction of flight delays will help all players in the air travel ecosystem set up effective action plans to reduce the impact of delays and avoid loss of time, capital, and resources.
The model will predict how long a flight is delayed, in minutes. For this project, I will build different predictive models, following these steps:
Data Exploration
Feature Engineering
Building training/test samples
Model Selection
Model Evaluation
DATA EXPLORATION
The data is taken from Zindi (https://zindi.africa/competitions/ai-tunisia-hack-5-predictive-analytics-challenge-2).
We will use Python to build the predictive model. Let’s begin by first loading the data.
Now read the data.
The pd.read_csv function in the pandas library reads the train and test data. The train dataset has 107,833 rows and 10 columns, while the test dataset has 9,333 rows and 9 columns. The test dataset has one column fewer because it does not include the target variable, which in our case is the delay in minutes.
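The loading step might look like the sketch below. The competition files are not bundled with this post, so the example reads two tiny inline stand-ins through io.StringIO. The rows are made up, and the file name in the comment as well as every column other than DATOP and target are assumptions.

```python
import io

import pandas as pd

# Tiny inline stand-ins for the competition files (rows are made up).
# With the real files you would call something like pd.read_csv("Train.csv");
# that file name is an assumption, as are all columns except DATOP and target.
TRAIN_CSV = """ID,DATOP,DEPSTN,ARRSTN,target
train_id_0,2016-01-03,CMN,TUN,260.0
train_id_1,2016-01-13,MXP,TUN,20.0
train_id_2,2016-07-16,TUN,IST,0.0
"""
TEST_CSV = """ID,DATOP,DEPSTN,ARRSTN
test_id_0,2016-05-04,TUN,ORY
test_id_1,2016-08-21,DJE,TUN
"""

train = pd.read_csv(io.StringIO(TRAIN_CSV))
test = pd.read_csv(io.StringIO(TEST_CSV))

# Compare the shapes and find the column that only the train set has.
print(train.shape, test.shape)
missing_in_test = set(train.columns) - set(test.columns)
print(missing_in_test)
```

Comparing the two column sets is a quick way to confirm that the target is the only column the test set lacks.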
Next, check the data for null values and duplicated rows.
train.duplicated().any()
train.isnull().any()
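A slightly fuller version of these checks is sketched below on a made-up frame. The post does not say how duplicates or missing values were handled, so dropping them here is only one possible choice.

```python
import pandas as pd

# Made-up rows: rows 0 and 1 are exact duplicates, row 3 has a missing date.
train = pd.DataFrame({
    "DATOP": ["2016-01-03", "2016-01-03", "2016-07-16", None],
    "target": [260.0, 260.0, 0.0, 15.0],
})

# any() must be called (with parentheses) to get a result.
has_duplicates = train.duplicated().any()

# Counting nulls per column is often more informative than a yes/no answer.
null_counts = train.isnull().sum()

# One possible clean-up (an assumption, not necessarily what the author did).
train_clean = train.drop_duplicates().dropna().reset_index(drop=True)

print(has_duplicates)
print(null_counts)
print(len(train_clean))
```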
A vital step in data exploration is checking for appropriate data types. Some columns need to be in date-time format, while others need to be changed into categorical or integer data types.
The categorical columns hold non-numeric data such as the arrival station or destination airport.
Columns like DATOP (date of flight) are split into year, month, and day to allow us to gain more insights.
An important point to note is that any step taken in wrangling data should be done in both train and test datasets.
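One way to keep train and test in sync is to put every conversion in a single function and call it on both. The sketch below assumes hypothetical station columns named DEPSTN and ARRSTN; only DATOP is named in this post.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same type conversions to either the train or the test set."""
    out = df.copy()
    # Parse the flight date, then split it into year, month, and day.
    out["DATOP"] = pd.to_datetime(out["DATOP"])
    out["year"] = out["DATOP"].dt.year
    out["month"] = out["DATOP"].dt.month
    out["day"] = out["DATOP"].dt.day
    # DEPSTN and ARRSTN are assumed column names for the station codes.
    for col in ("DEPSTN", "ARRSTN"):
        out[col] = out[col].astype("category")
    return out


# Made-up rows for illustration.
train = pd.DataFrame({
    "DATOP": ["2016-01-03", "2016-07-16"],
    "DEPSTN": ["CMN", "TUN"],
    "ARRSTN": ["TUN", "IST"],
    "target": [260.0, 0.0],
})
test = pd.DataFrame({
    "DATOP": ["2016-05-04", "2016-08-21"],
    "DEPSTN": ["TUN", "DJE"],
    "ARRSTN": ["ORY", "TUN"],
})

train, test = prepare(train), prepare(test)
print(train.dtypes)
```

Because both sets go through the same function, a change made to one cannot be forgotten on the other.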
FEATURE ENGINEERING
In this section, we will create features for the model that predicts flight delays. We will use the features from our data set that contribute significantly to the prediction. We will also group our variables into subcategories, such as numerical and categorical.
A season column will be very useful in prediction.
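A season column can be derived from the flight date, roughly as sketched below. The post does not define the season boundaries, so the mapping uses Northern Hemisphere meteorological seasons as an assumption.

```python
import pandas as pd

# Assumed boundaries: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def add_season(df: pd.DataFrame, date_col: str = "DATOP") -> pd.DataFrame:
    """Return a copy of df with a 'season' column derived from date_col."""
    out = df.copy()
    months = pd.to_datetime(out[date_col]).dt.month
    out["season"] = months.map(SEASON_BY_MONTH)
    return out


# Made-up dates for illustration.
flights = pd.DataFrame({"DATOP": ["2016-01-03", "2016-07-16", "2016-10-01"]})
flights = add_season(flights)
print(flights)
```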
Let us look more closely at how seasons relate to delays.
The seasonal breakdown shows that more flights are taken during summer and that most of those flights are delayed. Summer is the busiest travel season of the year. Because the weather is suitable for outdoor travel, almost everyone makes travel plans. Hence, travel at this time of year means crowded airports, long queues, and crowded beaches.
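A per-season summary of this kind could be computed as in the sketch below. The rows are made up and do not reproduce the real numbers. The season and target column names are assumptions, and a flight counts as delayed at 15 minutes or more, following the definition at the top of this post.

```python
import pandas as pd

# Made-up rows for illustration only.
flights = pd.DataFrame({
    "season": ["summer", "summer", "summer", "winter", "spring"],
    "target": [45.0, 0.0, 120.0, 10.0, 0.0],
})

summary = (
    flights.assign(delayed=flights["target"] >= 15)  # 15-minute threshold
    .groupby("season")
    .agg(
        flights=("target", "size"),        # number of flights per season
        share_delayed=("delayed", "mean"),  # fraction of flights delayed
        mean_delay=("target", "mean"),      # average delay in minutes
    )
    .sort_values("flights", ascending=False)
)
print(summary)
```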
Training and Test Samples
We have analyzed our data and created features from them. Now we will split our data. We split the data to get an idea of how our model will perform on unseen data, which in our case will be test data.
Training: This is used to train our model. In most cases, it is the major portion of the data set and is assigned as training samples. In our case, it's the training dataset.
Test: These samples are kept separate and, once the best model has been selected, are used for final evaluation. In our case, it is the test dataset, which does not have the target variable.
Now we build X and Y input matrices for our machine learning package.
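Building the input matrices can be sketched as follows. The feature list and the rows are hypothetical. Because the test set has no target, the sketch also holds out part of the training data for validation with scikit-learn's train_test_split, which is an addition for illustration rather than a step described in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-prepared data; column names other than target are assumed.
train = pd.DataFrame({
    "month": [1, 1, 7, 7, 8, 12, 3, 6, 7, 10],
    "day": [3, 13, 16, 17, 21, 25, 9, 30, 4, 11],
    "DEPSTN": ["CMN", "MXP", "TUN", "TUN", "DJE", "TUN", "ORY", "TUN", "IST", "TUN"],
    "target": [260.0, 20.0, 0.0, 35.0, 90.0, 0.0, 15.0, 48.0, 120.0, 0.0],
})
test = pd.DataFrame({
    "month": [5, 8],
    "day": [4, 21],
    "DEPSTN": ["TUN", "DJE"],
})

feature_cols = ["month", "day", "DEPSTN"]

X = train[feature_cols]       # input features
y = train["target"]           # delay in minutes
X_test = test[feature_cols]   # same columns, no target available

# Hold out 20% of the labelled rows to estimate performance on unseen data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_valid.shape, X_test.shape)
```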
MODEL SELECTION
After cleaning and preparing the data, we build models with different algorithms, check their performance on unseen data, and select the best one.
Since we are predicting a continuous value, we use a regression algorithm. Regression models are commonly evaluated by calculating the root mean squared error (RMSE) of their predictions.
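For reference, the root mean squared error is the square root of the average squared difference between the true and predicted values. A minimal NumPy sketch, with made-up delays in minutes:

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are -10, 0, and 20 minutes, so the mean squared error is 500 / 3.
score = rmse([0, 30, 60], [10, 30, 40])
print(round(score, 2))
```

Because the errors are squared before averaging, a few very long delays that the model misses can dominate the score.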
Finally, we do cross-validation. The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset.
I have used a CatBoost regressor.
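The cross-validation loop can be sketched with scikit-learn as shown below. To keep the example free of extra dependencies, it uses GradientBoostingRegressor as a stand-in model on synthetic data. This is not the author's exact setup. If the catboost package is installed, CatBoostRegressor follows the same fit/predict interface and could be swapped in.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50.0 * X[:, 0] + rng.normal(scale=5.0, size=200)

# Stand-in model. With catboost installed, one could instead use
# `from catboost import CatBoostRegressor` and CatBoostRegressor(verbose=0).
model = GradientBoostingRegressor(random_state=0)

# Five folds: each fold is held out once while the model trains on the rest.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn returns negated RMSE (higher is better), so flip the sign.
scores = -cross_val_score(
    model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
print(scores.round(2), scores.mean().round(2))
```

A large gap between the fold scores, or between the training error and these scores, is a hint that the model is overfitting.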
MODEL SELECTION: FEATURE IMPORTANCE
To improve our model, it is essential to understand which features matter to it. Doing so can inspire new feature ideas in both high-bias and high-variance cases, reveal the top features, and help us avoid data leakage, which can occur when a column that directly determines the output label is included.
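Reading feature importances from a fitted tree ensemble might look like the sketch below. It uses scikit-learn's GradientBoostingRegressor and synthetic data as a stand-in. CatBoost models expose a comparable get_feature_importance() method. The feature names here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: only "signal" actually drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 40.0 * X["signal"] + rng.normal(scale=5.0, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalised to sum to 1; sort to see the top features first.
importance = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importance)
```

On real data, a single feature with an importance close to 1 deserves a second look, because it may be leaking the target.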
CONCLUSION
The root mean squared error (RMSE) is 113. The following could help improve it:
Additional features, such as the ratio of airport capacity to Tunisia’s population, could be helpful.
High-level features on departure time, such as ‘early morning’, ‘morning’, ‘afternoon’, ‘evening’, and ‘night’, can be created.
Mean encoding of categorical variables with smoothing can also be important where a variable has a large number of distinct values.
Binning rarely occurring values of some categorical variables.
Data on whether a flight is direct or connecting.
Removing bias from the dataset, since some delays are unpredictable, for example those caused by weather uncertainty.
Hyperparameter tuning can also be performed, but for now limited hardware (low computation power) is a constraint.
Ensembling and stacking techniques can also be tried to improve results.
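Three of the ideas above (time-of-day buckets, smoothed mean encoding, and binning rare categories) are sketched below in pandas. The hour boundaries, the smoothing weight, the rare-value threshold, and the STD and DEPSTN column names are all assumptions chosen for illustration.

```python
import pandas as pd


def part_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket; the boundaries are assumed."""
    if 5 <= hour < 8:
        return "early morning"
    if 8 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def smoothed_mean_map(df, col, target, weight=10.0):
    """Per-category target mean, pulled toward the global mean for rare values."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + weight * global_mean) / (
        stats["count"] + weight
    )


def bin_rare(series, min_count=2, other="OTHER"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


# Made-up rows; STD (scheduled departure) and DEPSTN are assumed names.
train = pd.DataFrame({
    "STD": pd.to_datetime([
        "2016-01-03 06:30", "2016-01-03 14:00",
        "2016-07-16 22:15", "2016-07-17 09:45",
    ]),
    "DEPSTN": ["TUN", "TUN", "TUN", "DJE"],
    "target": [10.0, 20.0, 30.0, 100.0],
})

train["part_of_day"] = train["STD"].dt.hour.map(part_of_day)
encoding = smoothed_mean_map(train, "DEPSTN", "target", weight=2.0)
train["DEPSTN_enc"] = train["DEPSTN"].map(encoding)
train["DEPSTN_binned"] = bin_rare(train["DEPSTN"], min_count=2)
print(train)
```

The encoding map should be learned from the training rows only and then applied to the test set. Computing it out-of-fold is a common way to limit target leakage.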
For easy follow up please check the full code here