Machine learning to facilitate social distancing and minimize the spread of COVID-19 in the Montreal métro.
The Société de transport de Montréal (STM) provides some 1.3 million trips a day thanks to a 71-km métro network spread over four lines and composed of 68 stations, combined with a network of 225 bus lines.
It employs more than 10,000 people, making it the 15th largest company in Quebec. The STM provides more than 80% of public transit trips in the Montréal area and more than 70% of all trips made in Québec.
As the backbone of the city, the métro is an extremely popular means of transportation and is prized by Montrealers… then came the COVID-19 pandemic in March 2020.
The general confinement decreed by the Government of Quebec, combined with the generalized teleworking that followed for many Montrealers, meant that the métro literally emptied itself of its passengers.
Like many companies, the STM lost the vast majority of its clientele overnight across its bus and métro network. The STM had to react quickly to this new reality and deal with new and immensely complex challenges.
It must now ensure compliance with health rules, such as wearing masks and maintaining a two-meter distance between customers in the métro and bus network.
This social distancing is easier to respect when the métro is empty, but how do you deal with these rules when people return to work in the midst of a pandemic? How can we gain the confidence of users while providing tools to ensure their safe return to the métro?
Why not predict traffic in train cars using machine learning to allow customers to move to less busy ones?
Technical guidance and algorithm deployment
The STM called on Moov AI to assist them in carrying out a machine learning project to predict traffic in train cars to better respect social distancing and thus minimize the spread of the virus in the métro network.
The purpose of the project for the STM is to inform customers about the number of passengers on the métro’s orange line trains. This data will be displayed both on the website and on the Métrovision screens that display information on the platforms in the stations.
The predicted traffic in each train car is displayed and passengers can position themselves in front of the right doors and choose the less crowded cars.
Very quickly we saw two distinct needs and therefore separated the project into two different models: the “daily” model and the “real-time” model.
- The “daily” model will have the task of giving the “overcrowding” level for each station, each hour, for the following week.
- The “real-time” model will give the “overcrowding” level for each station, every minute, for the next 15 minutes.
We supported and advised the STM’s specialists and data developers in developing the “daily” model and were mandated to carry out the full development of the “real-time” model, which is the subject of this text.
How to calculate traffic
Our team of experts used machine learning to predict the number of passengers for each station on the orange line, every minute, within a 15-minute horizon.
To build the data set needed to make the predictions, we used the so-called “real-time” sources provided by train car telemetry. The cars are equipped with a large number of sensors of all kinds that provide data every minute.
In order to evaluate the number of passengers in the train cars, we used load sensors that measure the weight of said cars. Logically, a heavier car will have more passengers. These data are highly accurate and allowed us to arrive at a margin of error of three passengers.
As a result, we were able to calculate (and predict) the number of people in each of the train cars at each station.
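As a rough illustration of how a load reading can be turned into a head count, here is a minimal sketch. The empty-car weight, the average passenger mass and the function name estimate_passengers are assumptions made up for this example; they are not STM or Azur figures, and the actual conversion used in the project is not described in this article.

```python
# Illustrative constants only: NOT actual STM or Azur specifications.
EMPTY_CAR_KG = 26_000.0    # hypothetical weight of an empty car
AVG_PASSENGER_KG = 75.0    # hypothetical average passenger mass


def estimate_passengers(measured_kg, empty_kg=EMPTY_CAR_KG, avg_kg=AVG_PASSENGER_KG):
    """Estimate the number of passengers from a load-sensor reading."""
    # Sensor noise can push a reading below the empty weight, so clamp at zero.
    payload_kg = max(measured_kg - empty_kg, 0.0)
    return round(payload_kg / avg_kg)


# Hypothetical readings for three cars of the same train.
readings = {"car_1": 27_500.0, "car_2": 26_300.0, "car_3": 25_950.0}
counts = {car: estimate_passengers(kg) for car, kg in readings.items()}
print(counts)
```

With such per-car counts, a display could point passengers toward the least loaded car.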
Time series forecasting
In this project, we faced a supervised regression problem, more precisely, a time series use case.
Allow us a little time to explain
“Supervised regression problem […] time series.”– Moov AI Data Scientists
To fully understand this sentence, let us explain some concepts:
- Label: the value the model learns to predict, the “answer” attached to each training example. In this case, the label is the number of passengers.
- Regression: At its simplest, regression is a technique used to predict a number.
So we used past data to predict a number, and since the label is known in the dataset, we are dealing with a supervised regression problem.
We speak of time series simply because the sequence of observations is taken successively in time: we know how many passengers are in the train cars at specific moments in time.
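To make the “supervised time series” framing concrete, the sketch below turns a series of per-minute passenger counts into (features, label) pairs with a sliding window. The function name make_supervised, the number of lags and the toy values are assumptions for illustration; the article does not specify which features the real model uses.

```python
def make_supervised(series, n_lags, horizon):
    """Build (features, label) pairs from a time series.

    features: the last `n_lags` observations up to and including time t
    label:    the observation `horizon` steps after time t
    """
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        features = series[t - n_lags + 1 : t + 1]
        label = series[t + horizon]
        rows.append((features, label))
    return rows


# Toy per-minute passenger counts for one car (made-up values).
counts = [12, 15, 14, 20, 25, 22, 18]

# Use the last 3 minutes to predict the count 2 minutes ahead.
rows = make_supervised(counts, n_lags=3, horizon=2)
for features, label in rows:
    print(features, "->", label)
```

Because every label is taken from the recorded history, any regression algorithm can be trained on these pairs. A 15-minute horizon would correspond to horizon=15 on per-minute data.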
The main challenges we had to overcome
Obviously, like every innovation project, this one sent us some curveballs.
1) Huge data preparation challenge
It is often said that in data science, 80% of the time is spent finding, cleaning, and organizing data. The remaining 20% is dedicated to analysis. This means that most of the work in data science is done around… data!
“Real-time” data comes from the Azur train cars every minute. Because each reading is taken at an arbitrary moment, the train can be accelerating, decelerating, or completely stopped, which makes the data from the load sensors very volatile.
Due to the variable train transit time (2, 5, 7, or 10 minutes), we had to use interpolation and imputation techniques to counter volatility and fill in “missing” data gaps to train the model. Our data developers put a lot of emphasis on data pre-processing and ETL scripting.
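As one possible way to fill such gaps, the sketch below resamples irregular observations onto a one-minute grid using linear interpolation. Linear interpolation is chosen here purely for illustration; the article does not say which interpolation or imputation methods were actually used, and the function name resample_per_minute and the sample values are hypothetical.

```python
def resample_per_minute(observations):
    """Linearly interpolate (minute, value) pairs onto a one-minute grid.

    `observations` must be sorted by strictly increasing minute.
    """
    filled = []
    for (t0, v0), (t1, v1) in zip(observations, observations[1:]):
        for t in range(t0, t1):
            frac = (t - t0) / (t1 - t0)  # 0.0 at t0, approaching 1.0 near t1
            filled.append((t, v0 + frac * (v1 - v0)))
    filled.append(observations[-1])  # keep the final observed point
    return filled


# Made-up passenger counts observed at minutes 0, 2 and 7.
obs = [(0, 10.0), (2, 20.0), (7, 30.0)]
grid = resample_per_minute(obs)
for minute, value in grid:
    print(minute, round(value, 1))
```

Observed points are kept as they are and only the minutes in between are estimated, so the model receives one value per minute regardless of how far apart the original readings were.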
2) The solution implemented had to provide predictions very quickly.
Since we wanted predictions every minute, the system had to deliver a very good level of performance.
A high-performance system implies a high-performance architecture and code. To do this, we used powerful virtual machines on the Microsoft Azure Cloud platform.
We also paid attention to the performance of the system in preparing the data and how we used the variables to feed the model.
For a system that predicts once a minute, data preparation and inference must stay lean; otherwise each prediction takes too long to be useful. There is a fine line to walk between good predictions and a high-performance system.
The inevitable data drift
Data drift is a divergence between the data used to train the model and the “real” data seen in production.
Let’s take the example of a model that predicts a “cat” or a “dog” and has been trained using images of cats and dogs. Once put into production on a website, if users input pictures of flowers, the prediction will be wrong. Not because the model is bad, but because the data presented by the user is not the same as the training data set.
The same drift occurs with data in text or numeric format. It can occur for several reasons:
- Change in the input data by a system (e.g. the company that provides load sensors changes its implementation)
- Change in data processing in the system (e.g. STM engineers decide to change load sensor parameters)
- Change in the behavior of users (e.g. the COVID-19 pandemic reduced métro passengers by 70% overnight)
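A simple way to notice a shift like the ones listed above is to compare recent production values with the training data. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The threshold, the function names drift_score and has_drifted, and the sample values are assumptions for illustration; this is not the monitoring used in the project.

```python
from statistics import mean, stdev


def drift_score(train_values, live_values):
    """Shift of the live mean, in units of the training standard deviation."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)


def has_drifted(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold` deviations."""
    return drift_score(train_values, live_values) > threshold


# Made-up passenger counts per car before and during a lockdown.
pre_pandemic = [80, 95, 100, 110, 90, 105]
lockdown = [25, 30, 28, 35, 27, 32]

print(has_drifted(pre_pandemic, lockdown))      # large shift in the mean
print(has_drifted(pre_pandemic, pre_pandemic))  # identical data, no shift
```

A check of this kind only looks at the mean of one variable; comparing full distributions or several features would catch subtler changes.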
The risk of data drift in this algorithm was intimately linked to the COVID-19 situation. We had to take into account the fact that there was some drift in the data we used to train the model.
The level of overcrowding has never been so low in the métro given the abrupt change in the users’ behavior.
In addition, our solution had to be able to adapt to a new lockdown, to a partial or total lifting of restrictions, and even to the end of the pandemic.
To do this, we trained the model over a smaller time window and set up an infrastructure that allowed for easy re-training.
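The idea of training on a smaller, recent time window can be sketched as follows. The “model” here is deliberately trivial (an average per hour of day), and the window length, function names and records are all hypothetical; the point is only to show how restricting the training data to recent days lets the model follow the current ridership level.

```python
from datetime import datetime, timedelta


def recent_window(records, now, days):
    """Keep only (timestamp, passengers) records from the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [r for r in records if r[0] >= cutoff]


def fit_hourly_mean(records):
    """Toy stand-in for a model: average passenger count per hour of day."""
    sums, counts = {}, {}
    for ts, passengers in records:
        sums[ts.hour] = sums.get(ts.hour, 0) + passengers
        counts[ts.hour] = counts.get(ts.hour, 0) + 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


# Made-up history: one pre-lockdown record and two lockdown records at 8 a.m.
history = [
    (datetime(2020, 2, 10, 8, 0), 100),
    (datetime(2020, 4, 20, 8, 0), 30),
    (datetime(2020, 4, 21, 8, 0), 20),
]
now = datetime(2020, 4, 22, 0, 0)

full_model = fit_hourly_mean(history)
recent_model = fit_hourly_mean(recent_window(history, now, days=14))
print(full_model, recent_model)
```

Re-running the same two steps on a schedule is one simple form of re-training: as behaviour changes, old records fall out of the window and stop influencing the predictions.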
At the end of the day, the model predicts the number of passengers per train car, to within 3 people, which is excellent and was in line with the STM’s expectations.
You will see the results of this project on the screens of your favourite stations on the orange line and on the STM website.
We are proud to have been able to put our brains to work to give Montrealers who use the métro the confidence to return to the subway facilities safely.