Making Passenger

Building scale resilience into our m-ticketing servers

Switching our servers over to Amazon Web Services’ RDS has made Passenger’s mobile ticketing service more resilient to potential outages.

15th Oct 2019

Our Making Passenger articles are designed to provide more insight into the technological upgrades we have made behind the scenes.

Passenger’s mobile ticketing functionality is the highest-traffic service in the Passenger ecosystem. Our servers are busy every minute of every day processing requests, supplying tickets to customers and accepting payments.

Mobile ticketing is also our most critical service, so it is of paramount importance that it runs with the highest server reliability and uptime possible. We have little tolerance for disruptions around mobile ticketing; it is vital that our partners and the customers they serve experience a consistently high level of reliability from the feature.

That’s why we must always be thinking ahead. As we onboard new operators to the Passenger ecosystem, we have been preparing our mobile ticketing servers for an increase in traffic and workload. Part of this process was switching to Amazon Web Services’ Relational Database Service (RDS).

How our mobile ticketing servers have worked in the past

The processes running our mobile servers have always been founded on two core requirements: robustness and dependability.

Prior to the implementation of AWS RDS, the mobile ticketing servers would run the Passenger application and its database on the same server. Keeping both on one machine reduced complexity and kept latency between the two components low.

If Passenger ever experienced a major ticketing outage – which thus far has never occurred outside of emergency tests carried out on our independent mobile ticketing staging environments – we would “fail over” to a server that was on standby, restoring from the latest database backup. (This essentially means restoring a backup of the primary system onto a secondary system whenever the primary system becomes unavailable.)

This approach has always worked well for us. However, there is always scope for improvement – especially as we continue to onboard new operators. As such, we recently re-evaluated the configuration to see how it could be enhanced.

How we’ve improved our mobile ticketing servers

To improve our mobile ticketing server resiliency against outages, we have migrated our ticketing databases to AWS RDS.

AWS RDS’s solutions enable higher availability of mobile ticketing services, a faster failover time if a critical issue ever occurs and greater scalability as we onboard new operators and more mobile ticketing requests are made.

We performed this migration process quietly behind the scenes, with the aim that customers purchasing mobile tickets would observe no disruption in their ticket buying processes.

We planned and tested the migration on our staging environments multiple times, which enabled us to anticipate potential issues, and wrote multiple scripts to automate the migration process and minimise human error as much as possible.

A closer look at the technical process

Here’s a deeper technical dive into this process for those keen to learn about the details.

We created what is known as a Multi-AZ (high availability) database cluster in RDS with automatic failover and backups. We loaded the most recent Passenger database backup, enabled replication from our old databases to the new AWS RDS ones (meaning we wouldn’t lose any data at all during the switch), then waited for the databases to come into synchronisation.
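For readers who like to see things in code, the two ideas in this step can be sketched in Python: describing a Multi-AZ instance with automated backups, and waiting for replication to catch up. This is an illustration under assumptions, not the scripts used in the migration. The dictionary keys are parameter names accepted by boto3’s RDS create_db_instance call, but the engine, instance size, storage and retention values are placeholders, and multi_az_instance_params, wait_until_in_sync and the lag reader are hypothetical helpers.

```python
import time
from typing import Callable


def multi_az_instance_params(identifier: str, engine: str = "mysql") -> dict:
    """Build keyword arguments for boto3's RDS create_db_instance call.

    Every concrete value is an illustrative assumption. Credentials and
    network settings are deliberately left out of this sketch.
    """
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,                  # assumed; the article does not name the engine
        "DBInstanceClass": "db.m5.large",  # assumed instance size
        "AllocatedStorage": 100,           # GiB, assumed
        "MultiAZ": True,                   # keep a standby in a second Availability Zone
        "BackupRetentionPeriod": 7,        # days of automated backups, assumed
    }


def wait_until_in_sync(read_lag_seconds: Callable[[], float],
                       max_lag: float = 0.0,
                       poll_interval: float = 1.0,
                       timeout: float = 3600.0) -> bool:
    """Poll the replication lag until it drops to max_lag or the timeout passes.

    read_lag_seconds is a hypothetical callable that reports how far the new
    database is behind the old one, in seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_lag_seconds() <= max_lag:
            return True
        time.sleep(poll_interval)
    return False
```

With AWS credentials configured, the dictionary (plus the omitted credentials and network settings) could be passed as boto3.client("rds").create_db_instance(**params). The lag reader is injected as a function so that the waiting logic can be exercised without a real database.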

Once we were ready and had confirmed everything was working, we configured our ticketing service to be read-only, switched the RDS cluster to become the parent server (the parent being the instance that handles any changes to the server cluster overall) and moved our Passenger mobile ticketing applications to use the new cluster in read-write mode. This switch happened so quickly (in under 10 milliseconds) that not a single write request was denied. Success!
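The order of those steps matters, and a small simulation makes it easier to follow. The Python sketch below is a toy model under assumptions: Database, TicketingApp and cut_over are hypothetical names, replication is simulated by comparing in-memory lists, and nothing here talks to a real database.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    name: str
    is_parent: bool = False
    rows: list = field(default_factory=list)


@dataclass
class TicketingApp:
    database: Database
    read_only: bool = False

    def write(self, row: str) -> bool:
        """Return True if the write was accepted, False if it was denied."""
        if self.read_only or not self.database.is_parent:
            return False
        self.database.rows.append(row)
        return True


def cut_over(app: TicketingApp, old: Database, new: Database) -> None:
    """Move the app from the old parent database to the new one."""
    # 1. Stop accepting writes so the old database stops changing.
    app.read_only = True

    # 2. Check that the new database holds everything the old one does.
    if new.rows != old.rows:
        app.read_only = False  # revert: carry on serving from the old database
        raise RuntimeError("new database is not in sync; staying on the old one")

    # 3. Promote the new database to parent.
    old.is_parent = False
    new.is_parent = True

    # 4. Point the app at the new parent and accept writes again.
    app.database = new
    app.read_only = False
```

Because writes are paused before the sync check and resumed only after the app has been repointed, no write can land on the old database after the new one takes over. If the check fails, the app simply resumes on the old database.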

Furthermore, this database migration happened with zero downtime, meaning customers didn’t experience a single blip in their mobile ticketing experience. If we had noticed any trouble, we would have been able to instantly revert to our old server solution at any stage of the migration. So – even if something did go wrong, customers would have continued to experience a solid mobile ticketing service.

After the database work was complete, we also removed any dependencies on disk caching to allow for improved horizontal scaling (that is, the ability to add more servers, rather than more CPU or RAM, to our system, making it more dependable).
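To see why a cache that lives on one server’s disk gets in the way of adding servers, consider the toy Python comparison below. It is a sketch under assumptions: AppServer, count_db_loads, the “day-rider” ticket type and the fare value are all hypothetical, plain dictionaries stand in for the caches, and the shared store is only one possible alternative to per-server caching, not necessarily the one Passenger chose.

```python
class AppServer:
    """A ticketing app server that caches lookups in whatever store it is given."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache  # dict-like; local to this server, or shared by all

    def fare(self, ticket_type, load_from_db):
        # Only go to the database when the cache has no entry.
        if ticket_type not in self.cache:
            self.cache[ticket_type] = load_from_db(ticket_type)
        return self.cache[ticket_type]


def count_db_loads(servers):
    """Ask every server for the same fare and count the database reads."""
    loads = []

    def load_from_db(ticket_type):
        loads.append(ticket_type)
        return 4.50  # hypothetical fare

    for server in servers:
        server.fare("day-rider", load_from_db)
    return len(loads)


# Local caches: every server starts cold and keeps its own private copy.
local_servers = [AppServer(name, {}) for name in ("a", "b", "c")]
local_loads = count_db_loads(local_servers)

# Shared store: servers are interchangeable, so a new one adds no extra reads.
shared_store = {}
shared_servers = [AppServer(name, shared_store) for name in ("a", "b", "c")]
shared_loads = count_db_loads(shared_servers)
```

In this model the three servers with local caches read from the database three times, while the three servers using the shared store read once between them. Servers that hold no private state are easier to add and remove.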

We also reconfigured our application to maintain performance when traffic increases. If we do see more traffic on the mobile ticketing servers, we simply hit a button and a new read child is added to the cluster. (A “read child” is a server controlled by the “parent” server instance; it serves read requests.) We call this “load balancing”. It uses the overall cluster more effectively: without it, a single server could slow down or become unavailable under too much traffic. Thanks to this new process, we can avoid that issue.
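As a rough illustration of the idea, the Python sketch below sends writes to the parent and spreads reads across the read children in turn. It assumes simple round-robin selection, which is only one of several possible strategies; ClusterRouter and the server names are hypothetical.

```python
class ClusterRouter:
    """Send writes to the parent and share reads between the read children."""

    def __init__(self, parent, read_children=()):
        self.parent = parent
        self.read_children = list(read_children)
        self._next = 0  # position of the next read child to use

    def add_read_child(self, name):
        """Add capacity for reads; the 'hit a button' step in this toy model."""
        self.read_children.append(name)

    def route(self, is_write):
        """Return the name of the server that should handle the request."""
        # Writes always go to the parent. Reads fall back to the parent
        # only when there are no read children at all.
        if is_write or not self.read_children:
            return self.parent
        child = self.read_children[self._next % len(self.read_children)]
        self._next += 1
        return child
```

For example, after router = ClusterRouter("parent", ["child-1"]) and router.add_read_child("child-2"), successive calls to router.route(False) alternate between the two children, while router.route(True) always returns "parent".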

What does this mean for mobile ticketing?

The result is an improved mobile ticketing backend experience. If the customer finds that nothing has changed in their mobile ticketing purchase experience, then we’ve achieved exactly what we intended: a stronger, more resilient mobile ticketing server much less likely to falter under heavy load.

Want to learn more about mobile ticketing and Passenger? Get in touch with the team or sign up to our newsletter.