Passenger Cloud presents itself to operators as a single service. In practice, many individual services come together to create it, for example:

  • Mobile Ticketing
  • Journey Planning
  • Timetabling
  • Places Search
  • Live bus tracking

Each of these services has its own complexities, and they can depend on one another: changing one service may affect others in unexpected ways. Some services also span multiple servers, which helps with things like high availability.

In the past

Passenger has made strong use of monitoring systems such as NewRelic and Sentry to provide alerts and monitoring across Passenger services.

Previously, when an issue was discovered, an engineer had to spend time exploring it manually: accessing each server directly and piecing together information from NewRelic and Sentry to find the underlying cause and develop a fix.

NewRelic alerted us to critical external failures and large changes in response times. However, we did not have a historical log against which to compare long-term trends. We also could not alert on specific problems with servers, only on their impact on serving end-user requests.

What we wanted to improve

  1. More data – the ability to collect any piece of information from a server/application into a single searchable database.
  2. Historical data – the ability to see historical trends.
  3. Dependency – reduce the dependency on third parties, e.g. cloud providers and monitoring services.
  4. Centralised management – a single place to change the configuration.
  5. Integration – the ability to integrate with third parties, e.g. AWS CloudWatch.
  6. Alerts – the ability to alert on any metric collected, real-time or historical.
  7. Communication – the ability to send alerts via multiple transport mechanisms.

What we implemented

After researching the available solutions, we decided to roll out the NPG stack.

  • Netdata – “Real-time performance monitoring, done right”
  • Prometheus – “The Prometheus monitoring system and time series database”
  • Grafana – “The tool for beautiful monitoring and metric analytics & dashboards”

Netdata

Netdata is designed for the real-time monitoring of servers. It can collect and alert on metrics in real-time with a tiny footprint (~36MB for our most complex service). We added Netdata to each individual server across Passenger services.

These Netdata ‘child’ instances can be accessed directly by an engineer. They also stream data in real-time to a Netdata ‘parent’ instance, which provides a single centralised place to see what is currently happening across all instances. The child and parent instances store up to 1 hour of real-time metrics.

If any single Netdata instance becomes inaccessible, it does not affect any other services.

If the Netdata ‘parent’ becomes inaccessible, an engineer can still directly access each individual ‘child’ instance.

Prometheus

Prometheus is designed for monitoring large numbers of metrics across multiple servers. It can poll each server for metrics and then store, query and alert on those metrics within its time-series database.

We deployed Prometheus, configured to poll every Netdata instance every 5 seconds, using the Netdata API exporting the Prometheus format.
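To give a feel for what each poll returns, here is a minimal sketch of reading the Prometheus text exposition format, which Netdata can serve from its `/api/v1/allmetrics?format=prometheus` endpoint. The sample text, metric names and labels below are made up for illustration and are not taken from our servers; the parser is deliberately simplified and does not unescape label values.

```python
import re

# Illustrative sample only: metric names, labels and values are hypothetical.
SAMPLE = """\
# HELP example_cpu_percent Hypothetical CPU usage
# TYPE example_cpu_percent gauge
example_cpu_percent{instance="web-1",dimension="user"} 12.5 1560000000000
example_cpu_percent{instance="web-1",dimension="system"} 3.25 1560000000000
example_disk_used_bytes{instance="web-1"} 4096
"""

# name, optional {labels}, value, optional millisecond timestamp
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value, timestamp_ms or None) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        ts = match.group("ts")
        samples.append(
            (
                match.group("name"),
                labels,
                float(match.group("value")),
                int(ts) if ts else None,
            )
        )
    return samples


if __name__ == "__main__":
    for sample in parse_exposition(SAMPLE):
        print(sample)
```

Every non-comment line is one sample of one series, which is why the number of series grows quickly with the number of servers and labelled dimensions.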

We also deployed the Prometheus blackbox exporter – a tool used in conjunction with Prometheus to probe external endpoints to determine availability.

Prometheus allows us to write complex queries/alerts on this data, across long periods of time (currently 90 days max).

For example, based on the previous 72 hours of disk usage, we can predict when a server’s disk will fill up, allowing us to add resources before any issues arise.
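One common way to make this kind of prediction is to fit a least-squares line through recent samples and extrapolate it; Prometheus's built-in `predict_linear()` function works along these lines. The sketch below shows the idea in plain Python. It is not our actual alert rule, and the function names and sample figures are hypothetical.

```python
def fit_line(samples):
    """Least-squares slope and intercept for (timestamp_seconds, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / variance
    return slope, mean_v - slope * mean_t


def predict_linear(samples, seconds_ahead):
    """Extrapolate the fitted line seconds_ahead past the last sample."""
    slope, intercept = fit_line(samples)
    return slope * (samples[-1][0] + seconds_ahead) + intercept


def seconds_until_full(samples, capacity):
    """Seconds after the last sample until the trend reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical disk: 100 GB used, growing 2 GB per hour, sampled hourly for 72 hours.
    history = [(hour * 3600, 100 + 2 * hour) for hour in range(73)]
    remaining = seconds_until_full(history, capacity=400)
    print(f"disk full in about {remaining / 3600:.1f} hours")
```

A straight-line fit is only a reasonable guide when growth is roughly steady; bursty usage needs a shorter window or a more cautious threshold.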

Grafana

Grafana is designed to make dashboards and graphs.

We deployed Grafana with a variety of dashboards designed for different engineering teams with data most relevant to them. We also integrated Grafana with AWS CloudWatch so we can access more metrics in a single workflow, including billing metrics/alerts.

We integrated our deployment chatbot (affectionately named R2D2) and SSH logs using Grafana annotations. Every deployment, along with which engineer accessed which server and when, appears as an annotation across all dashboards and graphs, so we can easily see the impact of an application or infrastructure change.
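As a rough sketch of how a deployment tool could record such an annotation, the snippet below builds a request for Grafana's HTTP annotations endpoint (`POST /api/annotations`, with the time given in epoch milliseconds). The base URL, token, tag names and text are placeholders, and this is not the code our chatbot actually runs.

```python
import json
import time
import urllib.request


def build_annotation_request(base_url, api_token, text, tags, when=None):
    """Build (but do not send) a request that creates a Grafana annotation.

    `when` is a Unix timestamp in seconds; it defaults to the current time.
    """
    epoch_ms = int((time.time() if when is None else when) * 1000)
    payload = {"time": epoch_ms, "tags": list(tags), "text": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values: replace with a real Grafana URL and API token.
    request = build_annotation_request(
        "https://grafana.example.invalid",
        "PLACEHOLDER_TOKEN",
        "Deployed example-service",
        ["deployment", "example-service"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # urllib.request.urlopen(request) would send it.
```

Tagging annotations (for example `deployment` or `ssh`) lets each dashboard choose which kinds of event to overlay on its graphs.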

Who watches the watchers?

  • Grafana alerts on Prometheus status
  • Prometheus alerts on Netdata status
  • Netdata alerts on everything (including Prometheus and Grafana)
  • Humans watch/are alerted by email and Slack

Results

  • Our Netdata parent ingests over 100,000 series of data points in real-time.
  • Our Prometheus server ingests over 100,000 series of data points every 5 seconds.
  • We currently store 90 days’ worth of historical data at 5-second resolution.
  • Most of our Prometheus rules are evaluated within 2 milliseconds. Our more intensive historical rules, which cover a larger window of data, are still evaluated in less than a second.
  • We have made proactive improvements to services based on the insights from the historical metrics.
  • We have reduced our server costs by more accurately matching resources to demand.
  • We have avoided potential downtime by resolving issues quicker.
  • Everyone is more aware of the state of our infrastructure. We even have a TV in the office showing the state of our core services.
  • Although we’ve had outages in this system due to configuration errors, we’ve never had an outage of every component at once, so we’ve always had access to the data we required.
  • We have a much better understanding of the impact of changes we make to infrastructure and services, and greater confidence in making them.
  • We rebuilt all of our core services in less than a day when the Intel MDS/ZombieLoad incident happened.
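For a rough sense of scale, the ingestion figures above can be turned into sample counts. This is back-of-the-envelope arithmetic only: it treats "over 100,000 series" as exactly 100,000 and assumes every series is present for the full 90 days, so the real numbers will differ.

```python
SERIES = 100_000              # lower bound from the figures above
SCRAPE_INTERVAL_SECONDS = 5
RETENTION_DAYS = 90

SECONDS_PER_DAY = 24 * 60 * 60

samples_per_second = SERIES // SCRAPE_INTERVAL_SECONDS
samples_per_series = RETENTION_DAYS * SECONDS_PER_DAY // SCRAPE_INTERVAL_SECONDS
total_samples = samples_per_series * SERIES

if __name__ == "__main__":
    print(f"samples ingested per second: {samples_per_second:,}")
    print(f"samples kept per series:     {samples_per_series:,}")
    print(f"samples kept in total:       {total_samples:,}")
```

Under these assumptions that is about 20,000 samples per second and roughly 155 billion samples retained.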

Since making these changes, we’ve received some great internal feedback following the success of the project.

Alex Ross, Senior Software Engineer commented:

“With our new monitoring systems we have better insight into what’s going on during an event and we can quickly look at historical data using Grafana to compare a data series. During one of our busiest days in the snow, earlier this year, we were confident as we had enough insight into our infrastructure so that we could focus our efforts on ensuring that our services were stable and available for our passengers. This was a good return on investment as snow days have previously been tricky for us to manage.”

Matt Morgan, Operations Director says:

“The single viewport has helped the team increase efficiency and reduce overall costs, there are very few downsides to that. It’s great to see the data streaming to the dashboards and displayed on the TV’s as this gives everyone in the room an overview of how the platform is performing at a glance.”

Moving forward

We’d like to share our monitoring data with operators, and possibly even the end-users of our services as customer-facing status pages.

We’d like to instrument our applications with StatsD, which would put our application metrics in the same pipeline as our infrastructure metrics.
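To illustrate how lightweight this kind of instrumentation is, here is a minimal sketch of the StatsD line format (`name:value|type`, with an optional `|@sample_rate`) sent over UDP. The metric names are hypothetical, and the host and port are assumptions: 8125 is the conventional StatsD port.

```python
import socket


def format_statsd(name, value, metric_type, sample_rate=None):
    """Format one StatsD metric line, e.g. 'tickets.purchased:1|c'.

    Common metric types are 'c' (counter), 'g' (gauge) and 'ms' (timer).
    """
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a metric line as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical application metrics.
    send_statsd(format_statsd("tickets.purchased", 1, "c"))
    send_statsd(format_statsd("journey_plan.duration", 320, "ms", sample_rate=0.1))
```

Because the datagrams are fire and forget, an unavailable metrics collector does not slow down or break the application that emits them.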

Through the monitoring, we’ve discovered some dependencies between services and identified areas where we can increase resilience by using patterns like automatic circuit breakers, so that a failure in one service does not cascade to others.
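For readers unfamiliar with the pattern, a circuit breaker stops calling a dependency after repeated failures and only retries once a cool-down period has passed. The class below is a minimal single-threaded sketch of that idea, not something we have deployed; the class name, thresholds and timeouts are illustrative.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def flaky_places_search():
        # Hypothetical dependency that is currently failing.
        raise ConnectionError("places search unavailable")

    for attempt in range(3):
        try:
            breaker.call(flaky_places_search)
        except CircuitOpenError as error:
            print("skipped:", error)
        except ConnectionError as error:
            print("failed:", error)
```

Callers can catch `CircuitOpenError` and fall back to cached or reduced results, which is what keeps one failing service from dragging down the others.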

Photo by edwin josé vega ramos