Jamie Lawrence

The evolution of incident response at Podia

Over the years, Podia has gone from having no incident response processes to a more structured system of on-call shifts and supportive tooling. You might be thinking of embarking on the same journey, be confused about why any developer would embrace the responsibility of on-call shifts, or be wondering how you can introduce it without pissing off your entire team.

Here’s how I did it.

Life without on-call

It’s blissful, right? No late-night phone calls. No need to remember your laptop or modify your plans. Nothing ever goes wrong. The app never goes down, your providers never fail, deploys never break anything, and customers are never impacted.

I mean, that’s what you think happens but ignorance doesn’t make it true. And this is not the reality we lived under.

In the earliest days, we had no monitoring at all, which is where all projects start out. Pretty soon, though, I added Rollbar to track exceptions that happened in production and pipe those notifications into Slack. And then we noticed the app could be briefly unresponsive at times, so we added a simple uptime checker (UptimeRobot) to tell us if the app stopped responding.

Back in the early days, when Podia had just one, two, three, or four full-time developers, we didn’t have any defined processes for handling these notifications. Instead, each of us had a typical daily routine that went something like this: wake up, check Slack for errors (because of course it was on our phones), have breakfast, check Slack for errors, do some work whilst checking Slack for errors, run some errands whilst checking Slack for errors…

Having every developer poll Slack for news about problems is not very efficient, and over the long term it led to a lot of internalised stress and apprehension. Constantly logging into Slack, especially during those moments of personal time when we were supposed to be resting, quickly developed into a neurotic pattern of behaviour.

I remember being in town one Saturday with the family and checking Slack out of habit as I was walking down the street. There I saw a ton of errors for, I think, our background jobs being killed due to OutOfMemory errors. Since I was about an hour from home, without a laptop, and supposed to be enjoying an afternoon with the family, there wasn’t much I could do about the problem. Luckily, Jason had woken up in Tennessee, saw the messages, and was able to resolve the issue.

We were lucky because we didn’t have many users at the time and they weren’t actively using the platform at all times of the day and week. But that was changing fast…

Introducing on-call

I knew we needed an on-call system but I was also extremely wary of it becoming an exploitative system that the developers would come to resent.

In the 20 years of my career, I’d never had to be on-call for a system. I mostly worked in research or enterprise apps, and briefly in a Level 3 Support role, but if you’d asked me I’d have said that being on-call was for the IT staff. I also, frankly, would never have cared enough about those jobs or trusted any of the companies I worked for to manage the on-call shifts in a responsible manner. The web has changed all that. In particular, the global reach of our apps means we have a never-off cycle of traffic that cannot be confined to any normal business hours. Things will go wrong when we are asleep and they will impact our users.

I used these principles when designing our on-call system, though most flow from the first:

  1. Minimise the impact on our personal lives
  2. Being on-call should be as low-stress as possible
  3. Restrict the number of things that can trigger an incident
  4. Have reasonable expectations around response times, etc.
  5. The whole team has responsibility for maintaining the app so we will work as a team on-call

How often?

First, I needed to decide on the shift structure for the team. The most popular rotation schedules are weekly (on-call for 7 days), daily (on-call for 24 hours), or something more sophisticated like follow-the-sun or business-hours/out-of-hours cover. For me, the only reasonable choice was between weekly and daily.

Weekly schedules have the advantage that they don’t come around too often. With just four of us, you’d only be on-call for 1 week in every 4. The downside though is that you’re going to be carrying your laptop around for a whole week and that’s very disruptive. Selfishly, it would have meant that I couldn’t swim for an entire week and that was a major disadvantage to me.

In my opinion, 24-hour shifts strike the right balance, so that’s what we went with and it hasn’t changed since. Your life is only mildly inconvenienced for a day, so it’s relatively easy to schedule things around being on-call. It’s also easier to find someone to cover a shift for you, it doesn’t interfere with a whole weekend, and if we hit a patch of poor reliability those incidents are more likely to be distributed across the team. The downside is that you’ll be on-call more frequently, but as the team grows the on-call frequency drops.
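To make the arithmetic concrete, here’s a minimal sketch in Python of how a simple 24-hour rotation maps a calendar day to a person. The roster names and start date are made up, and our real schedule lives in our paging tool rather than in code; the sketch just illustrates that each person is on-call one day in every N, so growing the roster directly reduces the frequency.

```python
from datetime import date

# Hypothetical roster and start date; in reality the rota lives in the paging tool.
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2024, 1, 1)

def on_call(day: date, roster=ROSTER, start=ROTATION_START) -> str:
    """Return who holds the 24-hour shift on the given day.

    Each person is on-call one day in every len(roster) days, so adding
    people to the roster directly reduces how often a shift comes around.
    """
    days_elapsed = (day - start).days
    return roster[days_elapsed % len(roster)]

print(on_call(date(2024, 1, 10)))  # 9 days after the start -> index 1 -> "bob"
```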

What wakes us up?

As a kid I remember sleeping out one night in a tent in our back garden—long before mobile phones and clearly before my parents would trust us with a house key. But my dad had thought of an ingenious solution in case we needed to come into the house in the middle of the night: he tied a piece of string around his toe, and we were to tug on it if we needed to be let back in. What things could wake us up when we were on-call? What alerts would be tied to our toes with a piece of string to wake us up in the middle of the night? That’s a lot of trust to place in something.

I knew there would be pressure to expand the on-call system over time, especially as users encountered bugs or wrote in with irate support requests because we didn’t respond to their particular issue overnight. We had to be ready to resist the scope-creep of being on-call in order to make it sustainable, so I felt it was important to place limits on the process.

I settled on just three events worthy of waking us up, and if we ever needed to add another alert, then one of these would have to go:

  1. the app is down, as detected by some external monitor
  2. our background job latency exceeds a defined threshold, meaning too many jobs are piling up in the queue
  3. our custom domain proxy is not working

Today, the first two events are still the main alerts that we use. Our custom domain proxy has been replaced by Cloudflare so it’s no longer a source of downtime, and is no longer monitored separately. Instead, our third alert source is the facility for anyone in our Slack—typically our support team—to declare an incident and page the on-call developer. It’s rarely used and I think it’s taken time to build that trust. I’m not sure we’d have introduced it on day 1.

It’s also important to have some expectations around how often we expect incidents. For us, it’s once or twice a month. They should not be daily or weekly occurrences.

I’m pretty satisfied with these criteria, and if you were starting out today I would still recommend paging the on-call developer only if a) the app is down or b) the background jobs aren’t processing at the rate you expect. Everything else is noise, and anything significant will usually trigger one of these alerts.

I am pretty definite that exceptions are not incidents. An exception can be generated by a malformed client, a user encountering a transient issue just once, or a background job hitting a timeout and then retrying successfully. Or a validation that triggered an error even though the user completed the form anyway. Hell, even if they couldn’t complete the form, it’s still something you can solve tomorrow. Unless you are in an industry with severe consequences for errors, none of these things rise to the level that should wake your team up.

Expectations

I had read accounts of on-call engineers at Google being required to always be within arm’s reach of their laptops and that didn’t sound like a life I wanted to live—or provide for the rest of the team. Especially as many of us had families and hobbies, and our app being down for a few minutes was not going to kill anyone.

Our expectations are, in my opinion, much more reasonable: respond within 10-15 minutes. That means you can leave the laptop in the car when you go shopping, or go for a short walk near your house. We just expect that the developer will be ready to triage the incident within this reasonable timeframe.

I even addressed whether you can drink when on-call: sure, just don’t operate production systems when drunk! We all might enjoy a beer or glass of wine after work and I didn’t want to restrict our team’s social lives, and neither did I want anyone to hide the fact they’d been out on a heavy session when on-call. If you’re drunk, please don’t acknowledge the incident!

Teamwork

I’ve always been very clear that the app is the whole team’s responsibility. It’s not just mine. And you can’t just claim that little corner over there as yours. We all have a collective responsibility to keep this thing alive. “It’s not my job” is not an attitude that’s tolerated at Podia: if you ship code at Podia then it’s your job to look after it.

And that means when you ship some dodgy code, it’s going to wake you, or one of your colleagues, up at night. Or when you review a PR and don’t speak up about your concerns, that’s the thing that could wake you up. When we advocate for infrastructure choices, we do so knowing that all of us will be on-call to support them.

Incidents are best played as a team sport since, ideally, most of them should be novel situations that we haven’t encountered before.

When a developer acknowledges an incident, they will start to triage it to understand what happened and why, what the resolution might be, and so on. It’s rarely something that you can look up in a playbook and resolve by running through a sequence of commands. We used to have a “runbook” but most of those situations have since been engineered away, so we rarely encounter them any more.

There’s nothing more frightening than being woken by a 4am call, bleary-eyed, scrambling for your laptop, and then trying to decide all on your own how best to handle the backlog of jobs in the queue. Do you pause the process? Delete the enqueued jobs? Increase the concurrency? Yikes, no. You need some help.

It’s the general expectation that if you see an incident in-progress, you join it and offer to help. You never let your colleague handle the incident on their own as you continue to work on whatever you were doing—because nothing you could be working on is more important than helping out a teammate.

Since Day 1 we’ve announced all incidents in Slack as soon as the alert is triggered, and then paged the on-call developer a minute or two later. This means that if the incident happens at 4am in New York, an EU developer working at 9am can spot the incident notification and acknowledge it, thereby saving their colleague from a very early morning wake-up call. It’s not only the on-call developer who can acknowledge incidents.
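The ordering is the important part: announce first, page a little later. As a rough illustration of the pattern (this is not our actual integration, which our paging tooling handles for us), here’s a Python sketch that assumes a standard Slack incoming webhook plus hypothetical placeholder functions for the paging side:

```python
import time
import requests  # plain HTTP client; any would do

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # hypothetical incoming webhook
ANNOUNCE_TO_PAGE_DELAY = 90  # seconds between the Slack announcement and the page

def announce_then_page(summary: str) -> None:
    """Post the incident to Slack first, then page the on-call developer.

    The delay gives anyone already awake and online a chance to acknowledge
    the incident before a colleague's phone goes off at 4am.
    """
    requests.post(SLACK_WEBHOOK_URL, json={"text": f":rotating_light: Incident: {summary}"})
    time.sleep(ANNOUNCE_TO_PAGE_DELAY)
    if not already_acknowledged(summary):
        page_on_call(summary)

def already_acknowledged(summary: str) -> bool:
    # Placeholder: in practice the paging/incident tool tracks acknowledgements.
    return False

def page_on_call(summary: str) -> None:
    # Placeholder: in practice this would call the paging provider's API.
    print(f"Paging the on-call developer about: {summary}")
```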

This broadly works because Podia developers are spread across European and North American timezones, so we have pretty good coverage for 12-16 hours of the day during the week. But in cases when the developer is still left dealing with the incident on their own, they can page me (I’m CTO so I’m always on-call even when I’m not) or another developer on the team to help. We now even have a prompt in the incident channel after a short delay asking “Do you have everyone you need to resolve this incident?” and instructions for how to page someone else.

I don’t like incidents; they’re stressful for the team and each one impacts our users. But they are also a special bonding moment for us as a team. There’s a great camaraderie that comes from working together under a (hopefully!) brief stressful situation, solving a puzzle together, and remediating the problem. A little bit of adrenaline is good for the soul! There’s also a counter-intuitive argument that you can’t get good at something that only happens occasionally. We need incidents (😅) so that we can collectively learn how our systems function and how to respond to them when they fail.

Being on-call also gives us other ways to demonstrate teamwork and strengthen relationships, which is particularly important on a remote team. Shift swapping is common and encouraged; being flexible is another way of limiting the impact on each of our lives, and another opportunity to say “I got you”. When someone joins the team, they’re not on-call for the first month. If someone has been up handling an incident overnight, I’ll take over the rest of their shift and tell them to take the morning off. When someone has a baby, we take them off the on-call rota for a few months. When someone leaves the team, I typically take over that person’s shifts for a while so the team doesn’t have to absorb the impact straight away. All small gestures of teamwork.

All in all, having an on-call rota is one of the few ways we can work together as a whole team rather than working on projects in pairs.

On-call, over time

Our on-call process was introduced a good few years ago now and it has largely remained unchanged. It’s still a 24-hour rota, only three things can wake us up, and we treat it as a whole-team responsibility. Some things have changed though.

Team size

When we started out, I think there were only 4, maybe 5, of us on the rota. At our largest, there were 12 people on the team, which led to a very relaxed on-call frequency. If anything, that was verging on being too infrequent, and had the team scaled to more than 15 developers, I think I would have introduced a two-layer on-call system so there would always be a primary and a secondary developer on-call.

At the moment, we’re just 8 people on the dev team, which narrowly avoids the awkward situation with 7 people, where a daily rotation cycles in lockstep with the 7-day week and the same person would always be on-call on the same day of the week.

I do think it’s worth having some on-call system even with just a single developer, if only to avoid obsessively checking Slack, but generally the more people in the rota, the better it is for everyone. Fewer than 4 developers is always going to be a tough situation.

We also now have the entire leadership team pageable (but not on a rota) so for larger incidents we can pull in Support or Marketing help to communicate with our users about the incident.

Tools

Our tool choice has also changed over the years. We started out using OpsGenie because the UI didn’t make my eyes bleed like PagerDuty’s did. Not that it was great, but it was mostly usable.

We also used UptimeRobot to monitor the app’s uptime across a few different endpoints and assess the background queue latency by polling an endpoint that reported the metric.
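For illustration only, a minimal version of that kind of endpoint might look like the sketch below. It assumes Flask and a made-up oldest_enqueued_at() helper standing in for whatever your job backend reports; the uptime monitor just polls the URL and raises an alert when it sees a non-200 response (or no response at all).

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)

LATENCY_THRESHOLD_SECONDS = 300  # hypothetical: alert if jobs wait more than 5 minutes

def oldest_enqueued_at() -> float:
    # Placeholder so the sketch runs; a real app would ask its job backend
    # (Sidekiq, Celery, etc.) for the enqueue time of the oldest pending job.
    return time.time()

def queue_latency_seconds() -> float:
    """Age in seconds of the oldest job still waiting in the queue."""
    return time.time() - oldest_enqueued_at()

@app.get("/health/queue-latency")
def queue_latency():
    latency = queue_latency_seconds()
    status = 200 if latency < LATENCY_THRESHOLD_SECONDS else 503
    # The uptime monitor treats a non-200 response (or a timeout) as an incident.
    return jsonify({"queue_latency_seconds": round(latency, 1)}), status

if __name__ == "__main__":
    app.run(port=8000)
```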

Our stack today starts with Cronitor, which has taken over app monitoring duties from UptimeRobot. When Cronitor detects a problem, it triggers an incident in incident.io. This provides us with a Slack app for managing the incident process: it auto-creates a Slack channel, broadcasts the incident notification to a company-wide #podia-updates channel, pages the on-call developer, and triggers some timed prompts during the incident (like “Do you have everyone you need to resolve this incident?”). We can also assign tasks inside the incident Slack channel, pin messages for an after-action incident report, and update our internal incident status. It also makes it incredibly easy to escalate the incident to another developer, knowing that it will break through their do-not-disturb settings if required. If necessary, anyone in the company can create an incident from within Slack using incident.io.

Our status page is also hosted by incident.io so anyone at the company can update it from within Slack. This is great during an incident when the devs might be heads-down trying to figure out the problem and someone from support can craft a more detailed message for our users.

Once the incident is resolved, we can write up an incident report using a timeline compiled automatically by incident.io from the posts in the Slack channel, including relevant screenshots, code snippets, PR links etc. We’re not particularly rigorous with this process but it still represents a valuable body of work, especially for new developers joining the team and wondering what they’re getting themselves into.

We also rely heavily on Dataset (a.k.a. Scalyr before they were acquired) to give us an at-a-glance dashboard and the ability to filter, search, and chart data from our logs. We use AppSignal for exception tracking and more fine-grained performance profiling.

But the tools don’t really matter. If I was doing this today from scratch, I’d probably look at BetterStack because they literally have everything you need under one roof.

Takeaways

What matters first is thinking about how an on-call rota can improve the lives of your team and business—being on-call sometimes is infinitely better than everyone constantly checking Slack on their time off and living with the permanent stress that the app might be down.

Once you’ve accepted the necessity of, and desire for, an on-call rota, you can design the process in such a way that it minimises the stress for those involved and promotes stronger ties within the team. You can place limits on it and make it flexible enough that the impact is quite minimal.

These days, being on-call for Podia means I throw my laptop and AirPods in a bag when I leave the house. Or I choose to do jobs around the house. I can’t swim for that 24-hour period but that’s OK: I can still go to the gym, run errands, go out for a meal, or watch a movie at the cinema.

I can do that because the demands are low, I know the team will have my back if they’re working anyway, and incidents are pretty rare. Typically we have ~1 incident per month, and these rarely involve waking someone up: usually it’s caught by a developer who was already working in their timezone, or it occurred during normal working hours (because—sssshhhhhh—the main cause of downtime is actually developers doing developer-things 😅. Who knew?!). At times, we have had spurious alerts that triggered an incident before the alert auto-resolved. These were very frustrating for everyone, but we try to investigate them and put preventative measures in place to squash them.

No system is perfect but I do believe you can structure the process to make it less burdensome.