How We Built an AI Copilot to Make On-Call Less Painful
Getting hit with a critical alert is never fun.
Getting one you don’t understand is even worse.
As our engineering team at Kraken scaled, we began to feel the pain of production alerts in all the usual ways: too many alerts, unclear meaning, and insufficient context.
Being on-call is challenging - alerts can come at any time, from any part of the system, and you are responsible for quickly triaging and mitigating potential crises. It reminded me of one of the hard lessons I learned at Amazon: alerts are only useful if someone can act on them quickly.
So during the Kraken AI Hackathon, we asked the obvious question: “Could AI make on-call a little less painful?” Over two days, we hacked together PingPanda, an AI Slackbot that helps engineers triage alerts faster:
AI assessment for alerts: it reads stack traces, explains what might be going wrong, and suggests fixes.
Urgency emojis (🔴 🟠 🟡 🟢) so you can spot high-priority issues instantly.
Custom context per team, loaded via Slack channel-specific prompts.
The annotated screenshot explains more, but the core idea is simple: reduce the mental overhead of responding to alerts.
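To make the urgency labels concrete, here is a minimal sketch of the emoji-mapping step. The severity levels, names, and message format are illustrative assumptions rather than the actual PingPanda code.

```python
# Hypothetical sketch: map an LLM-assessed severity to the emoji the bot
# prepends to its Slack message. Names and levels are illustrative only.
URGENCY_EMOJI = {
    "critical": "🔴",
    "high": "🟠",
    "medium": "🟡",
    "low": "🟢",
}

def format_alert_summary(severity: str, summary: str) -> str:
    """Prefix the AI summary with an urgency emoji so it stands out in Slack."""
    emoji = URGENCY_EMOJI.get(severity.lower(), "🟡")  # default if the model is unsure
    return f"{emoji} {severity.upper()}: {summary}"
```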

The Problem: Alerts Don’t Scale
Back in 2023, when I joined the Telco department within the company, we had just 10 engineers and a handful of early customers. Since then, we’ve grown significantly: multiple teams, dozens of engineers, and hundreds of thousands of customers.
As we grew, so did the number of alerts. Because our core product is a CRM, most of our alerts aren’t about infrastructure failures but about application failures. As customer accounts evolve over their lifetime, they hit complex edge cases and trigger unexpected side effects. These failures are hard to anticipate and even harder to debug.
What made it worse was that newer engineers didn’t yet understand the whole system. Even experienced engineers hit knowledge gaps, especially when an alert spanned multiple teams or domains. Knowledge was becoming more fragmented, and our engineers were struggling to keep up.
We needed something lightweight and immediate, with a deep understanding of the larger system and processes. Today’s LLMs are well-suited for this problem space: they have strong coding abilities and large context windows.
We Aimed Too High (at first)
Our original idea was too ambitious - a fully automated agent that would handle production alerts and apply the appropriate fix. The timing felt right. New tools like OpenAI’s Codex CLI, Claude Code, and GitHub Copilot Agent were all released around May 2025 with big claims.
When we tried them, it didn’t go well.
Codex couldn’t handle the size of our main repo (we have a giant monolith). It was slow or completely unresponsive.
Claude Code was better, but it was too eager, suggesting quick fixes that would mask the underlying problem instead of tackling the more complex root cause.
Copilot Agent was the least reliable. It made inconsistent changes across files. It once made a change, then later denied making it.
That was the turning point: we gave up on full automation.

The takeaway was clear: the tools are getting better, but we’re not yet at the point where we can have AI agents fix real production issues without a human in the loop.
Pivot: Don’t Fix, Just Explain
So we shifted our goal. Instead of trying to automate the fix, what if the AI just helped explain the alert?
We started by sending stack traces to Anthropic’s LLM, prompting it to explain what went wrong and to show its reasoning (a technique known as chain-of-thought prompting). And honestly, the results were good. Clear, concise, and fast.
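Conceptually, that first version was little more than a single prompt. Here is a minimal sketch using the Anthropic Python SDK; the prompt wording and model name are placeholders, not our production setup.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an on-call assistant. Given a stack trace, reason step by step "
    "about the likely root cause, then give a short, plain-English summary "
    "and a suggested next action."
)

def explain_stack_trace(stack_trace: str) -> str:
    """Ask the model to walk through its reasoning before summarising the alert."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; any recent Claude model works
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Explain this alert:\n\n{stack_trace}"}],
    )
    return response.content[0].text
```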
At first, I was sceptical, thinking, “I could just read the stack trace myself”. However, I soon discovered that starting with a human-friendly summary gave me a significant head start. When I then read the raw stack trace, I would focus on cross-checking the AI’s assessment, instead of trying to piece everything together from scratch.
This turned out to be especially helpful when dealing with areas of the codebase I didn’t know well. And for new engineers, those unfamiliar areas are everywhere!
Context Is King
Unsurprisingly, the big unlock came when we started giving the LLM richer context. We built the Slackbot so it could recognise which channel it was in and load a custom context. That meant each team could describe their domain and architecture, and the LLM would generate responses tailored to that specific domain.
This was a game-changer. The AI didn’t just explain the stack trace; it started reasoning about the alert in a way that matched how we thought about the system ourselves. It saw the bigger picture (or at least appeared to). The explanations went from generic summaries to something more like: “This error is happening in the ordering process, which likely hasn’t received the confirmation message from [named] 3rd party.”
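Mechanically, per-channel context can be as simple as keying a team-authored prompt off the channel an alert lands in. Below is a rough sketch assuming the slack_bolt framework and reusing the explain_stack_trace helper from the earlier sketch; the channel IDs and context text are made up for illustration.

```python
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

# Hypothetical mapping from alert channel to a team-authored context prompt.
TEAM_CONTEXT = {
    "C0ORDERS": "We own the ordering flow. Orders wait on a 3rd-party "
                "confirmation message; missing confirmations are a common failure.",
    "C0BILLING": "We own billing and invoicing. Most alerts here relate to "
                 "payment-provider callbacks.",
}

@app.event("message")
def handle_alert(event, say):
    """Summarise an incoming alert using the posting channel's team context."""
    # In practice you would filter this to messages from your alerting bot only.
    channel = event["channel"]
    context = TEAM_CONTEXT.get(channel, "No team-specific context available.")
    summary = explain_stack_trace(f"{context}\n\n{event.get('text', '')}")
    say(text=summary, thread_ts=event["ts"])  # reply in a thread under the alert
```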
Testing Consistency
Now that we had a functioning system, we were keen to evaluate it more thoroughly - how would it fare across a variety of problems? And since LLMs are non-deterministic, how consistent would its responses be for the same problem?
So we evaluated the LLM by having our engineers validate its explanations. In most cases, we found the explanations useful. And in 9/10 cases, the LLM suggested the exact same explanation for the same problem. That level of consistency gave us more confidence in using the tool.
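Our check was manual, but the same idea is easy to script: replay one stack trace several times and see how often the answers agree. A rough sketch, again reusing the explain_stack_trace helper; the normalisation here is deliberately crude and purely illustrative.

```python
from collections import Counter

def consistency_check(stack_trace: str, runs: int = 10) -> float:
    """Re-run the same alert several times and report how often the most
    common explanation appears (1.0 means perfectly consistent)."""
    answers = [explain_stack_trace(stack_trace) for _ in range(runs)]
    # Crude normalisation: compare only the first line of each explanation.
    normalised = [(a.strip().splitlines() or [""])[0].lower() for a in answers]
    most_common_count = Counter(normalised).most_common(1)[0][1]
    return most_common_count / runs
```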
What Happened When We Deployed Our Tool
Once we had a working prototype, we rolled it out to a few teams and added a simple thumbs-up/down feedback button. Reactions were logged to a separate Slack channel so we could monitor how people were finding it.
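For the feedback loop, a small reaction handler that forwards 👍/👎 on the bot’s messages to a monitoring channel is enough. A sketch in the same slack_bolt style as above; the channel ID is a placeholder.

```python
FEEDBACK_CHANNEL = "C0FEEDBACK"  # placeholder ID for the monitoring channel

@app.event("reaction_added")
def log_feedback(event, client):
    """Forward thumbs-up/down reactions to a separate feedback channel."""
    if event["reaction"] not in ("+1", "-1"):  # Slack's names for 👍 and 👎
        return
    verdict = "👍 useful" if event["reaction"] == "+1" else "👎 not useful"
    client.chat_postMessage(
        channel=FEEDBACK_CHANNEL,
        text=(
            f"{verdict} reaction from <@{event['user']}> "
            f"on message {event['item']['ts']} in <#{event['item']['channel']}>"
        ),
    )
```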

The feedback was better than expected. Most engineers found the summaries useful. Even senior engineers, who we expected to ignore it, were giving positive feedback.
What we didn’t expect was that non-engineers started using it too! People in product and client managers began reading the summaries to get a sense of whether something was serious or not. They didn’t need to ping an engineer to ask, “Is this alert bad?” They could just read the explanation and get a rough idea.
In the last two weeks, we’ve had 8 teams across the wider Kraken organisation start using it. So far, we’ve seen positive feedback, but we’re keeping a close eye to see how this evolves.
Final Thoughts
The biggest lesson from this hackathon? You don’t need AI to do everything.
Just helping people understand alerts faster is already a big step forward. The tool isn’t perfect, but it’s helping engineers move faster and keeping non-engineers in the loop.
There’s still a lot to improve, and we’re excited to keep building on this. But even in this early form, it’s already doing what we hoped it would: making production alerts a little less painful.