
Phillip Carter, formerly of Honeycomb, and Ben Lorica talk about observability and AI—what observability means, how generative AI causes problems for observability, and how generative AI can be used as a tool to help SREs analyze telemetry data. There’s tremendous potential because AI is great at finding patterns in massive datasets, but it’s still a work in progress.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Timestamps
- 0:00: Introduction to Phillip Carter, a product manager at Salesforce. We’ll focus on observability, which he worked on at Honeycomb.
- 0:35: Let’s have the elevator definition of observability first, then we’ll go into observability in the age of AI.
- 0:44: If you google “What is observability?” you’re going to get 10 million answers. It’s an industry buzzword. There are a lot of tools in the same space.
- 1:12: At a high level, I like to think of it in two pieces. The first is that this is an acknowledgement that you have a system of some kind, and you do not have the capability to pull that system onto your local machine and inspect what is happening at a moment in time. When something gets large and complex enough, it’s impossible to keep in your head. The product I worked on at Honeycomb is actually a very sophisticated querying engine that’s tied to a lot of AWS services in a way that makes it impossible to debug on my laptop.
- 2:40: So what can I do? I can have data, called telemetry, that I can aggregate and analyze. I can aggregate trillions of data points to say that this user was going through the system in this way under these conditions. I can pull from these different dimensions and hold something constant.
- 3:20: Let’s look at how the values differ when I hold one thing constant. Then let’s hold another thing constant. That gives me an overall picture of what is happening in the real world. (A minimal sketch of this kind of query appears after the timestamps.)
- 3:37: That is the crux of observability. I’m debugging, but not by stepping through something on my local machine. I click a button, and I can see that it manifests in a database call. But there are potentially millions of users, and things go wrong somewhere else in the system. And I need to try to understand what paths lead to that, and what commonalities exist in those paths.
- 4:14: This is my very high-level definition. It’s many operations, many tasks, almost a workflow as well, and a set of tools.
- 4:32: Based on your description, observability people are sort of like security people. With AI, there are two aspects: observability problems introduced by AI, and the use of AI to help with observability. Let’s tackle each separately. Before AI, we had machine learning. Observability people had a handle on traditional machine learning. What specific challenges did generative AI introduce?
- 5:36: In some respects, the problems have been constrained to big tech. LLMs are the first time that we got truly world-class machine learning support available behind an API call. Prior to that, it was in the hands of Google and Facebook and Netflix. They helped develop a lot of this stuff. They’ve been solving problems related to what everyone else has to solve now. They’re building recommendation systems that take in many signals. For a long time, Google has had natural language answers for search queries, prior to the AI overview stuff. That stuff would be sourced from web documents. They had a box for follow-up questions. They developed this before Gemini. It’s kind of the same tech. They had to apply observability to make this stuff available at scale. Users are entering search queries, and we’re doing natural language interpretation and trying to boil things down into an answer and come up with a set of new questions. How do we know that we’re answering the question effectively, pulling from the right sources, and generating questions that seem relevant? At some level there’s a lab environment where you measure: given these inputs, we expect these outputs. Then you measure the same thing in production.
- 9:00: You sample that down and understand patterns. And you say, “We’re expecting 95% good—but we’re only measuring 93%. What’s different between production and the lab environment?” Clearly what we’ve developed doesn’t match what we’re seeing live. That’s observability in practice, and it’s the same problem everyone in the industry is now faced with. It’s new for so many people because they’ve never had access to this tech. Now they do, and they can build new things—but it’s introduced a different way of thinking about problems. (A sketch of this lab-versus-production comparison appears after the timestamps.)
- 10:23: That has cascading effects. Maybe the way our engineering teams build features has to change. We don’t know what evals are. We don’t even know how to bootstrap evals. We don’t know what a lab environment should look like. Maybe what we’re using for observability isn’t measuring the things that should be measured. A lot of people view observability as a kind of system monitoring. That is a fundamentally different way of approaching production problems than thinking: I have a part of an app that receives signals from another part of the app. I have a language model. I’m producing an output. That could be a single shot or a chain or even an agent. At the end, there are signals and outputs I need to capture, and I need to systematically judge whether those outputs are doing the job they should be doing with respect to the inputs they received.
- 12:32: That allows me to disambiguate whether the language model is not good enough: Is there a problem with the system prompt? Are we not passing the right signals? Are we passing too many signals, or too few?
- 12:59: This is a problem for observability tools. A lot of them are optimized for monitoring, not for stacking signals from inputs and outputs.
- 14:00: So people move to an AI observability tool, but those tools tend not to integrate well. And people say, “We want customers to have a good experience, and they’re not.” That might be because of database calls or a language model feature or both. As an engineer, you have to switch context to investigate these things, probably with different tools. It’s hard. And it’s early days.
- 14:52: Observability has gotten fairly mature for system monitoring, but it’s extremely immature for AI observability use cases. The Googles and Facebooks were able to get away with this because they have internal-only tools that they don’t have to sell to a heterogeneous market. There are a lot of problems to solve for the observability market.
- 15:38: I believe that evals are core IP for a lot of companies. To do evals well, you have to treat them as an engineering discipline. You need datasets, samples, a workflow, everything that might separate your system from a competitor’s. An eval could use AI to judge AI, but it could also be a dual-track strategy with human scrutiny, or a whole practice within your organization. That’s just evals. Now you’re injecting observability, which is even more complicated. What’s your sense of people’s sophistication around evals?
- 17:04: Not terribly high. Your average ML engineer is familiar with the concept of evals. Your average SRE is looking at production data to solve problems with systems. They’re often solving similar problems. The main difference is that the ML engineer is using workflows that are very disconnected from production. They don’t have a good sense of how the hypotheses they’re teasing out actually play out in the real world.
- 17:59: They might have different values. ML engineers may prioritize peak performance over reliability.
- 18:10: The very definition of reliability or performance may be poorly understood between multiple parties. They get impacted by systems that they don’t understand.
- 22:10: Engineering organizations on the machine learning side and the software engineering side are often not talking very much. When they do, they’re often working on the same data. The way you capture data about system performance is the same way you capture data about what signals you send to a model. Very few people have connected those dots. And that’s where the opportunities lie.
- 22:50: There’s such a richness in connecting production analytics with model behavior. This is a big issue for our industry to overcome. If you don’t do this, it’s much more difficult to rein in behavior in reality.
- 23:42: There’s a whole new family of metrics: things like time to first token, intertoken latency, and tokens per second. There’s also the buzzword of the year, agents, which introduce a new set of challenges in terms of evaluation and observability. You might have an agent that’s performing a multistep task. Now you have the execution trajectory, the tools it used, and the data it used. (A sketch of computing those streaming metrics appears after the timestamps.)
- 24:54: It introduces another flavor of the problem. Each call can be valid on its own, but one thing you observe when working on agents is that they’re not doing so well at the single-call level; when you string the calls together, though, they arrive at the right answer. That might not be optimal. I might want to optimize the agent for fewer steps.
- 25:40: It’s a fun way of dealing with this problem. When we built the Honeycomb MCP server, one of the subproblems was that Claude wasn’t very good at querying Honeycomb. It could create a valid query, but was it a useful query? If we let it spin for 20 turns, all 20 queries together painted enough of a picture to be useful.
- 27:01: That forces an interesting question: How valuable is it to optimize the number of calls? If the extra calls don’t cost a tremendous amount of money, and the agent is still faster than a human, is it worth it? It’s a challenge from an evaluation standpoint: How do I boil that down to a number? I didn’t have an amazing way of measuring that. That’s where you start to get into an agent loop that’s constantly building up context. How do I know that I’m building up context in a way that’s helpful to my goals? (A skeleton of such a loop appears after the timestamps.)
- 29:02: The fact that you’re paying attention and logging these things gives you the opportunity to train the agent. Let’s do the other side: AI for observability. In the security world, they have analysts who do investigations. They’re starting to get access to AI tools. Is something similar happening in the SRE world?
- 29:47: Absolutely. There are a couple of different categories involved here. There are expert SREs out there who are better at analyzing things than agents. They don’t need the AI to do their job. However, sometimes they’re tasked with problems that aren’t hard but are time consuming. A lot of these folks have a sense of whether something really needs their attention or is just going to take time. At that point, they wish they could just hand the task to an agent and do something of higher value. That’s an important use case. Some startups are starting to do this, though the products aren’t very good yet.
- 31:38: This agent will have to go in cold: Kubernetes, Amazon, etc. It has to learn so much context.
- 31:51: That’s where these things struggle. It’s not the investigative loop; it’s gathering enough context. The winning model will still be human SRE-focused. In the future we might advance a little further, but it’s not good enough yet.
- 32:41: So you would describe these as early solutions?
- 32:49: Very early. There are other use cases that are interesting. A lot of organizations are adopting service ownership. Every developer goes on call and must understand some operational characteristics. But most of these developers aren’t observability experts. In practice, they do the minimal work necessary so they can focus on the code. They may not have enough guidance or good practices. A lot of these AI-assisted tools can help these folks. You can imagine a world where you get an alert, and the system comes up with a dozen different ways you might investigate. Each one gets its own agent. You have some rules for how long they investigate. A conclusion might be garbage, or it might be inconclusive. You might end up with five areas that merit further investigation. There might be one where they’re fairly confident that there’s a problem in the code.
- 35:22: What’s preventing these tools from getting better?
- 35:34: There are many things, but the foundation models have work to do. Investigations are really context-gathering operations. We have long context windows—2 million tokens—but that’s nothing for log files. And there’s some breakdown point where the models accept more tokens but just lose the plot. Investigations aren’t data you can process linearly; there are often circuitous pathways. You can find a way to serialize that, but it ends up being large and long, and it’s hard for a model to receive all of that information, keep track of the plot, and know where to pull data from under what circumstances. We saw this breakdown all the time at Honeycomb when we were building investigative agents. That’s a fundamental limitation of these language models: They aren’t coherent enough with large contexts. That’s a large unsolved problem right now.
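
The “hold one dimension constant” workflow described around 2:40 to 3:20 can be illustrated with a small sketch. This isn’t Honeycomb’s query engine; it’s a hypothetical batch of telemetry events in a pandas DataFrame, with made-up column names, where we fix one attribute (region) and compare a metric across another (endpoint), then flip which dimension is held constant.

```python
# A minimal sketch of "hold one dimension constant" over telemetry events.
# Hypothetical data and column names; real telemetry would come from a
# tracing backend, not an in-memory DataFrame.
import pandas as pd

events = pd.DataFrame(
    [
        ("t1", "/checkout", "us-east-1", 200, 120),
        ("t2", "/checkout", "us-east-1", 500, 2300),
        ("t3", "/checkout", "eu-west-1", 200, 140),
        ("t4", "/search",   "us-east-1", 200, 90),
        ("t5", "/search",   "eu-west-1", 200, 95),
    ],
    columns=["trace_id", "endpoint", "region", "status", "duration_ms"],
)

# Hold region constant and see how latency differs by endpoint.
us_east = events[events["region"] == "us-east-1"]
print(us_east.groupby("endpoint")["duration_ms"].describe())

# Now hold endpoint constant and compare error rates across regions.
checkout = events[events["endpoint"] == "/checkout"]
print(checkout.groupby("region")["status"].apply(lambda s: (s >= 500).mean()))
```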
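
The eval workflow discussed from 9:00 through 12:59, scoring production outputs against the inputs that produced them and comparing the pass rate with a lab baseline, might look roughly like this. Everything here is hypothetical: the Interaction shape, the judge heuristic (which in practice could be an LLM-as-judge call, assertions, or human review), and the sample data.

```python
# A rough sketch of comparing a lab eval baseline against sampled production
# traffic. The judge is a placeholder; real systems might call an LLM judge,
# run assertions, or route samples to human reviewers.
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt: str          # what the user or upstream system sent in
    context: list[str]   # signals passed to the model (retrieved docs, etc.)
    output: str          # what the model produced

def judge(sample: Interaction) -> bool:
    """Is the output doing its job with respect to the input it received?
    Placeholder logic: non-empty and grounded in at least one context snippet."""
    grounded = any(s.lower() in sample.output.lower() for s in sample.context)
    return bool(sample.output.strip()) and grounded

def pass_rate(samples: list[Interaction]) -> float:
    return sum(judge(s) for s in samples) / len(samples)

production_sample = [  # in reality, sampled from logged production traffic
    Interaction(
        prompt="Why did checkout latency spike?",
        context=["deploy 42 increased p99 latency on /checkout"],
        output="The spike correlates with deploy 42 increased p99 latency on /checkout.",
    ),
    Interaction(
        prompt="Summarize last night's errors.",
        context=["database connection pool exhausted at 02:14"],
        output="Everything looks fine.",
    ),
]

lab_baseline = 0.95  # what the offline eval run promised
print(f"lab={lab_baseline:.0%} production={pass_rate(production_sample):.0%}")
# A gap (say, 95% in the lab vs. 93% live) is the cue to ask what differs:
# the inputs, the signals being passed, or the prompt itself.
```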
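
The metrics named at 23:42 (time to first token, intertoken latency, tokens per second) fall straight out of per-chunk timestamps recorded on a streaming response. A minimal sketch with made-up timings:

```python
# Deriving time to first token, intertoken latency, and tokens per second
# from per-token arrival timestamps on a streaming LLM response.
def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "time_to_first_token_s": ttft,
        "mean_intertoken_latency_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "tokens_per_second": len(token_times) / total if total > 0 else 0.0,
    }

# Example: request sent at t=0.0 s, ten tokens arriving over about 1.2 seconds.
print(stream_metrics(0.0, [0.35, 0.40, 0.46, 0.51, 0.58, 0.66, 0.75, 0.85, 0.97, 1.20]))
```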
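
The multiturn querying behavior described at 25:40 and 27:01, where individually mediocre queries add up to a useful picture, is essentially an agent loop that folds each result back into its context. A hypothetical skeleton, with propose_next_query and run_query standing in for real model and tool calls:

```python
# Skeleton of an agent loop that accumulates context across turns.
# propose_next_query and run_query are hypothetical stand-ins for the model
# and for a query API (e.g., something Honeycomb-like); nothing here is a
# real product interface.
from typing import Callable, Optional

def investigate(
    goal: str,
    propose_next_query: Callable[[str, list[str]], Optional[str]],
    run_query: Callable[[str], str],
    max_turns: int = 20,
) -> list[str]:
    context: list[str] = []
    for turn in range(max_turns):
        query = propose_next_query(goal, context)  # model decides what to ask next
        if query is None:                          # model thinks it has enough context
            break
        result = run_query(query)
        # Any single result may be only marginally useful; the value is in the
        # accumulated context that later turns can build on.
        context.append(f"turn {turn}: {query} -> {result}")
    return context

# The evaluation question from the episode: per-call validity isn't enough.
# You also care how many turns the loop took, what it cost, and whether the
# accumulated context actually served the goal.
```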