What is observability?

The TL;DR

Observability is how businesses know what’s going on with their different systems and operations.

Businesses are made up of different software systems, from apps and infrastructure to data pipelines
Observability helps teams monitor when things go wrong and profile how things change over time; basically, knowing what’s going on under the hood
There are 4 major types of observability, from infrastructure to your whole business, and each carries different tools and responsible teams

When teams can clearly see what’s happening in their systems – whether it’s an app, server, or part of the business – then they can make better decisions and ultimately get better outcomes.

Observability basics: what the hell is going on?

Observability is a complicated word for a pretty simple thing: companies want to know what’s going on with their systems and their business. Startups, and especially big companies, have a lot happening, from software to operations to why my campaigns aren’t converting this week, and the teams responsible for those things want some visibility into them.

More fundamentally, businesses today are built on systems – software systems, sure, but also people and operational systems. In the same sense that your application is running on infrastructure in the cloud all by itself, and your engineering team wants to make sure it’s working well, your ad campaigns are running in Google Ads (formerly AdWords), all by themselves, and your marketing team wants to make sure they, too, are working well.

In practice, observability usually manifests itself as a piece of software – one that you write yourself, or buy from someone else – that monitors the signals of your system and spits some things out. The consumers of that observability are usually interested in alerts when things go wrong, but also profiling and understanding their system better so they can make improvements. Here’s one example:

Who: your backend engineering team
What: how fast the database is running queries
How: query plans and times
Using: built-in Postgres functions or Datadog

Here’s another example:

Who: your data team
What: how clean a new data pipeline is
How: missing and null values
Using: custom queries or Datafold

And yet another example:

Who: your product team
What: how larger customers are using a new feature
How: feature adoption and user growth
Using: product analytics tools or Rupert

Examples abound. For a more conceptual backing though, you can pretty clearly split observability into 4 distinct categories. For each we’ll look at what teams want to monitor, whose job it is, and what tools they might use to do it.

Application observability

Let’s start with the most user-facing type of observability, that of the application variety. For a quick refresher, most applications today are at least in part delivered over the internet via the cloud. Twitter is an application, Google Sheets is an application, Salesforce is an application, and so on and so forth.

Engineers want their apps to be responsive, fast, and error/bug free – because you and I want to use applications that are responsive, fast, and error/bug free. Application observability is how developers monitor those very applications to measure and improve these things. Some common metrics that a developer might be tracking for their app:

How long it takes for different screens to load
Bugs and errors, and how common they are
How long API requests are taking
Most common API requests
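To make those metrics a bit more concrete, here's a minimal sketch of how an app might record them itself. Everything here is hypothetical (the `observed` decorator, the endpoint names, the in-memory counters); a real setup would ship these numbers to a tool like Datadog or Prometheus rather than keep them in a Python dictionary.

```python
import time
from collections import Counter, defaultdict
from functools import wraps

# Hypothetical in-memory stores; a real app would send these to a metrics backend.
request_counts = Counter()
request_durations = defaultdict(list)
error_counts = Counter()


def observed(endpoint):
    """Record call count, duration, and errors for a request handler."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                error_counts[endpoint] += 1
                raise
            finally:
                # Runs on both success and failure, so every request is counted.
                request_counts[endpoint] += 1
                request_durations[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorator


@observed("GET /profile")
def get_profile(user_id):
    # Stand-in for real work (database lookups, rendering, etc.)
    return {"id": user_id}


@observed("POST /checkout")
def checkout(cart):
    if not cart:
        raise ValueError("empty cart")
    return "ok"


def most_common_requests(n=3):
    """The n most frequently called endpoints, with their call counts."""
    return request_counts.most_common(n)
```

After a few calls to `get_profile` and `checkout`, `request_durations` holds how long each request took, `error_counts` holds how often each endpoint failed, and `most_common_requests()` ranks endpoints by traffic.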

The big dog (no pun intended) in this space is Datadog, but they’re not the only useful tool for app observability. Alternatives like New Relic exist, plus many companies will “roll their own” system (read: build it themselves) from open source tools like Prometheus and Grafana. Here’s what a sample Datadog dashboard might look like:

It’s also worth mentioning session recording tools like Fullstory and Hotjar, although these start to encroach more into the analytics universe and outside the observability one.

Infrastructure observability

Closely related to app observability (which is why the “how” is basically identical – thought you caught me slippin?) is infrastructure observability. Apps run on infrastructure, and monitoring how that infrastructure is performing is perhaps (probably?) the biggest section of this “market,” so to speak. If your servers aren’t efficient, that means your app isn’t going to be either.

At a basic level, every piece of software a company makes and runs is going to run on a server, which is a giant computer in the cloud. Developers will want to monitor:

How much CPU the server is using
How much RAM the server is using
How much disk the server is using
Peaks and valleys for this stuff

Kind of like, how much gas your car is using and what RPM it’s revving up to in which gears.
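As a rough illustration of the server metrics above, here's a small sketch using only Python's standard library. The CPU readings are made up for the example, and `disk_used_percent` and `summarize` are hypothetical helper names; real monitoring agents collect these numbers continuously and in far more detail.

```python
import shutil
from statistics import mean


def disk_used_percent(path="/"):
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def summarize(samples):
    """Peak, valley, and average for a series of readings (e.g. CPU % per minute)."""
    return {"peak": max(samples), "valley": min(samples), "average": mean(samples)}


# Hypothetical CPU readings (percent), one per minute
cpu_samples = [12, 15, 11, 64, 93, 58, 14]
cpu_summary = summarize(cpu_samples)
```

Here `cpu_summary["peak"]` is 93 and `cpu_summary["valley"]` is 11, which is the kind of "peaks and valleys" view a developer would look at to decide whether a server is oversized or about to fall over.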

But infrastructure observability is more than just server metrics. It’s also about measuring the higher level components that you have running on those servers, like your database, or if you’re really fancy, microservices. For a database, you might want to track:

How long queries take
How much compute power your queries need
How much storage your data is using
How quickly your database vacuums up old data

So for infrastructure observability, the kinds of tools that a developer would use are highly dependent on the use case. One thing that almost everyone is going to use regardless is Datadog, but you might also use built-in analytics for your database if you’re using a provider like PlanetScale. Plus, there’s an entire universe of log management tools (what’s a log? Read this).

Data observability

It’s not only software engineers who get invited to the observability party. With the degree to which data is integrated into modern companies and how they operate, data teams now also need to monitor how their systems are performing. Data observability!

A big part of the data team’s job is building data sets via ETL from various source systems. They have pipelines, or “jobs,” that run on a regular basis, pull data from source systems, mess around with it until it’s right, and then deposit it somewhere nice like a data warehouse. With data teams, monitoring is usually about checking when this data has gone bad. Bad could mean:

A day of website click data is missing
Online orders are mistakenly marked as 10% higher cost than they should be
New signups aren’t getting added to the users table
The metrics table thinks the user base has shrunk by 50% overnight

There’s a saying among data teams that “incorrect data is worse than no data at all,” because wrong decisions get made on wrong data, and no decisions get made on no data. So it’s critical for data teams to be able to identify when data is incorrect and be able to fix it quickly. That’s what data observability is all about¹.

The difficult thing about this kind of observability is that it’s highly subjective relative to app or infrastructure observability. When data is missing? That’s simple enough. But how would someone know if data is wrong? What does right even mean? There are heuristics, like alerting if a metric or value is more than 2x what it was yesterday, but those only catch obvious problems. So the data team’s job is hard, and will always end up involving a lot of manual work and sanity checks (plus, collaboration with teams who know what the data means).
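Here's a sketch of what a few of those checks might look like in plain Python, assuming the data has already been pulled into a list of rows. The column names and sample rows are invented for illustration, and the functions are hypothetical helpers, not part of any particular tool.

```python
from datetime import date, timedelta


def missing_days(rows, start, end):
    """Dates in [start, end] that have no rows at all."""
    seen = {row["day"] for row in rows}
    total_days = (end - start).days + 1
    candidates = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in candidates if day not in seen]


def null_count(rows, column):
    """Number of rows where `column` is absent or None."""
    return sum(1 for row in rows if row.get(column) is None)


def jumped(today, yesterday, factor=2.0):
    """The heuristic from the text: flag a value more than `factor`x yesterday's."""
    return yesterday > 0 and today > factor * yesterday


# Hypothetical click data: Jan 2 is missing, and one row has no user_id
clicks = [
    {"day": date(2024, 1, 1), "user_id": 1, "clicks": 40},
    {"day": date(2024, 1, 3), "user_id": None, "clicks": 95},
]
```

Note what this doesn't catch: `jumped` only looks for increases, so a metric quietly dropping by half would sail right through. That's the subjectivity problem in miniature, since someone has to decide which rules are worth writing.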

There’s a growing cadre of startups working on data observability. Tools like Datafold allow you to create automated data tests, and compare old and new datasets to make sure everything checks out. Here’s an example comparison between an old table and an updated one in Datafold:

Monte Carlo is another tool in this space that’s focused on larger companies.

Business observability

There’s an emerging category of observability that builds on the previous 3 types, and it’s called business observability. It doesn’t relate as much to a software system, but it’s a system nonetheless: how your business runs, how initiatives are performing, where revenue is trending, etc. Business observability is basically about monitoring:

How your metrics are trending – unexpected swings, anomalies, etc.
When any important events occur – new high profile customers, feature usage down, etc.

A few simple examples:

Product usage at your largest customer is tanking, and your CS team needs to reach out and figure out what’s happening
Your marketing team made an update to your marketing site homepage, and conversion rates of visitors clicking “sign up” have tanked since last week because the button is harder to find
Your product team released a small new feature last week, and a shockingly high percentage of the customer base is actively using it
An unknown bug at checkout has made revenue plummet to 0 this month

Business observability is about monitoring and staying on top of these kinds of changes.

If you’re thinking “hey, this sounds like data observability,” you’re right! There’s significant overlap between the two, much like the overlap between app and infrastructure observability. But there’s a major difference, and that difference is context, specifically context on what the data actually means. There’s an objectivity to most data observability: a value is missing, some data is wrong, a job didn’t run. But for business observability, each unique business situation and function at the company implies a different set of things to look out for, and what an anomaly in the data might mean.

🔍 Deeper Look

You might have heard of something in data science called “anomaly detection” – it’s the process of looking at a dataset and finding which data points don’t seem to make much sense. If your sales hover around $1M per month, and then one month the data shows $5M, that’s an anomaly, and one you probably want to get alerted about. Data teams will use ML models to automatically detect anomalies in all kinds of company data.
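For a feel of how this works, here's a deliberately simple sketch that uses a z-score (how many standard deviations a value sits from the historical average) instead of an ML model. The sales figures and the threshold of 3 are assumptions picked for the example, not a recommendation.

```python
from statistics import mean, stdev


def is_anomaly(history, new_value, threshold=3.0):
    """True if new_value is more than `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any different value counts as an anomaly.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


# Hypothetical monthly sales in dollars, hovering around $1M
monthly_sales = [980_000, 1_020_000, 1_010_000, 990_000, 1_000_000, 1_030_000]
```

With this history, `is_anomaly(monthly_sales, 5_000_000)` returns `True`, while an ordinary month like `is_anomaly(monthly_sales, 1_015_000)` returns `False`. Real models also account for things like seasonality and trends, which a plain z-score ignores.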

The default tool that teams have been using for decades to monitor their business is dashboards. Functional teams will have a marketing dashboard, a sales dashboard, etc. that they check regularly to make sure metrics make sense and are trending in the right direction. Companies can sometimes build these from scratch, but more often will use a tool like Looker or Power BI.

The issue with dashboards (and I don’t want to give them too bad of a rap) is that they only give you a piece of the business observability puzzle. You have to check them regularly, they’re difficult to change, and they’re reactive, not proactive. Tools like Rupert take the next step and help teams proactively create alerts and take actions on them. Basically, instead of having to go find and check a dashboard, you can get the right data pushed to you when something changes.
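To show the logic behind a proactive alert, here's a hedged sketch of one common kind of rule: fire when today's number falls more than some percentage below a trailing moving average. This is plain Python for illustration only and is not how Rupert or any other tool is implemented; the function name, the defaults (15% and 7 days), and the sample data are all assumptions.

```python
from statistics import mean


def should_alert(history, today, window=7, max_drop=0.15):
    """True if `today` is more than `max_drop` below the average of the last `window` values."""
    if len(history) < window:
        return False  # not enough data to form a baseline
    baseline = mean(history[-window:])
    if baseline <= 0:
        return False  # a relative drop is undefined against a zero baseline
    return (baseline - today) / baseline > max_drop


# Hypothetical daily conversions for a campaign over the past week (average: 100)
last_week = [100, 104, 98, 101, 97, 103, 97]
```

Against this baseline, `should_alert(last_week, 80)` returns `True` (a 20% drop), while `should_alert(last_week, 90)` returns `False` (a 10% drop). The point of running a check like this on a schedule is that nobody has to remember to open a dashboard.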

Here’s an example of creating an alert in Rupert to notify your marketing team if there’s a >15% drop in campaign performance vs a 7-day moving average:

You can send these alerts to email or Slack, plus add action buttons to link out to whatever you’d want someone to do to “remedy” the issue. You also get a dashboard tracking all alerts that were sent out and who engaged with them:

Rupert is this post’s sponsor: you should check them out if your team is looking to send useful alerts like this.

¹ Plus the unlikely scenario that a company has an actual app consuming a data product (like real time personalization or something). But this is very rare in practice.