In part one of this two-part series, I’ll be discussing my views on the fundamentals and key elements of Observability, rather than giving a technical deep dive. There are many great resources out there that already take a closer look at the key concepts. First off, let’s look at what Observability is.
What is Observability?
The CNCF defines Observability as “the capability to continuously generate and discover actionable insights based on signals from the system under observation”.
Essentially, the goal of Observability is to detect and fix problems as fast as possible. In the world of monolithic applications and older architectures, monitoring was often enough to accomplish this goal. But with the world moving to distributed architectures and microservices, it is not always obvious why a problem has occurred merely by monitoring an isolated metric that has spiked.
This is where observability becomes a necessity. Since observability is essentially a measure of how well the internal state of a system can be understood from its signals, it stands to reason that all the right data is needed! In a distributed system, the right data is typically considered to be logs, metrics and application traces, often referred to as the “three pillars of observability”.
While these are the generally agreed-upon key indicators, in my view it is also important to include user experience data, uptime data and synthetic data to provide an end-to-end observable system.
The measure of how effectively observability has been implemented for a system is the analyst’s ability to draw the relevant insights from this data and to detect and fix root-cause events as quickly and efficiently as possible.
A number of aspects can determine the success of your observability efforts, some of which bear more weight than others. There are also tons of observability tools and solutions to choose from. What is fairly typical amongst customers that LSD engages with is that they have numerous tools in their stable but have still not reached their observability goals, and therefore not the desired state.
Let’s explore this a bit more by looking at what the desired state may look like.
What is the desired state?
This is best explained with an example: a particular service has a spike in latency, which is likely picked up through an alert. How does an analyst go from there to determining the root cause of the latency spike?
Firstly, the analyst may want to trace the transaction causing the latency spike. For this, they would analyse the full distributed trace of the high-latency events. Having identified the transaction, the analyst still does not know the root cause. Some clues may lie in the metrics of the host or container it ran in, so that may be the next course of action. The root cause is most often found in the logs, so ultimately the analyst would want to analyse the logs for the specific transaction in question.
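This triage flow can be sketched in plain Python. This is a hedged illustration over hypothetical in-memory data; the field names (`trace_id`, `container_id`, `duration_ms`) are my own assumptions for the example, not any particular vendor’s schema, and a real platform would query a backend rather than filter lists.

```python
# Hypothetical signal stores; a real platform would query a backend instead.
traces = [
    {"trace_id": "t-100", "service": "checkout", "container_id": "c-7", "duration_ms": 4200},
    {"trace_id": "t-101", "service": "checkout", "container_id": "c-7", "duration_ms": 180},
]
metrics = [
    {"container_id": "c-7", "cpu_pct": 97, "mem_pct": 88},
]
logs = [
    {"trace_id": "t-100", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "t-101", "level": "INFO", "message": "order placed"},
]

# Step 1: the alert points the analyst at the slowest transaction.
slow = max(traces, key=lambda t: t["duration_ms"])

# Step 2: pivot to the metrics of the container the transaction ran in.
host_metrics = [m for m in metrics if m["container_id"] == slow["container_id"]]

# Step 3: pivot to the logs for that specific transaction, where the
# root cause is most often found.
related_logs = [l for l in logs if l["trace_id"] == slow["trace_id"]]

print(slow["trace_id"], host_metrics, related_logs)
```

Because every signal here carries the same identifiers, each pivot is a simple equality filter. Without those shared keys, the analyst would be left re-querying each tool by timestamp and guesswork.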
The above scenario is fairly simple; however, achieving it in the most efficient way relies on the ability to correlate optimally between logs, metrics and traces.
Proper correlation means being able to jump directly from a transaction in a trace to the logs for that specific transaction, or directly to the metrics of the container it ran in. To me, the most effective way to achieve this is for all logs, metrics and traces to exist in the same observability platform and to share the same schema.
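To illustrate why a shared schema matters, here is a minimal sketch. The dotted field names (`trace.id`, `container.id`) are loosely modelled on common naming conventions and are assumptions for the example only: when all signals use the same key names, the jump between them is a one-line lookup; when a tool names the field differently, correlation needs a translation step first.

```python
# One transaction seen across signals that share a schema: the log carries
# the same "trace.id" as the span, and the span records the "container.id"
# that the metric is keyed by.
span   = {"trace.id": "abc123", "container.id": "c-7", "span.name": "POST /pay"}
log    = {"trace.id": "abc123", "log.level": "WARN", "message": "retrying upstream"}
metric = {"container.id": "c-7", "cpu_pct": 91}

def pivot(records, key, value):
    """Jump between signals via a shared field name."""
    return [r for r in records if r.get(key) == value]

# With a shared schema, both jumps are trivial equality filters.
logs_for_span    = pivot([log],    "trace.id",     span["trace.id"])
metrics_for_span = pivot([metric], "container.id", span["container.id"])

# A log from a tool with its own naming ("traceId") needs mapping first.
other_tool_log = {"traceId": "abc123", "severity": "WARN"}
normalised = {"trace.id": other_tool_log["traceId"], **other_tool_log}
```

The design point is that the pivot function never changes; only records that already conform to the shared schema can participate in it, which is why schema mismatches between tools make correlation expensive.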
In the digital age, customers want a flawless experience when interacting with businesses. Take a bank, for example: there is no room for error when a service directly touches a customer’s finances. So when an online banking service goes down for three days (it happens), the bank will lose customers, or at the very least suffer reputational damage.
The ultimate goal is to detect and fix root-cause events as quickly and efficiently as possible, and this is where the approach of using multiple disconnected tools falls short.
In part two of this series, I will discuss the most critical factors which contribute to a good Observability solution that will help businesses reach the goals set out above.
Learn more about Observability by reading this blog post by Mark Billett, an Observability engineer at LSD.
If you would like to know more about Observability or a Managed Observability Platform, check out our page.