In Part 1, I introduced Observability and what we are looking to achieve with its implementation. In this post, I will discuss the most critical aspects of a good Observability solution that together will help businesses reach the Observability goals I described in Part 1.
What makes up a good Observability solution?
Scale
So far, I have discussed that the solution needs support for logs, metrics, traces and other data types. This is very important, but equally important is the ability to process this data and deliver results and insights in real time. The scalability of the platform is therefore also of crucial importance.
Integration
Data is often collected by one or more agents. A good solution typically provides many out-of-the-box integrations with data sources, as well as the ability to ingest custom sources. The ability to manage a fleet of agents centrally can also significantly reduce the operational burden.
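As a rough sketch of what ingesting a custom source can involve, the snippet below tails an application log file and forwards structured events to an ingest endpoint. The endpoint URL and the log line layout are hypothetical; for supported sources a real agent or out-of-the-box integration would handle all of this for you.

```python
import json
import time
import urllib.request

# Hypothetical ingest endpoint of the observability platform; substitute your own.
INGEST_URL = "https://observability.example.com/ingest/custom-app-logs"

def parse_line(line: str) -> dict:
    """Turn a raw application log line into a structured event.
    Assumes a simple 'TIMESTAMP LEVEL MESSAGE' layout, purely for illustration."""
    timestamp, level, message = line.split(" ", 2)
    return {"@timestamp": timestamp, "log.level": level, "message": message.strip()}

def ship(events: list[dict]) -> None:
    """Send a batch of structured events to the platform."""
    body = json.dumps(events).encode("utf-8")
    req = urllib.request.Request(
        INGEST_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def tail_and_ship(path: str, batch_size: int = 50) -> None:
    """Follow a log file and forward new lines as structured events."""
    batch = []
    with open(path) as handle:
        handle.seek(0, 2)  # start at the end of the file, like 'tail -f'
        while True:
            line = handle.readline()
            if not line:
                time.sleep(1)
                continue
            batch.append(parse_line(line))
            if len(batch) >= batch_size:
                ship(batch)
                batch = []
```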
Correlation
I have mentioned the ability to effectively correlate between data sources as being important. There are different ways of achieving this, but I find the most effective to be a common schema across sources.
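To illustrate why a common schema helps, the sketch below normalises events from two hypothetical sources into shared field names (loosely modelled on the Elastic Common Schema; the source-specific field names are made up). Once both events use the same field for the client address, correlating them becomes a trivial join.

```python
# Field mappings from two source-specific formats into a shared schema.
# Target names are loosely modelled on the Elastic Common Schema (ECS);
# the source field names are hypothetical examples.
NGINX_MAPPING = {"clientip": "source.ip", "status": "http.response.status_code", "ts": "@timestamp"}
FIREWALL_MAPPING = {"src_ip": "source.ip", "action": "event.action", "time": "@timestamp"}

def normalise(raw_event: dict, mapping: dict) -> dict:
    """Rename source-specific fields into the common schema."""
    return {mapping.get(key, key): value for key, value in raw_event.items()}

web = normalise({"clientip": "10.0.0.7", "status": 502, "ts": "2024-05-01T10:03:00Z"}, NGINX_MAPPING)
fw = normalise({"src_ip": "10.0.0.7", "action": "deny", "time": "2024-05-01T10:02:58Z"}, FIREWALL_MAPPING)

# Because both events now carry 'source.ip', correlating them is a simple join/filter.
if web["source.ip"] == fw["source.ip"]:
    print("Same client seen by both the web server and the firewall")
```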
Visualisation and navigation
A key component of an observability solution is the ability to effectively visualise data and provide navigational tools so that users can intuitively explore and analyse it to determine the root cause of problems. This is the user interface that facilitates correlation and is often driven through dashboards, which either allow the analyst to easily correlate between sources or visualise data from different sources in the same dashboard. The ability to create custom dashboards also adds a lot of value to a solution: different audiences have different requirements on what they want to view, and customisation allows dashboards to be built around what is directly important to each audience.
Alerting
An effective and intuitive alerting framework is the next feature that adds a lot of value to a solution. In its simplest form, alerting rules are configured with static thresholds, and an email is sent when a threshold is exceeded. Alerting can, however, be the bane of many operational teams, with alert fatigue being as big a problem as new conditions arising for which no rules have been defined. Furthermore, alerts should trigger a managed response, not just an email. This is often achieved through integrations with existing incident management systems. Properly unpacking alerting and how to solve problems such as alert fatigue is beyond the scope of this post, but a good solution will cater for this and provide methods for integration with existing incident management systems.
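As a minimal sketch of the "simplest form" described above, the snippet below evaluates a static threshold and, instead of only emailing, posts an incident to a hypothetical incident-management webhook. The URL and payload shape are illustrative, not any specific product's API.

```python
import json
import urllib.request

# Hypothetical webhook of an incident management system; replace with whatever
# integration your platform and incident tooling actually support.
INCIDENT_WEBHOOK = "https://incidents.example.com/api/v1/events"

def evaluate_threshold_rule(metric_name: str, value: float, threshold: float) -> None:
    """Simplest possible alerting rule: fire when a static threshold is exceeded."""
    if value <= threshold:
        return
    incident = {
        "summary": f"{metric_name} breached threshold",
        "details": {"observed": value, "threshold": threshold},
        "severity": "warning",
    }
    req = urllib.request.Request(
        INCIDENT_WEBHOOK,
        data=json.dumps(incident).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # triggers a managed response rather than just an email

evaluate_threshold_rule("cpu.utilisation.pct", value=93.0, threshold=85.0)
```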
Machine Learning
Machine Learning is also playing an increasingly vital role in Observability, so much so that a good solution should incorporate some measure of it. It also goes a long way towards solving some of the common alerting problems described above. Anomaly detection based on machine learning learns, over time, what normal conditions look like in time series data. For example, it will learn that month-end cycles include many more transactions than is typical, and will therefore treat an influx of certain events over that period as normal, preventing alerts from being generated; static rules cannot achieve this. Similarly, because it monitors for deviations from a learned baseline, it can detect and alert on rare or anomalous events in log files without a static rule having been defined for them.
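A toy illustration of the idea, not how any particular product implements it: learn a per-day-of-month baseline from historical transaction counts, then only flag values that deviate strongly from what is normal for that day, so a month-end spike is not treated as an anomaly.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_baseline(history: list[tuple[int, float]]) -> dict:
    """Learn the typical transaction count per day of month.
    history is a list of (day_of_month, transaction_count) pairs."""
    by_day = defaultdict(list)
    for day, count in history:
        by_day[day].append(count)
    return {day: (mean(vals), stdev(vals)) for day, vals in by_day.items() if len(vals) > 1}

def is_anomalous(baseline: dict, day: int, count: float, sigmas: float = 3.0) -> bool:
    """Flag a value only if it deviates strongly from what is normal *for that day*."""
    if day not in baseline:
        return False
    avg, sd = baseline[day]
    return abs(count - avg) > sigmas * max(sd, 1.0)

history = [(31, 12_000), (31, 11_500), (31, 12_400), (15, 3_100), (15, 2_900), (15, 3_000)]
baseline = learn_baseline(history)
print(is_anomalous(baseline, day=31, count=12_200))  # False: month-end volume is normal
print(is_anomalous(baseline, day=15, count=12_200))  # True: the same volume mid-month is not
```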
Distributed architectures
I previously mentioned the critical importance of observability in new distributed architectures and services. Moving workloads to the cloud has further complicated architectures and introduced many more moving parts, deployed across hybrid environments spanning on-premises, cloud and even multiple clouds. There are a number of factors to consider in this regard which determine what the right tool is. Firstly, the capability to analyse data from all environments, and correlate across that data regardless of environment, is important. This may entail a single platform hosted either in the cloud or on-premises, or multiple platforms hosted in different environments with the ability to seamlessly link the clusters so that navigating data across environments is transparent to users. There are multiple factors to consider when architecting an observability solution. Deep diving into these is not in the scope of this article, but it is vital that the deployment models offered by a solution optimally meet all of the business's requirements. This is typically achieved by solutions that are flexible, offer different deployment options, and can meet requirements both today and in the future.
Solution management and support
When considering deployment models, it is also important to evaluate the operational effort to manage your observability solution. In other words, is it properly orchestrated according to modern architectural standards? How easy is it to scale or upgrade? How easy is it to support? If any of these factors fall outside of the capabilities of the team supporting and maintaining the Observability solution, consider a managed service approach and let Observability experts take care of it.
Security
Security is critical, with modern standards requiring encryption, role-based access control (RBAC), authentication against the identity provider of choice (such as Active Directory), and so on. Additionally, there are sometimes requirements to secure data down to the field or attribute level. A good solution will cater for enterprise-level security standards.
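As an illustration of what RBAC combined with field-level security means in practice, the sketch below defines hypothetical roles that restrict both which data streams a user can query and which fields they are allowed to see; real products express this in their own configuration rather than code like this.

```python
# Illustrative role definitions: stream access plus field-level restrictions.
# Role names, stream names and fields are hypothetical.
ROLES = {
    "ops_engineer": {"streams": ["app-logs", "infra-metrics"], "hidden_fields": []},
    "service_desk": {"streams": ["app-logs"], "hidden_fields": ["user.email", "card.number"]},
}

def authorised_view(role: str, stream: str, document: dict):
    """Return the document as the role is allowed to see it, or None if access is denied."""
    policy = ROLES.get(role)
    if policy is None or stream not in policy["streams"]:
        return None  # RBAC: no access to this data stream at all
    return {k: v for k, v in document.items() if k not in policy["hidden_fields"]}

doc = {"message": "payment failed", "user.email": "jane@example.com", "card.number": "****1111"}
print(authorised_view("service_desk", "app-logs", doc))       # sensitive fields stripped
print(authorised_view("service_desk", "infra-metrics", doc))  # None: stream not permitted
```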
Resilience
How different Observability solutions achieve high availability (HA) varies. Furthermore, the deployment methods supported by a solution also impact the level of HA achievable. For example, a deployment in Kubernetes can take advantage of all the self-healing capabilities inherently available in such a platform. The targeted environment also has an impact: deploying to the cloud, for example, allows distribution across availability zones or even regions. There are also considerations around Disaster Recovery environments, should there be such a requirement, and exactly how the solution supports them. Without unpacking all the factors involved, careful consideration should be given to this. A good solution will offer you the flexibility to decide on the level of HA required, depending on your deployment destination and method of choice.
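To make the Kubernetes point concrete, the sketch below builds an illustrative Deployment manifest (the component name and image are hypothetical) that spreads replicas across availability zones, so zone outages and pod failures are absorbed by the platform's own scheduling and self-healing.

```python
import json

# A minimal sketch of spreading an observability component's replicas across
# availability zones in Kubernetes. Names and image are illustrative only.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "observability-ingest"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "observability-ingest"}},
        "template": {
            "metadata": {"labels": {"app": "observability-ingest"}},
            "spec": {
                # Spread replicas evenly across zones; Kubernetes reschedules failed pods.
                "topologySpreadConstraints": [{
                    "maxSkew": 1,
                    "topologyKey": "topology.kubernetes.io/zone",
                    "whenUnsatisfiable": "ScheduleAnyway",
                    "labelSelector": {"matchLabels": {"app": "observability-ingest"}},
                }],
                "containers": [{"name": "ingest", "image": "example.com/ingest:1.0"}],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))  # kubectl apply -f accepts JSON as well as YAML
```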
Costing model
It is not my intention to deep dive into the costing models of differing solutions. Focusing on what makes a good solution, it must however be factored in that a solution will not be effective if cost prevents it from ingesting all the data required. Costing models should be carefully evaluated against both current and future states. Many businesses find themselves in a position where costs are manageable in the beginning but quickly spiral out of control as soon as features are added or the solution starts scaling.
Skills and Knowledge
Finally, how the chosen Observability platform is implemented is vital. The term ‘Observability’ is just a label printed on the tin of an Observability tool: installing an Observability product does not mean that a business now has observability. It needs to be deployed and implemented in the right way to be effective. Finding experienced practitioners who understand the ins and outs of how to do this is also key to success.
The above is by no means a comprehensive list of attributes, but to me, these are the most important ones to consider.
Where is Observability heading?
Observability is a vital component in modern distributed architectures, and I do not see this changing any time soon. Observability solutions will keep expanding in terms of environment and data source coverage, as well as capabilities that help push the Mean Time To Resolve (MTTR) as close to zero as possible. The use of Machine Learning technologies is becoming ever more prominent, and I see this continuing to evolve and provide more efficient ways to predict, detect and resolve issues. There is also a lot of work going into scale, with a drive to accommodate more and more data with improved efficiency and cost. I can also see solutions moving to incorporate more automation in the resolution process. Observability solutions simply have to grow and evolve as the world they are trying to gain insights from keeps changing.
Learn more about Observability by reading this blog post by Mark Billett, an Observability engineer at LSD.
If you would like to know more about Observability or a Managed Observability Platform, check out our page.