Building an application in a cloud native way simply means that you will derive the full benefit that the cloud provides, even if your application is running in your on-premise data centre. In the first part of this two-part blog post, I will expand on some of the benefits that your application, infrastructure, business and users will experience by architecting cloud natively.
Keep in mind that these benefits only apply if we make the following assumptions:
Every component in the environment is generated through Infrastructure-as-code and/or config-as-code;
Applications are built to be immutable;
Applications are designed to run within a pay-as-you-use cloud-pricing environment and so embrace elasticity (scale on demand).
Your application can run wherever you want it to
Cloud native applications are made up of microservices that each run in containers, which can easily be moved between container orchestrators like Kubernetes, Red Hat OpenShift or VMware Tanzu. The orchestrators, in turn, can run on top of any public cloud platform, across multiple cloud platforms in a hybrid or poly-cloud configuration, or in your own data centre. It all depends on your applications’ needs and your own preferences.
You will no longer need to troubleshoot differences between environments, such as a developer’s laptop and the testing environment, when the application misbehaves. The container behaves the same in any environment because the configuration and infrastructure are defined as code at the time of deployment.
Pull existing services directly from cloud vendors instead of creating your own
When your application needs a specific component to function, you can leverage existing vendor-built services that are available to your cloud environment. Let’s say, for example, your application requires a storage layer. Instead of spinning up and configuring one of your own, you’re simply able to call a storage layer service from the cloud provider’s library. That means less work on your side and your application will be using services that are vendor-approved and supported.
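For illustration, here is a minimal, hypothetical sketch in Python using AWS’s boto3 library: rather than building and operating an object store of its own, the application simply calls the provider’s managed storage service. The bucket name, region and object key are made-up examples, not a prescription.

```python
import boto3

# Hypothetical example: use the cloud provider's object storage service
# instead of building and operating your own storage layer.
s3 = boto3.client("s3", region_name="af-south-1")

# Store an application artefact in a managed bucket (names are illustrative).
s3.put_object(
    Bucket="my-app-artifacts",
    Key="reports/2023-01-sales.json",
    Body=b'{"total": 1024}',
)

# Read it back later, from wherever the application happens to be running.
response = s3.get_object(Bucket="my-app-artifacts", Key="reports/2023-01-sales.json")
print(response["Body"].read())
```

The same idea applies to managed databases, queues, identity services and so on: the cloud provider operates the service, and your application simply consumes it.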
Scaling up your application doesn’t necessarily mean more hardware costs
Traditionally, applications hosted in a data centre would require the addition of more physical hardware and rack space to facilitate resource scaling. When using a hybrid model with cloud native architecture, scaling out the application can be done with cloud computing resources, at a much lower cost and lead time than provisioning physical hardware. Dynamic scaling also means that when you don’t use the additional resources, you don’t pay for them.
Fault tolerance and self-healing
Cloud native applications are built to be fault-tolerant, meaning they assume the infrastructure is unreliable. This minimises the impact of failures so that the entire application doesn’t fall over when one component is unresponsive. You’re also able to run more than one instance of an application at a time so that it remains available to customers while tolerating outages.
Containers (and therefore containerised applications) also have the ability to self-heal. If a component becomes unresponsive or suffers an issue, a new one can just be started and the old one stopped, without the need for someone to fix the problem first before bringing the application up again.
To summarise, your containerised applications or services can run wherever you need them to, with homogenised environments between development, testing and production. They can heal themselves and are ready for failures so that less time is spent fixing and restoring your application experience. You’re also able to make full use of the cloud for both scaling resources and making use of pre-built cloud services to complete your application.
In part 2, I will look at more benefits of architecting your application in a cloud native way.
Andrew 'Mac' McIver
At LSD Information Technology, I'm helping clients embrace OPEN including Cloud-Native, Containerization, Architecture, DevSecOps and Automation.
I'm an Open Source evangelist who loves helping people get the best out of their cloud experience.
There have been massive shifts in how customers consume services over the last few years, which are necessitating a change to event-driven architectures in order to effectively deliver rich customer experiences and facilitate back-end operations. Where it was once the norm to queue at a bank, customers now expect a fully digital banking experience, either online or through an app. They want instant notifications when transactions occur or if there is potentially fraudulent activity on their account. They also want to apply for new banking products through an app and expect immediate approval, or at least some interaction confirming that something did indeed happen to their request.
Another such example is grocery shopping where there is a definite shift in consumer behaviour with many people wanting to order their groceries online. They want delivery within an hour, live tracking of their order once it is placed, live tracking of the driver once they are on their way and they want to be notified when their groceries are arriving.
Do you notice something similar in these two examples? Both the rich front-end experiences and backend operations which support this are driven by events happening in real-time.
In this post, I will discuss event streaming, the paradigm many businesses are adopting to make this all possible. I will also detail why this approach has become so immensely popular and why it is quickly replacing or supplementing more traditional approaches to delivering real-time back-end operations and rich front-end customer experiences.
What is Event Streaming?
To understand this, I will start by looking at what is meant by an event. This is pretty straightforward as a concept. An event is simply something that has happened. A change in the operation of some piece of equipment or device is an event. The change in the location of a driver reported by a GPS device is an event. Someone clicking on something in a web app or interacting with a mobile device is an event. Changes in business processes, such as an invoice becoming past due, are events. Sales, trades, and shipments are all events. You get the idea.
Giving all lines of business the ability to harness, reason on top of and act on all events occurring across the business in real time can solve a multitude of problems, as discussed in the introduction, and that is exactly what event streaming is designed to do.
A note before I continue: Apache Kafka is recognised as the de facto standard for event streaming platforms, and many of my assertions in this post are based on this technology.
Event streaming uses the publish/subscribe approach to enable asynchronous communications between systems. This approach effectively decouples applications that send events from those that receive them.
In Apache Kafka we talk about producers and consumers, where your producers are apps or other data sources which produce events to what we refer to as topics. Consumers sit on the other end of the platform and subscribe to these topics from where they consume the events.
The decoupling of producers from consumers is vital in mitigating delays associated with synchronous communications and catering for multiple consumers or clients as opposed to point-to-point messaging.
The topics mentioned above are broken up into partitions, and the events themselves are written to these partitions. Simply put, partitions are append-only logs which are persisted. This persistence means the same events can be consumed by multiple applications, each maintaining its own offsets. It also means applications can replay events if required.
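To make these concepts concrete, here is a minimal sketch using the kafka-python client. It assumes a broker at localhost:9092, and the topic name, consumer group and event fields are my own illustrative examples rather than anything from a specific system.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: an app or data source that writes events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 123, "status": "PLACED"})
producer.flush()

# Consumer: subscribes to the topic and reads events from its partitions.
# Because events are persisted, a consumer group can start from the earliest
# offset and replay everything that has already happened.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="delivery-tracking",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for event in consumer:
    print(event.partition, event.offset, event.value)
```

A second application with a different group_id would consume the same persisted events independently, each group tracking its own offsets.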
There are also a number of other aspects that characterise an event streaming solution, such as the scalability, elasticity, reliability, durability, stream processing, flexibility and openness afforded to developers.
This is a very high-level depiction of what characterises an event streaming platform but when you bring these aspects together, you have the building blocks for a fully event-driven architecture.
Some of the concepts I have discussed above may be familiar to some and there certainly are many technologies that fall into the integration or middleware space. The next section will detail what some of the key requirements are for an event streaming platform along with what differentiates event streaming from other similar technologies.
Event Streaming key requirements and differentiators
There are 4 key requirements for enabling event-driven architectures. I have detailed these below, along with the main differentiators from traditional tools used in the middleware/integration space.
The system needs to be built for real-time events. This is in contrast to a data warehouse for example which is good at historical analysis and reporting. When considering databases, the general assumption is that the data is at rest, with slow batch processes which are run daily. Databases also only typically store the current state and not all events which occurred to reach that state. When considering ETL tools, they are traditionally good for batch processing but not real-time events.
The system needs to be scalable to cater for all event data. This goes beyond just transactional data, which a database, for example, has been historically used for. As we have described above, events stretch across the whole business and can include data from IoT devices, logs, security events, customer interactions, etc. The volumes of data from these sources are significantly more than transactional data. Message Queues (MQs) have traditionally been used as a form of asynchronous service-to-service communication but struggle to scale to efficiently handle large volumes of events and remain performant.
The system needs to be persistent and durable. Data loss is not acceptable for mission-critical applications. Furthermore, mission-critical applications typically have a requirement for events to be replayed, which means persisting events. MQs are transient in nature: messages are typically removed once they are consumed. This has some obvious downsides if the intention is for events to be replayed or for multiple domains in the business to subscribe to the same events.
The system needs to provide flexibility and openness to development teams and have the capability to enrich events in real time. Although not relevant for all use cases, the ability to enrich data in flight has a multitude of benefits and can alleviate the effort of multiple consumers each having to build enrichment logic into their applications. The important aspect is the flexibility event streaming architectures provide, where the system can offer real-time enrichment pipelines or simply dumb pipelines through which events are passed, enriched or consumed directly by the application. There have historically been many systems used in the application integration or middleware space, including ETL tools for batch pipelines, ESBs to enable SOAs and many more which fulfil very similar functions to each other. I am not going to fully unpack all the pros and cons across these technologies; suffice it to say that businesses are moving away from “black box” type systems such as ESBs, where heavy dependencies on critical skills often cause bottlenecks in productivity. In contrast, event streaming provides businesses with flexible and open architectures which decouple applications, alleviate these dependencies and allow any kind of application or system to be integrated regardless of the technologies used.
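As a rough illustration of in-flight enrichment, under the same assumptions as the earlier sketch (kafka-python, a local broker, made-up topic names), a small service can consume raw events, enrich them and publish them to a new topic. In practice this is more often done with stream processing frameworks such as Kafka Streams or ksqlDB, so treat this purely as a sketch of the consume-enrich-produce pattern.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders.raw",
    bootstrap_servers="localhost:9092",
    group_id="order-enricher",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stand-in for a real lookup against a customer-data service or table.
CUSTOMER_SEGMENTS = {"C001": "gold", "C002": "standard"}

for event in consumer:
    order = event.value
    # Enrich the event in flight so every downstream consumer gets it for free.
    order["customer_segment"] = CUSTOMER_SEGMENTS.get(order.get("customer_id"), "unknown")
    producer.send("orders.enriched", order)
```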
When considering the 4 key elements above, Event Streaming is the only technology that ticks all the boxes and is quickly being entrenched as a critical component in the data infrastructure of businesses.
What is the end goal?
To circle back to the opening of my post, the expectations on businesses to continuously improve the experiences they deliver to customers, as well as the back-end operations which support them, are high.
The more focused a business is on delivering real-time event-driven architectures which are scalable, persistent, durable, open, flexible and remove critical dependencies, the more likely they are to succeed and rise above the competition.
There is a very distinct maturity curve when it comes to the adoption of event streaming. Becoming a fully event-driven organisation means that event streaming has been deployed across all lines of business and serves as the central nervous system of all business events. Getting to this point is a journey that often requires a shift in how things have traditionally been done.
This is luckily a journey that you do not need to go on alone. Please feel free to contact us and see how we can help and also stay tuned for my next post where I will go a bit more in-depth on use cases and benefits of adopting event streaming.
Doug Moll
Doug Moll is a solution architect at LSD, focusing on Observability and Event-Streaming for cloud native platforms.
On Friday 2 December 2022, LSD returned to community-focused events by hosting Tech This Out: Ansible at The Tryst in Woodlands, Johannesburg. Here is a wrap-up of what the presentation was about and what attendees got up to at the event.
LSD brought some friends, pizza and beers together to learn about cool technology and had a lot of fun. The Summer weather played along and it was a beautiful day to learn more about automation. The turnout was excellent, with new and familiar faces showing up. Before we dig into the event itself, let’s first take a look at what Tech This Out is and how it came to be.
What is Tech This Out?
LSD wanted to create a community-focused event series that worked similarly to Meetups in concept. Before and during the COVID-19 pandemic, we had the Tech and Tie-dye Community Meetups, which were primarily delivered in an online meeting/webinar-like setting, but they didn’t really capture the spirit of togetherness. We then decided to think of something else, a fresh start with regard to events now that the pandemic is over and we can all see each other in person again. This is where Tech This Out comes in.
The idea is to have a speaker (or two, depending on how the events grow) showing off interesting technology. Usually, our focus is on cloud native technologies like Kubernetes, Elastic, etc., but we wanted to broaden the spectrum a little to eventually include other cool technologies like home automation or robotics. The event also needed to feel casual, so that attendees can take their minds off work pressures for a while, take in new technology and meet other people who share the same interests.
Another key aspect of keeping a community focus was to completely remove the possibility of sales pitches from these events so that attendees could have a better overall experience. We get pitched to every waking second of every day; nobody needs more of that. Yes, the events do sometimes have enterprise technology sponsors, but much like us at LSD, they also just want people to see how cool the technology is, and many of the technologies presented have community or free editions that attendees can use.
The plan is to eventually scale these events out to include Cape Town with either independent events running in parallel, or using a hybrid model to bring the events together. There will be more information on this soon.
The presentation
Nuno Martins, an Ansible Tech Priest (expert) at Red Hat, presented feedback from AnsibleFest 2022, going over some of the highlights from the event in detail. Apart from all the flashy new features in Ansible, there was a big focus on how Ansible enables event-driven automation with popular technologies like Apache Kafka, with more OEM plugins in development.
He also shared some information on Project Wisdom, which Nuno played a part in creating. It is jointly run by Red Hat and IBM, with an AI model being trained to infuse Ansible with new capabilities so that it’s easier for anyone to write Ansible Playbooks with AI-generated recommendations. You can learn more about Project Wisdom in the video below.
Afterwards, there was a great networking session with an open bar and pizzas where attendees got to know each other better. It’s clear that there is a thriving community of technology fans that want to learn together and relax, without being pitched to. Because of that, more Tech This Out events are on the cards for 2023, starting in late January or early February. The first event was very much a learning experience on how to run in-person community events in a post-Covid world. Who knows? Maybe Tech This Out shows up in Cape Town at some point next year. Keep your eyes peeled!
LSD would like to thank Red Hat for sponsoring the first Tech This Out, Nuno Martins for presenting and sharing insight, the team from The Tryst for making sure the venue was ready and everyone that attended the event.
If you would like to be kept in the loop about future Tech This Out events, please fill in the form and you will get added to the community email notifications.
Charl Barkhuizen, Marketing Plug-in
I'm the marketing plug-in and resident golden retriever at LSD Open. You can find me making a lot of noise about how cool Cloud Native is or catch me at a Tech & Tie-dye Meetup event!
Kubernetes and Cloud Native are two of the hottest topics in IT at the moment. From the CIO to the development teams, people are looking at Kubernetes and application modernisation to accelerate business innovation, and reduce time to market for new products and features. But how do these get managed? Do you need to do it in-house utilising your own talent, or would the better option be to find a managed service provider to do it for you?
Cloud Native is a much broader topic than just Kubernetes, and both are extremely complex new technologies. Let’s look at Kubernetes first as the foundation and work our way to cloud native.
So many options to choose from
We’ve already covered what Kubernetes is and where it comes from in an earlier blog post. Companies are modernising their applications to run in a cloud native way to provide digital services to their customers wherever they are (I won’t go into much detail on the modernisation process, but you can read up on containers and microservices in an article published here by my colleague Andrew McIver). These modernised applications are refactored to run in containers, which are typically orchestrated with Kubernetes and can span infrastructure both on-premise and in the cloud. Kubernetes is a popular choice as the orchestrator because of what it enables, but it also has to do with the fact that it is an open source project with many different distributions that cater to the different preferences of the organisations that use it.
We have seen vendors build Kubernetes solutions, with Red Hat OpenShift, VMware Tanzu and SUSE Rancher being some of the top distributions adopted globally. We also see public cloud vendors come up with their own flavours of Kubernetes, with EKS from AWS, AKS from Azure and GKE or Anthos from Google. Then there are the open source projects which are also in use, like the OpenShift community edition (OKD), Rancher and Kubernetes itself, on which all of these editions are based.
Once your organisation has decided that containers and Kubernetes are right for you, how do you get started, and what is best for the business? Do you build and manage Kubernetes yourself, or do you partner with a service provider to manage it for you? I think we can all agree that there is no one right answer to suit everyone; not only each company, but each business unit in a large organisation will have different requirements. In general, though, we can look at the pros and cons and hopefully help you make your decision.
I speak to many companies every day about this topic, and I can completely understand why this is an important consideration. Organisations want to make sure they empower their employees, reduce costs and reduce the reliance on outsourcing.
Start with the Business Goals
Firstly, we need to understand the business strategy, objectives and goals. We want to answer a few questions, which will really help us determine the best course of action:
What is the urgency? Do we need this platform done immediately, or can we take 12 to 18 months? (Unfortunately, this is typically the time it takes even if you have a high-performing Linux team.)
What kind of skills do we have internally? And what is their availability like with everything else on their plates?
Are we looking into public cloud? Do we have a timeframe in mind for including public cloud in our infrastructure? And are we going with a hybrid or multi-cloud configuration?
Do we have good automation, DevSecOps, Infrastructure-as-Code patterns and principles in our organisation?
Once you have answers to these questions, you can start to make sense of what the best options are. I won’t go into detail for each of them, but the important considerations are around urgency and skill.
Urgency and skill
If you are looking to move fast, getting a skilled Kubernetes-focused company to assist in the deployment and management of the platform makes a lot of sense. It removes the burden and worry from the organisation and gives your team the time to learn essential cloud native skills.
There is a major skills shortage at the moment, and it is very difficult to find people with not only the skills but the experience of managing Kubernetes. Building a cluster can usually be done with vendor documentation, but managing and supporting Kubernetes is complex, and it is one of the most common reasons we get contacted by companies that aren’t LSD customers yet.
Doing it yourself
Let’s look at the benefits of the DIY approach:
Your team will grow and learn a new technology
You do not need to rely on outside skills to ensure business continuity
It might be more cost-effective to use existing skills.
Internal growth for your employees
As for drawbacks of the DIY approach, let’s consider the following:
It can be a slow process to upskill people
As mentioned above, there is a big gap in skills; finding people is difficult, and keeping them is even harder.
When something goes wrong, your team is responsible for fixing it, and might not be able to just yet.
Using a Managed Service
If we look at managed services that are done correctly, they should give your business the following:
A platform deployed faster and to the vendor’s standards.
All the skills needed for the job: not just certifications, but actual experience building these platforms. Too many companies get paid to learn on their clients’ time.
24/7 support, because even if you have some skilled people, expecting them to work around the clock is not sustainable, and those people will leave.
Removal of the key-person dependency that so many companies are plagued by.
When you take into account the cost of time, certifications, losing talent, rehiring and more, it often works out more economical to go the managed service route.
Free up your internal resources to focus on what they do best, especially developers.
Platforms deployed by partners or consulting companies should be built to the vendors’ open standards, and in such a way that your team can take them over once it has the skills.
The drawbacks of using a managed service are:
Your business is relying on external skills
The perceived threat of replacement of your team can be difficult to navigate and alleviate
One of the things I feel very strongly about is that we do not want to replace people. We want to grow and empower people. The goal with a managed service should never be to replace staff, it should be to give the business the best chance of success and to get up and running fast while giving the staff the time to learn and grow. It essentially becomes another tool in their toolbox.
I also want to add that services like EKS, AKS and GKE from the hyperscalers still require a lot of management and support. The management they provide is not enough, so I include a managed service to look after those clusters and worker nodes too.
Hopefully, you now have a better understanding of how a managed service weighs up against doing it yourself. There are benefits and drawbacks to both methods, and how well each performs depends on your unique scenario.
Deon Stroebel
Head of Solutions for LSD which allows me to build great modernisation strategies with my clients, ensuring we deliver solutions to meet their needs and accelerating them to a modern application future.
With 7 years’ experience in Kubernetes and Cloud Native, I understand the business impact, limitations and stumbling blocks faced by organisations.
I love all things tech especially around cloud native and Kubernetes and really enjoy helping customers realise business benefits and outcomes through great technology.
I spent a year living in Portugal where I really started to understand clear communication and how to present clearly on difficult topics. This has helped me to articulate complex problems to clients and allow them to see why I am able to help them move their organisations forward through innovation and transformation.
My industry knowledge is focused around Cloud Native and Kubernetes with vendors such as VMware Tanzu, Red Hat OpenShift, Rancher, AWS, Elastic and Kafka.
Now that we have explored what observability is and what makes up a good observability solution, we can dive a bit deeper into the benefits. This is again not an exhaustive list of benefits but I consider these to be the most impactful to businesses. Although some of these have been touched on in my previous posts, in this post I will consolidate these and add the missing pieces.
More performance, less downtime
Leaders in the observability space can detect and resolve issues considerably faster than businesses that are still relatively immature in this space. This includes issues relating to application performance or downtime.
Poorly performing applications or applications experiencing downtime have a direct impact on costs for any business. These can be in the form of tangible costs such as a direct loss in revenue or intangible costs such as brand and reputational damage.
Consider an eCommerce store which cannot transact due to a broken payment service, a social application that can no longer serve ads, a real-time trading application with super high latency, or a logistics application with a broken tracking service. There are literally thousands of examples across industries where the costs associated with downtime or poorly performing applications are very tangible.
When a banking application goes down, almost everyone knows about it the minute it happens. Twitter lights up, it appears on everyone’s news feeds and it even lands up on radio and television news broadcasts. Apart from the direct costs, the reputational damage caused by the downtime of an application can also be very costly, leading to increased customer churn, the loss of new customers and a host of other outcomes which impact the bottom line.
Measuring the true cost of downtime or poorly performing applications can be a difficult task, but it typically far outweighs the cost of making sure observability is done right, where issues are detected early and fixed before they can have a significant impact.
Higher productivity, better customer experience
A properly implemented observability solution provides businesses with massively improved insights across the entirety of the business. These insights improve efficiencies and workflows in detecting and resolving issues across the application landscape. This landscape is distributed in today’s modern architectures and extends to the infrastructure, networks and platforms on which the applications run, both on-prem as well as cloud environments. These insights and efficiencies ultimately provide multiple benefits across business operations.
One of the more tangible benefits is that if your developers and DevOps engineers are not stuck diagnosing problems all day, they can spend their time developing and deploying applications. This means accelerated development cycles, which ultimately lead to getting applications to market quicker as well as to better, more innovative applications.
With businesses being ever more defined by the digital experiences they provide to their customers, observability is one of the edges required to become leaders in the industry. The deeper insights also help to align the different functions of the business. Having visibility on all aspects of the system, from higher level SLAs to all the frontend and backend processes, enables operations and development teams to optimise processes across the landscape. These insights even enable businesses to introduce new sources of income.
Observability is also vital in providing businesses with confidence in their cross-functional processes and assurance that the applications that are brought to market are robust. This confidence is even more important in today’s complex distributed systems which stretch across on-prem and cloud environments.
Happy people, better talent retention
One of the often overlooked benefits of observability is talent retention. With highly skilled developers and DevOps engineers being scarce, it stands to reason that businesses would want to do what they can to retain their best talent.
The frustration of sitting in endless war rooms and spending the majority of the day putting out fires is a surefire way to ensure highly skilled talent will look for opportunities to work elsewhere, to be able to do what they enjoy.
Efficient observability practices and workflows drastically reduce the amount of time developers and engineers spend dealing with issues, making them happier and ultimately helping to retain them.
Fewer monitoring tools, look at all those benefits
One of the themes from my previous posts is that using multiple monitoring tools instead of a centralised observability solution creates inefficiencies and has a severe impact on a business’s ability to detect and resolve issues. From this post, it should be apparent that the insights gained – by a centralised observability solution across the landscape – have a number of other benefits too.
Although this post is dealing with the generic benefits of observability without necessarily comparing it to other approaches, I feel addressing a few drawbacks from the multiple tool approach will also highlight additional benefits of the central platform approach to observability. Below are some of these drawbacks:
Licensing multiple monitoring tools introduces unnecessary costs as well as complexity in administering multiple different licensing models.
Having multiple tools also introduces complexity across your environment with multiple different agents and tools to be managed and operationally maintained.
The diverse and often rare skills required to operate multiple different tools either introduce a burden on existing operations teams or cause reliance on multiple different external parties to implement, manage and maintain tools.
Data governance is vital in any tool or system that stores data. Monitoring tools are no different and often contain sensitive data. Governance for a single observability solution is far simpler to achieve and less costly than multiple tools.
Storing data also has a cost burden which is often far higher when you have multiple tools, each with its own storage requirements.
The main thing to highlight is that the above drawbacks are really secondary to the most important benefit of centralised observability over the multiple-monitoring-tools approach: detecting and resolving issues as quickly and efficiently as possible. This is best achieved with seamless correlation between your logs, metrics and APM data in a centralised platform.
Realising your benefits
To be a leader in the observability space is a journey. As I mentioned in previous posts, observability is not simply achieved by deploying a tool. It starts with architecture and design to ensure the solution adheres to best practices and can scale and grow as the business needs it to. It then extends to ingesting all the right data, formatted and stored in a way that can facilitate efficient correlations and workflows. Then all the other backend and frontend pieces need to fall in place, such as retention management, alerting, security, machine learning, etc.
LSD has been deploying observability solutions for our customers for many years and we help accelerate their journey through our battle-tested solutions and experience in deploying and implementing these solutions. Please follow this link to learn more.
Doug Moll
Doug Moll is a solution architect at LSD, focusing on Observability and Event-Streaming for cloud native platforms.
In Part 1, I introduced Observability and what we are looking to achieve with its implementation. In this post, I will discuss the most critical aspects of a good Observability solution that together will help businesses reach the Observability goals I described in Part 1.
What makes up a good Observability solution?
Scale
So far, I have discussed that the solution needs support for logs, metrics, traces and other data types. This is very important; however, equally important is the ability to process this data and deliver results and insights in real time. The scalability of the platform is therefore also of crucial importance.
Integration
Data is often collected by means of an agent or multiple agents. A good solution typically provides many out-of-the-box integrations to data sources but also the ability to ingest custom sources. The ability to manage a fleet of agents centrally can also simplify operational burdens significantly.
Correlation
I have mentioned the ability to effectively correlate between data sources as being important. There are different ways of achieving this; however, I find the most effective to be a common schema shared across sources.
Visualisation and navigation
A key component of an observability solution is a way to effectively visualise data and provide navigational tools so that users can intuitively explore and analyse it to determine the root cause of problems. This is the user interface that facilitates correlations, and it is often driven through dashboards which either allow the analyst to easily correlate between sources or visualise data from different sources in the same dashboard. The ability to create custom dashboards also adds a lot of value to a solution. Different audiences have different requirements on what they want to view, and customisation allows dashboards to be built with a view of what is directly important to the audience.
Alerting
An effective and intuitive alerting framework is the next feature which adds a lot of value to a solution. In its simplest form, alerting rules are configured based on static thresholds, where an email is sent when a threshold is exceeded. Alerting can, however, be the bane of many operational teams, with alert fatigue being as big a problem as new conditions arising for which no rules have been defined. Furthermore, alerts should trigger a managed response and not just an email, which is often achieved through integrations with existing incident management systems. Properly unpacking alerting and how to solve problems such as alert fatigue is beyond the scope of this post, but a good solution will cater for this and provide those integration capabilities.
Machine Learning
Machine learning is also playing an increasingly vital role in observability, so much so that a good solution should incorporate some measure of it. This goes a long way towards solving some of the common alerting problems detailed above. Machine-learning-based anomaly detection learns over time what normal conditions look like in time series data. For example, it will learn that month-end cycles include many more transactions than are typically expected, and will therefore know that an influx of certain events over that period is normal, preventing alerts from being generated. Static rules cannot achieve this. Similarly, because it is monitoring for anomalies, it does not need a static rule in place to, for example, detect rare or anomalous events in log files and alert on them.
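As a toy illustration of the difference between a learned baseline and a static threshold, the sketch below keeps a rolling window of a metric and flags values that sit far outside what it has recently seen. Real observability platforms use far more sophisticated models (seasonality, multi-metric analysis and so on), and the metric and numbers here are invented, so treat this purely as an illustration of the idea.

```python
import random
import statistics
from collections import deque

def make_detector(window=60, n_sigma=4):
    """Learn a rolling baseline and flag values far outside it."""
    history = deque(maxlen=window)

    def is_anomalous(value):
        anomalous = False
        if len(history) == window:
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9
            anomalous = abs(value - mean) > n_sigma * stdev
        history.append(value)
        return anomalous

    return is_anomalous

detect = make_detector()

# Synthetic metric: roughly 200 transactions per minute with noise,
# plus one rare spike that a learned baseline should pick up.
series = [random.gauss(200, 10) for _ in range(200)]
series[150] = 900

for minute, tx_count in enumerate(series):
    if detect(tx_count):
        print(f"minute {minute}: anomalous transaction volume {tx_count:.0f}")
```

The baseline here is learned from the data itself, so no one has to guess and maintain a threshold for every metric and every time of month.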
Distributed architectures
I previously mentioned the critical importance of observability in new distributed architectures and services. Moving workloads to the cloud has further complicated architectures and introduced many more moving parts, deployed across hybrid environments, both on-premise and in the cloud or even multiple clouds. There are a number of factors to consider in this regard which determine what the right tool is. Firstly, the capability to analyse data from all environments, and to correlate across this data regardless of environment, is important. This may entail a single platform hosted either in the cloud or on-premise, or multiple platforms hosted in different environments with the capability to seamlessly link the clusters in a way that makes navigating data across environments transparent to users. There are multiple factors to consider when architecting an observability solution. Deep diving into these is not in the scope of this article, but it is vital that the deployment models offered by a solution optimally meet all requirements of the business. This is typically achieved by solutions that are flexible, offer different deployment options and can meet requirements both today and in the future.
Solution management and support
When considering deployment models, it is also important to evaluate the operational effort to manage your observability solution. In other words, is it properly orchestrated according to modern architectural standards? How easy is it to scale or upgrade? How easy is it to support? If any of these factors fall outside of the capabilities of the team supporting and maintaining the Observability solution, consider a managed service approach and let Observability experts take care of it.
Security
Security is critical with modern standards requiring encryption, role-based access controls (RBAC), authentication with the identity provider of choice such as AD, etc. Additionally, there are sometimes requirements to secure data to a field or attribute level. A good solution will cater for enterprise-level security standards.
Resilience
How different Observability solutions achieve high availability (HA) varies. Furthermore, the deployment methods supported by the solution would also impact the level of HA achievable. For example, deploying in Kubernetes can take advantage of all the self-healing capabilities inherently available in such a platform. The targeted environment for the deployment also has an impact. Deploying to the cloud would for example allow distribution across availability zones or even regions. Then there are also considerations to be made on Disaster Recovery environments, should there be a requirement, and exactly how those may be supported by the solution. Without unpacking all the factors involved in making your decision, careful consideration should be taken on this. A good solution will offer you the flexibility to decide on the level of HA required, depending on your deployment destination and method of choice.
Costing model
It is not my intention to deep dive into costing models of differing solutions. Focusing on what makes a good solution, it must however be factored in that the solution will not be effective if it is not able to ingest all the data required due to cost. Costing models should be carefully evaluated based on current and future states. Many businesses find themselves in a position where costs are manageable in the beginning but then quickly spiral out of control as soon as features are added or the solution starts scaling.
Skills and Knowledge
Finally, the implementation of the Observability platform decided on is vital. The term ‘Observability’ is just a label printed on the tin of an Observability tool. Installing an Observability product does not mean that a business now has observability. It needs to be deployed and implemented in the right way to be effective. Finding experienced practitioners who understand the ins and outs of how to do this is also key to success.
The above is by no means a comprehensive list of attributes but to me, they are the more important ones to consider.
Where is Observability heading?
Observability is a vital component in modern distributed architectures and I do not see this changing any time soon. Observability solutions will keep expanding in terms of environment and data source coverage as well as capabilities which help push the Mean Time To Resolve (MTTR) to as close to zero as possible. The use of machine learning technologies is becoming ever more prominent and I see this continuing to evolve and provide more efficient ways to predict, detect and resolve issues. There is also a lot of work going into scale, with a drive to accommodate more and more data with improved efficiencies and costs. I can also see solutions moving to incorporate more automation in the resolution process. Observability solutions simply have to grow and evolve as the world they are trying to gain insights from keeps changing.
In part one of this two-part series of posts, I’ll be discussing my views on the fundamentals and key elements of Observability, as opposed to a technical deep dive. There are many great resources out there which already take a closer look at the key concepts. First off, let’s look at what Observability is.
What is Observability?
The CNCF defines Observability as “the capability to continuously generate and discover actionable insights based on signals from the system under observation”.
Essentially the goal of Observability is to detect and fix problems as fast as possible. In the world of monolithic apps and older architectures, monitoring was often enough to accomplish this goal, but with the world moving to distributed architectures and microservices, it is not always obvious why a problem has occurred by merely monitoring an isolated metric which has spiked.
This is where observability becomes a necessity. With observability basically being a measure of how well the internal state of a system can be understood based on its signals, it stands to reason that all the right data is needed! In a distributed system the right data is typically regarded to be logs, metrics and application traces, often referred to as the “three pillars of observability”.
While these are the generally agreed upon key indicators, it is important in my view to also look at including user experience data, uptime data, as well as synthetic data to provide an end-to-end observable system.
The analyst’s ability to then gain the relevant insights from this data to detect and fix root cause events in the quickest and most efficient way possible is the measure of how effectively observability has been implemented for the system.
There are a number of aspects which can determine the success of your observability efforts, some of which bear more weight than others. There are also tons of observability tools and solutions to choose from. What is fairly typical amongst customers that LSD engages with is that they have numerous tools in their stable but have not achieved their goals in terms of observability, and therefore haven’t achieved the desired state.
Let’s explore this a bit more by looking at what the desired state may look like.
What is the desired state?
This is best explained by looking at an example: A particular service has a spike in latency which is likely picked up through an alert. How does an analyst go from there to determine the root cause of the latency spike?
Firstly, the analyst may want to trace the transaction causing the latency spike. For this, they would analyse the full distributed trace of the high-latency events. Having identified the transaction, the analyst still does not know the root cause. Some clues may lie in the metrics of the host or container it ran in, so that may be the next course of action. The root cause is most often found in the logs, so ultimately the analyst would want to analyse the logs for the specific transaction in question.
The above scenario is fairly simple; however, achieving it in the most efficient way relies on the ability to optimally correlate between logs, metrics and traces.
Proper correlation means being able to jump directly from a transaction in a trace to the logs for that specific transaction, or being able to jump directly to the metrics of the container it ran in. To me, the most effective way to achieve this is for all the logs, metrics and traces to exist in the same observability platform and to share the same schema.
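As a small sketch of what a shared schema can look like in practice, the example below emits application logs as JSON with trace.id and transaction.id fields, loosely modelled on the Elastic Common Schema style; the service name and field names are illustrative assumptions. When the tracing library propagates the same trace id into every log line, jumping from a trace to the logs for that exact transaction becomes a simple filter on one field.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit log lines that share field names with traces and metrics."""

    def format(self, record):
        return json.dumps({
            "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "message": record.getMessage(),
            "service.name": "payment-service",          # illustrative service name
            "trace.id": getattr(record, "trace_id", None),
            "transaction.id": getattr(record, "transaction_id", None),
        })

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In practice the trace id is propagated by the tracing/APM library;
# here we just generate one to show the shared field.
trace_id = uuid.uuid4().hex
logger.info("payment authorised", extra={"trace_id": trace_id, "transaction_id": "tx-42"})
```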
In the digital age, customers want a flawless experience when interacting with businesses. Let’s look at a bank for example. There is no room for error when a service is directly interacting with a customer’s finances. So when an online banking service goes down for three days (it happens), it will lose customers or at least suffer reputational damage.
The ultimate goal is to detect and fix root cause events as quickly and efficiently as possible, and in this, the approach of using multiple tools fails.
In part two of this series, I will discuss the most critical factors which contribute to a good Observability solution that will help businesses reach the goals set out above.
Sounding like the next enterprise technology buzzwords, the terms ‘containerisation’ and ‘Kubernetes’ feature in many of today’s business meetings about technology platforms. Although containerisation has been around for a while, Kubernetes itself has only been on the scene for just over half a decade. Before looking at those concepts, we first need to look at the history of cloud native to understand how it got to where we are today.
The past
In the past, when computing first became critical to businesses, they started out with mainframe servers: huge, expensive machines designed for almost 100% uptime. Their computing power was incredible and has only grown over the past 70+ years. In the 80s and 90s, a move to commodity hardware took place, away from bulky proprietary mainframe servers.
Commodity servers are based on the principle that lots of relatively cheap, standardised servers running in parallel can perform the same tasks as a mainframe. If one failed, it was quick and easy to replace. The model worked by having redundancy at scale, rather than the mainframe approach of a small number of servers with high redundancy built in. It also meant that data centres started growing exponentially to make space for all the servers.
The other problem that appeared with commodity hardware is that it lacked the features needed to isolate individual applications. This meant that multiple applications would share a single server, or each application would need a dedicated server of its own. Another solution was required, which is where enterprise x86 server virtualisation comes in. The ability to run multiple virtual servers on a single physical server changed the game yet again, with companies like VMware dominating the market and growing at an unimaginable scale. Virtualisation used the principle that a virtual server with its own operating system is attached to a network, with its own storage block. It would behave like a regular server with its own network interface(s) and drives, and would have its own operating system, anti-virus, application libraries, application code and any other supporting software installed on it, just like any other computer.
The problem soon became clear: the resources, disk space, memory and licenses used to run multiple instances of the same operating system across all of these virtual servers, together with all the installed components, added up and were essentially wasted.
This is where containers shine
These problems are what containers are designed to solve. Instead of providing isolation between workloads by virtualising an entire server, the application code and supporting libraries are isolated from other workloads via containerisation.
Let’s create two viewpoints to understand containers even further: a development angle and an infrastructure & operations angle.
Infrastructure & Operations
From an infrastructure & operations perspective, IT teams had their hands full managing all of these virtual servers, each with a host of components to maintain. Each had an operating system that needed to be patched and upgraded, anti-virus, and security tools that needed to be managed (usually agents installed to monitor the server and the application). Beyond that, there were also the application’s individual components needed to make it function. Usually, a developer would finish their code and send it to IT. IT would copy it to the development server and start testing to see if it was fit for the production environment. Was the development server the same as the developer’s laptop? Chances are it was not, which meant the code would need fixing for the differences between the two environments. Next, it moved to User Acceptance Testing (UAT), System Integration Testing (SIT) or Quality Assurance (QA), all of which also had to be identical to the servers before them. Finally, it would reach the production environment, which again needed to be identical to the previous environments to function correctly. The IT team spent enormous amounts of time fixing, debugging and keeping track of each server, and making sure there was a repeatable process to do it all over again. Containers solved this problem by enabling code to run the same regardless of the environment: on a developer’s laptop, a server or in the cloud. The team needed only to ensure that the container could reach its network and storage locations, making it far easier to manage.
Development
Looking at it from a development perspective, the idea with containers and cloud native is that an application is broken down and refactored into microservices. This means that instead of one single, monolithic code base, each part of the application operates on its own, taking in the data it needs and producing its programmed output. An example of this is Uber: its application consists of hundreds of services, each with its own function. One service maps out the customer’s area, the next looks for drivers, the next compares drivers to find the best fit for the trip, another handles route mapping, another calculates the cost of the route, and so on. Running each service in its own virtual machine would be near impossible, but running each of these services in its own container is a completely different story. The containers run on a container orchestration engine (like Kubernetes) which handles the load, network traffic, ingress, egress, storage and more on the server platform.
That doesn’t mean monolithic applications can’t run in a container. In fact, many companies follow that route as it still provides some of the cloud native benefits, without having to undertake the time-consuming task of completely rewriting an application.
The next point to consider is how all of these containers are going to be managed. The progression went from a handful of mainframes, to dozens of servers, to hundreds of virtual machines, to thousands of containers.
Enter Kubernetes
As discussed in a previous blog post, Kubernetes has its roots in an internal Google project called “Borg”, which inspired Kubernetes, open-sourced by Google in 2014. There have been many other container orchestration engines like Cattle, Mesos and more, but in the end, Kubernetes won the race and it is now the standard for managing containerised applications.
Kubernetes (also known as K8s) is an open-source system for automating the deployment, scaling, and management of containerised applications. The whole idea is that Kubernetes can run on a laptop, on a server or in the cloud, which means that it can deploy a service or application in a container and move it to the next environment without experiencing a problem. Kubernetes has some amazing features that make it a powerful orchestration tool, including:
Horizontal scaling – containers are created within seconds to keep up with demand, so that a website or application will not go down because of user traffic. Additional compute nodes can also be added (or removed) as needed according to workload demand.
Automated rollouts and rollbacks – new features of an application can be tested on, let’s say, 20% of the users, slowly rolling out to more as they prove safe. If a problem occurs, the containers can simply be rolled back to the previous version that worked, with minimal interruption.
Self-healing – failing containers are restarted, containers are replaced or rescheduled when nodes die, and containers that stop responding are killed. It all happens automatically.
Multi-architecture – Kubernetes can manage both x86- and ARM-based clusters.
Kubernetes is what allows companies like Netflix, Booking.com and Uber to handle customer scale in the millions, and gives them the ability to release new versions and code daily.
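As a hedged sketch of the declarative model behind these features, the example below uses the official Kubernetes Python client to ask for five replicas of a Deployment; the Deployment name and namespace are illustrative, and it assumes a working kubeconfig for the cluster. Kubernetes then converges the cluster on that desired state, which is the same mechanism that drives self-healing.

```python
from kubernetes import client, config

# Assumes a kubeconfig with access to the cluster and an existing
# Deployment called "web" in the "default" namespace (both illustrative).
config.load_kube_config()
apps = client.AppsV1Api()

# Horizontal scaling: declare that five replicas should exist; the
# scheduler creates or removes pods to converge on that desired state.
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="default",
    body={"spec": {"replicas": 5}},
)

# Self-healing follows from the same model: if a pod dies, the Deployment
# controller notices the drift from the desired state and starts a
# replacement automatically, without anyone declaring anything new.
```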
What about serverless? ‘Serverless’ does not literally mean “without a server”, and many serverless platforms are themselves built on Kubernetes. Simply put, it means that a cloud provider makes resources available on demand when needed, instead of them being allocated permanently. It is an event-driven model and will be discussed in more detail in a later blog post.
In future posts, application modernisation will be discussed, and why it is so important, using real-world examples from actual teams. These will show how businesses that adopt containers, DevOps and cloud native are moving ahead of their competitors at an exponential rate.
Deon Stroebel
Head of Solutions for LSD which allows me to build great modernisation strategies with my clients, ensuring we deliver solutions to meet their needs and accelerating them to a modern application future.
With 7 years’ experience in Kubernetes and Cloud Native, I understand the business impact, limitations and stumbling blocks faced by organisations.
I love all things tech especially around cloud native and Kubernetes and really enjoy helping customers realise business benefits and outcomes through great technology.
I spent a year living in Portugal where I really started to understand clear communication and how to present clearly on difficult topics. This has helped me to articulate complex problems to clients and allow them to see why I am able to help them move their organisations forward through innovation and transformation.
My industry knowledge is focused around Cloud Native and Kubernetes with vendors such as VMware Tanzu, Red Hat OpenShift, Rancher, AWS, Elastic and Kafka.
Let’s start with a definition. According to the Cloud Native Computing Foundation (CNCF), ‘cloud native’ can be defined as “technologies that empower organisations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds”. Essentially it is a technology that is purposefully built to make full use of the advantages of the cloud in terms of scalability and reliability, and is “resilient, manageable and observable”.
Even though the word “cloud” features heavily in the explanation, it doesn’t mean that a cloud native application has to operate exclusively in the cloud. Cloud native applications can also run in your own data centre or server room: the term simply refers to how the application is built (to make use of the cloud’s advantages) and doesn’t pre-determine where it should run (the cloud).
How did it start?
While containerisation and microservices have been around for decades, they didn’t really become popular until around 2015, when businesses pounced on Docker because of its ability to easily package and run computing workloads in the cloud. Google open-sourced their container orchestration tool Kubernetes around that same time, and it soon became the tool of choice for everyone using microservices. Fast forward to today and there are various flavours of Kubernetes available as both community and enterprise options.
How does it work?
As this piece has explained, cloud native means you have the ability to run and scale an application in a modern, dynamic environment. Looking at most applications today, this is just not possible as they are monolithic in nature, which means the entire application comes from a single code base. All its features are bundled into one app and one set of code. Such applications need to know what server they are on, where their database is, where they send their outputs and which sources they expect inputs from. So taking an application like that from a data centre and placing it in the cloud doesn’t really work as expected. Applications can be made to work on this model, but it’s not pretty, it costs a lot of money and it won’t deliver the full benefit of the cloud.
This is not true for all monolithic applications, but the ideal is to move toward microservices. A microservices architecture means that each important component of the application has its own code base. Take Netflix, for example: one service handles profiles, the next handles a user account, the next handles billing, the next lists television shows and movies, and so on. The end result is thousands of these services, which all communicate with each other through an API (Application Programming Interface). Each service has a required input and produces an output, so if the accounts service needs to run a payment, it sends the user code and the amount to the payment service. The payment service receives the request, checks the banking details with the user data service, then processes the payment and sends a success or failure status back to the accounts service. It also means that a smaller team can be dedicated to a single service, ensuring it functions properly.
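To illustrate the accounts-to-payment interaction described above, here is a hypothetical sketch in Python using the requests library. The service URL, endpoint and JSON fields are made up for the example; they are not Netflix’s actual API, just the shape of a typical service-to-service call.

```python
import requests

# Hypothetical address of the payment microservice inside the platform.
PAYMENT_SERVICE_URL = "http://payment-service:8080/payments"

def charge_customer(user_code: str, amount: float) -> bool:
    """Accounts service asking the payment service to process a payment."""
    response = requests.post(
        PAYMENT_SERVICE_URL,
        json={"user_code": user_code, "amount": amount},
        timeout=5,
    )
    response.raise_for_status()
    # The payment service checks banking details with the user-data service,
    # processes the payment and reports success or failure in its response.
    return response.json().get("status") == "SUCCESS"

if __name__ == "__main__":
    print(charge_customer("U123", 199.00))
```

Because each service only depends on the other’s API contract, either side can be rewritten, scaled or redeployed independently.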
Now, moving a set of services to the cloud is fairly simple, as they usually have no state (so they can be killed and restarted at will) and hold no local storage, so it doesn’t matter where they start.
Where is it going?
The latest cloud native survey by the Cloud Native Computing Foundation (CNCF) suggests that 96% of organisations are either evaluating, experimenting with, or have implemented Kubernetes. Over 5.6 million developers worldwide are using Kubernetes, which represents 31% of current backend developers. The survey also suggests that cloud native computing will continue to grow, with enterprises even adopting less mature cloud native projects to solve complicated problems.
In our future posts, application modernisation will be discussed in more detail and used to explain how businesses are really growing and thriving with this new paradigm.
Deon Stroebel
Head of Solutions for LSD which allows me to build great modernisation strategies with my clients, ensuring we deliver solutions to meet their needs and accelerating them to a modern application future.
With 7 years’ experience in Kubernetes and Cloud Native, I understand the business impact, limitations and stumbling blocks faced by organisations.
I love all things tech especially around cloud native and Kubernetes and really enjoy helping customers realise business benefits and outcomes through great technology.
I spent a year living in Portugal where I really started to understand clear communication and how to present clearly on difficult topics. This has helped me to articulate complex problems to clients and allow them to see why I am able to help them move their organisations forward through innovation and transformation.
My industry knowledge is focused around Cloud Native and Kubernetes with vendors such as VMware Tanzu, Red Hat OpenShift, Rancher, AWS, Elastic and Kafka.
LSD today announced that it is augmenting its Managed Kubernetes Platform with SUSE Rancher solution offering by achieving SUSE Platinum partner and Managed Service Provider (MSP) status. Currently, LSD is the only partner to achieve Platinum status in Sub-Saharan Africa. These achievements are a key part of LSD’s strategy of delivering certified expert services to customers that now features SUSE Rancher, SUSE NeuVector and Harvester. The partnership level means that LSD can offer certified managed services to their customers, including enterprise-grade container security through SUSE’s recent acquisition of NeuVector.
”LSD is a long-standing partner with SUSE. They have proven their commitment by their high level of technical certification held” says Ton Musters, SVP Channel & Cloud EMEA, APJ, GC for SUSE.
“LSD has been working with Rancher for many years and manages retailers, banks and a Telco’s primary estate. Their management of other Kubernetes clusters is also fantastic, especially the hyperscale versions such as AWS EKS. LSD has incorporated SUSE Rancher, NeuVector and Harvester into our main offering as the value it brings to our clients is fantastic and our technology team loves working on it” says Deon Stroebel, Head of Solutions for LSD Open.
LSD
LSD was founded in 2001 and wants to inspire the world by embracing OPEN philosophy and technology, empowering people to be their authentic best selves, all while having fun. LSD is your cloud native digital acceleration partner that provides a fully managed and engineered cloud native accelerator, leveraging a foundation of containerization, Kubernetes and open-source technologies. LSD is a silver member of the Cloud Native Computing Foundation (CNCF) and also a Kubernetes Certified Services Provider (KCSP).
Charl Barkhuizen, Marketing Plug-in
I'm the marketing plug-in and resident golden retriever at LSD Open. You can find me making a lot of noise about how cool Cloud Native is or catch me at a Tech & Tie-dye Meetup event!