Operating 24/7 – Life as a DevOps Engineer

A variety of factors must be considered when operating as a DevOps Engineer on a 24/7 basis. DevOps does not stop at provisioning systems and programmatically deploying updates; consistent, around-the-clock monitoring and alerting are essential to catch small issues before they impact your systems.

Some of the key characteristics of a properly engineered environment are scalability and immutability. A system that can self-heal and scale automatically will not only save copious amounts of time down the road, it will also reduce operational costs. However, some systems a DevOps Engineer encounters cannot easily be made self-healing or auto-scaling. In these cases, other measures can be applied to minimize interruptions, such as increasing data redundancy and distributing compute capacity. We will discuss examples of each below.

Reactive and Proactive Scaling

When do you suppose e-commerce websites receive most of their transactions? I can tell you it’s not during business hours, when the DevOps Engineer is at their desk hard at work. Most online transactions happen closer to the evening, when you plan to spend time with friends and family. So how do we design an online store that runs 24/7 and handles varying system loads, while still balancing compute costs? This is what we call scalability, and there are two main ways to plan for it.

The first is through reactive monitoring of system metrics that clearly represent system load. Auto-scaling is implemented by setting upper and lower thresholds, so that capacity is added or removed as system metrics cross them. For production workloads, the minimum capacity is typically 3 instances hosted in separate data centers (referred to as Availability Zones at AWS) in order to remain highly available. This provides stability when a data center experiences an outage, such as a network, electrical, or cooling system failure.
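
As a rough illustration, here is a minimal sketch of reactive scaling with boto3, assuming an EC2 Auto Scaling group named web-app-asg in us-west-2 whose subnets span three Availability Zones (all hypothetical names). It keeps a floor of 3 instances and attaches a target-tracking policy on average CPU, which has the same effect as explicit upper and lower thresholds: capacity is added above the target and removed below it.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-west-2")

    # Keep a minimum of 3 instances; the group's subnets are assumed to
    # span three Availability Zones so the floor stays highly available.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="web-app-asg",  # hypothetical group name
        MinSize=3,
        MaxSize=12,
    )

    # Scale out and in around average CPU, a common proxy for system load.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-app-asg",
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,  # add capacity above ~60% CPU, remove below it
        },
    )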

A second way to scale is proactively, by using historical data (e.g. concurrent visitors or conversion rates) from previous busy seasons (e.g. the Christmas holidays or Cyber Monday) to predict system requirements, and then manually setting or scheduling the minimum capacity to meet those requirements in advance. This method can even be combined with auto-scaling to further guard against unexpected spikes.
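
Continuing the sketch above, a scheduled scaling action can raise the group’s floor ahead of a known busy period and leave anything beyond that baseline to the auto-scaling policy. The group name, timestamp, and capacity figures below are illustrative assumptions.

    from datetime import datetime, timezone

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-west-2")

    # Raise the minimum capacity a few hours before an expected rush
    # (e.g. Cyber Monday); the values and timestamp are placeholders.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-app-asg",
        ScheduledActionName="cyber-monday-prewarm",
        StartTime=datetime(2024, 12, 2, 6, 0, tzinfo=timezone.utc),
        MinSize=9,
        MaxSize=30,
        DesiredCapacity=12,
    )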

It is the responsibility of the DevOps Engineer, and in their own best interest, to ensure that systems remain resilient to unforeseen outages. By tuning your environment to monitor system load and scale accordingly, you will preserve the precious time you spend with your friends and family, without interruptions.

Detecting a Pattern in System Metrics

Effective monitoring of production systems does not end with creating and watching the metrics used for scaling. Take, for instance, a situation I was involved in, where an alert on database queue depth was firing roughly every half hour as queries backed up for over a second.
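
For context, an alert of this kind might be configured as a CloudWatch alarm on the RDS DiskQueueDepth metric, roughly as sketched below; the instance identifier, SNS topic, and threshold are hypothetical and would need tuning for a real workload.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    # Alarm when the database disk queue stays elevated for two consecutive
    # one-minute periods; notifications go to an on-call SNS topic.
    cloudwatch.put_metric_alarm(
        AlarmName="rds-queue-depth-high",
        Namespace="AWS/RDS",
        MetricName="DiskQueueDepth",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],  # hypothetical
        Statistic="Average",
        Period=60,
        EvaluationPeriods=2,
        Threshold=5.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:ops-alerts"],  # hypothetical topic
    )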

The first thing that came to mind was that a scheduled job might be running periodically, causing all queries to queue up like a traffic jam on a highway. We began reviewing app server metrics for CPU spikes that coincided with the queue depth spikes on the database, and voilà, we found a server that matched the profile.
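
Lining up the two metric series is straightforward with the CloudWatch get_metric_data API; the sketch below pulls app server CPU and database queue depth over the same window so the spikes can be compared side by side (the instance and database identifiers are assumptions).

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=3)

    # Fetch both series at one-minute resolution over the same window.
    response = cloudwatch.get_metric_data(
        StartTime=start,
        EndTime=end,
        MetricDataQueries=[
            {
                "Id": "app_cpu",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": "i-0abc123"}],  # hypothetical
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
            },
            {
                "Id": "db_queue",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/RDS",
                        "MetricName": "DiskQueueDepth",
                        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],  # hypothetical
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
            },
        ],
    )

    for series in response["MetricDataResults"]:
        print(series["Id"], list(zip(series["Timestamps"], series["Values"]))[:5])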

Upon examination, we found a variety of Quartz jobs running on this server. One had been scheduled to run on the hour and the other on the half hour; both were in charge of cleaning up abandoned transactions. Searching our centralized logging service for relevant errors, we determined that both jobs were failing due to a foreign key constraint error in their queries. As a result, the shopping cart table was growing continuously and over time would impair database performance.
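
If those logs were shipped to CloudWatch Logs, the search might look something like the Logs Insights query below; other centralized logging back ends offer similar query interfaces. The log group name and error text are assumptions for illustration.

    import time

    import boto3

    logs = boto3.client("logs", region_name="us-west-2")

    # Look for the recurring cleanup-job failures over the last few hours.
    query = logs.start_query(
        logGroupName="/app/order-service",  # hypothetical log group
        startTime=int(time.time()) - 6 * 3600,
        endTime=int(time.time()),
        queryString=(
            "fields @timestamp, @message "
            "| filter @message like /foreign key constraint/ "
            "| sort @timestamp desc "
            "| limit 50"
        ),
    )

    # Logs Insights queries run asynchronously, so poll until they finish.
    while True:
        results = logs.get_query_results(queryId=query["queryId"])
        if results["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)

    for row in results["results"]:
        print({field["field"]: field["value"] for field in row})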

A few database queries later, the foreign key constraint issue was resolved and the tables were cleaned up. Monitoring and alerting on system metrics played a huge role in catching the issue early, as well as providing the insight that led us to discover its cause.

Troubleshooting: Lessons Learned

A major lesson learned when troubleshooting backend services is that you cannot simply rely on restarting a service; the classic IT notion that a restart fixes every problem is a misconception. With a stateless application, self-healing through automatic reprovisioning can be very reliable, but this is not as easy with a stateful application. If you have experienced a service disruption once, you are likely to see it again, so you must tackle the problem head on.

I have seen this happen before, when the search functionality of an application server suddenly stopped working and the easy fix was to restart the service. This would return the service to regular operation, but only temporarily. The big clue that led me to a permanent fix was that the disruption occurred after each code deployment. It turned out that during development, a configuration file had been committed to the code base that altered the search configuration and ultimately caused the search service to stop working. A restart was a temporary fix because it reverted the configuration to its previous state. After discussions with the development team, we determined that the file was configured incorrectly, and it was fixed for the next deployment.

Building the Ideal DevOps Environment

As a DevOps Engineer, your ideal environment should be as self-healing and auto-scaling as possible. A centralized logging service complements this greatly by preserving system logs, especially when an instance is terminated by a self-healing or scale-down event. Also ensure that you are monitoring system metrics, another crucial part of a DevOps Engineer’s tool belt. Alerting based on anomalies in those metrics will help you detect and resolve system disturbances. Adhering to these practices will save you countless hours when a system becomes impaired or reaches capacity limits.
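
Anomaly-based alerting can be sketched with a CloudWatch anomaly detection alarm, which fires when a metric leaves the band learned from its own history rather than crossing a fixed threshold. The load balancer name and SNS topic below are hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    # Alarm when application latency rises above the expected band that
    # CloudWatch models from the metric's historical data.
    cloudwatch.put_metric_alarm(
        AlarmName="app-latency-anomaly",
        ComparisonOperator="GreaterThanUpperThreshold",
        EvaluationPeriods=3,
        ThresholdMetricId="band",
        Metrics=[
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": "TargetResponseTime",
                        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/web/abc123"}],  # hypothetical
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                "Id": "band",
                "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
                "Label": "Expected latency",
                "ReturnData": True,
            },
        ],
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:ops-alerts"],  # hypothetical topic
    )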

In cases where you don’t have these features available and a system becomes impaired, rather than working around the problem by simply restarting the service, it is advisable to solve the underlying issue head on using system logs, historical events, and metric data. Being aware of these tips and potential challenges can save you a great deal of time in the long run.

DevOps is core to our practice at TriNimbus. As a leading DevOps solutions provider, we love working with customers to continually improve their products and operational processes by adopting the principles of DevOps, reaping the benefits of continuous delivery and reducing the cost associated with such a change. If you are looking for help with DevOps, we’d love to hear from you!