The path to software delivery is laden with challenges and roadblocks. But once delivery to production is complete, another game starts.

We live in a digital age shaped by the Industry 4.0 revolution. Every business is now a digital business: if its applications are down, then its business is effectively down too.

If we go back 15 to 20 years, Google was the pioneer in this area. During that time frame, Amazon matured rapidly, and that is how AWS emerged as a new business. Had Google capitalized on this area earlier, it could have been the market leader in cloud platforms.

Moving on from cloud platforms, let's come back to our topic of discussion: any hosted application has to factor security, availability, and scalability into its plans. Why have these factors recently become more significant? Site Reliability Engineering can address all of them.

Why Site Reliability Engineering? 

Site reliability engineering helps in estimating, preventing, and managing uncertainty and risks of failure. Although it cannot completely eliminate all failures, what it really does is evaluate the inherent dependability of an application (or process), spot outliers, and recommend actions to mitigate the impact of those failures. 

Delivering software applications is a complex endeavor, but ensuring that they function as intended in the production environment is even more demanding.

Incorporating a handful of features into a software application does not guarantee its success. Success depends on the ability of the production operations team to deliver the security, availability, and scalability that the application promises. Even companies like Walmart that deal in physical goods are heavily dependent on software. As mentioned above, software applications are no longer just support systems for businesses. They are mission-critical and, hence, reliability is the area of focus.

What does Site Reliability mean?

Site Reliability Engineering, to a large extent, augments the capabilities of DevOps. I consider it one of the categories of DevOps. Users of applications from companies like Google, Amazon, and Netflix always expect security, availability, and scalability. If any of these is compromised, it is a lost business opportunity.

Security and Privacy:

• Users are concerned about both Security and Privacy.

• Cloud platforms (for example, AWS or Azure) bring established good practices and frameworks. Private data centers managed by companies have their own challenges.

• Breaches can occur at four levels: data center, system, application, and data. A breach at any level can also impact availability.

Availability:

• The entire value chain of the applications has to be up and running. 

• Proactive Implementation: Monitoring, along with logging, plays a critical role in detecting issues before users are affected.

• Reactive Implementation: Issue management systems like Jira Service Desk should be in place so that users can report problems.

• Apart from the above, at the infrastructure level, Backup, Disaster Recovery, and Change Management processes (Blue-Green Deployments, Rollback) are very critical.
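The proactive monitoring bullet above can be illustrated with a minimal sketch. It assumes a hypothetical `probe` callable that returns True when the application is healthy, and it raises an alert only after several consecutive failures so that a single blip does not page anyone. The class name, the status strings, and the threshold of 3 are illustrative assumptions, not part of any particular monitoring tool.

```python
from typing import Callable


class HealthMonitor:
    """Tracks consecutive probe failures and decides when to alert."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe  # hypothetical health check: True means healthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> str:
        """Run the probe once and return 'ok', 'degraded', or 'alert'."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that crashes is treated the same as an unhealthy result.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return "ok"

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "alert"
        return "degraded"


# Usage with a simulated probe: one success, three failures, then recovery.
responses = iter([True, False, False, False, True])
monitor = HealthMonitor(lambda: next(responses))
statuses = [monitor.check() for _ in range(5)]
print(statuses)
```

In a real setup the probe would call a health endpoint, and the "alert" status would feed the issue management system mentioned in the reactive bullet.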

Scalability:

• For B2C applications, the gap between peak and off-peak load is quite high. Too few resources will cause performance issues, and too many will waste a lot of capacity.

• Technologies like clusters (nodes), containers, and microservices are quite important, along with scale-up and scale-down functionality. This is where Cloud can be utilized at its best.
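The scale-up and scale-down idea above can be sketched as a simple proportional rule: size the number of instances so that average utilization moves back toward a target, within fixed bounds. This is the same shape of rule that the Kubernetes Horizontal Pod Autoscaler documents, but the function name, the CPU metric, and the default limits here are assumptions chosen for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Return the replica count that would bring CPU usage back to the target."""
    # Proportional rule: more load than target means more replicas, and vice versa.
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp so that quiet periods keep a safety floor and spikes cannot run away.
    return max(min_replicas, min(max_replicas, raw))


# Peak traffic: 4 replicas at 90% CPU with a 60% target.
print(desired_replicas(4, 90, 60))
# Off-peak: 10 replicas at 15% CPU with a 60% target.
print(desired_replicas(10, 15, 60))
```

The lower bound guards against the performance issues of too few resources, and the upper bound caps the waste of too many.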

Site Reliability Engineering is the set of practices, including tools, used to manage these aspects. Adopting cloud platforms like AWS and Azure will make this easier for any company.

Is SRE Applicable for every company? 

Regardless of industry or size, every company falls into one of the categories below.

  1. Software Product companies hosting applications for customers
  2. IT Service providers that host applications for customers
  3. Any company hosting applications for internal users
  4. Any company that doesn't host applications
    1. Service Providers like Marketing consulting agencies
    2. Software / Hardware product companies ( OEMs )

For categories 1 and 2, it is critical to implement SRE with the highest maturity.

For category 3, SRE is definitely needed but is not as critical as for 1 and 2.

For category 4, SRE is not applicable.

Important Metrics to track Site Reliability Engineering

When embracing Site Reliability Engineering, it is important to constantly monitor, track, and measure the application across various metrics to evaluate its reliability. Some important metrics include uptime, mean time to failure (MTTF), mean time between failures (MTBF), mean time to repair (MTTR), rate of failure occurrence, and probability of failure, among many others.

These metrics help teams determine the level of software quality as well as the volume and variety of potential failures – so they can take steps to overcome issues in the quickest possible time. 
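As a rough illustration of how a few of these metrics relate, the sketch below derives availability, MTTR, and MTBF from a list of outage records over an observation window. It uses a common operational definition (MTTR as total downtime per failure, MTBF as total uptime per failure). The `Incident` type and the function name are hypothetical, and the code assumes incidents do not overlap and fall entirely inside the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime  # when the outage began
    end: datetime    # when service was restored


def reliability_metrics(incidents, window_start, window_end):
    """Compute availability %, MTTR and MTBF (in hours) for the window."""
    window = window_end - window_start
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = window - downtime
    failures = len(incidents)

    result = {
        "availability_pct": 100.0 * (uptime / window),
        "failures": failures,
        "mttr_hours": None,  # undefined when there were no failures
        "mtbf_hours": None,
    }
    if failures:
        result["mttr_hours"] = (downtime / failures).total_seconds() / 3600
        result["mtbf_hours"] = (uptime / failures).total_seconds() / 3600
    return result


# Example: a 30-day window with three outages lasting 1, 2 and 3 hours.
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
incidents = [
    Incident(start + timedelta(days=3), start + timedelta(days=3, hours=1)),
    Incident(start + timedelta(days=12), start + timedelta(days=12, hours=2)),
    Incident(start + timedelta(days=25), start + timedelta(days=25, hours=3)),
]
metrics = reliability_metrics(incidents, start, end)
print(metrics)
```

In this example the six hours of downtime across three failures give an MTTR of 2 hours and an MTBF of 238 hours, with availability a little under 99.2%.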

What does site reliability engineering focus on?

Assessing the inherent reliability of a software application and suggesting appropriate actions to mitigate issues requires teams to embrace certain concepts or practices. 

Focus by the Operations Team: 

• Measuring the Metrics 

• Security Implementation

• APM Implementation

• ITIL or Jira Service Desk (JSD) Implementation

• Automation

• Cloud Migration

Focus by the Development or Engineering Teams: 

• Logging Framework

• Scalable Architecture

Delivering high-quality applications does not mean just high performance; it also requires teams to ensure applications are reliable. Engineering teams need to design and develop for reliability in addition to the functional, technical, and regulatory requirements. You can read more on this topic in this great e-book on How Google manages SRE.


Please comment below and let us know your thoughts on Site Reliability Engineering and Software Reliability, as we plan to explore more on these topics. 
