
Fundamentals of cloud reliability

1. Fundamentals of cloud reliability

Within any IT team, developers are responsible for writing code for systems and applications, and operators are responsible for ensuring that those systems and applications run reliably. Developers are expected to be agile and are often pushed to write and deploy code quickly. Their aim is to release new functionality frequently, increase core business value with new features, and ship fixes fast for an overall better user experience. In contrast, operators are expected to keep the system stable, so they often prefer to work more slowly to ensure reliability and consistency.

Traditionally, developers pushed their code to operators, who often had little understanding of how the code would run in a production, or live, environment. When problems arose, it could be very difficult for either group to identify the source of the problem and resolve it quickly. Worse, accountability between the teams wasn't always clear.

DevOps is a software development approach that emphasizes collaboration and communication between development and operations teams to improve the efficiency, speed, and reliability of software delivery. It aims to break down silos between these teams and foster a culture of shared responsibility, automation, and continuous improvement. One particular concept within the DevOps framework is Site Reliability Engineering, or SRE, which ensures the reliability, availability, and efficiency of software systems and services deployed in the cloud. SRE combines aspects of software engineering and operations to design, build, and maintain scalable, reliable infrastructure.

Monitoring is the foundation of product reliability. It reveals what needs urgent attention and shows trends in application usage patterns, which supports better capacity planning and generally helps improve an application client's experience and lessen their pain. There are "Four Golden Signals" that measure a system's performance and reliability.
They are latency, traffic, saturation, and errors.

Latency measures how long it takes for a particular part of a system to return a result. Latency is important because it directly affects the user experience, changes can indicate emerging issues, its values may be tied to capacity demands, and it can be used to measure system improvements.

Traffic measures how many requests reach your system. Traffic is important because it's an indicator of current system demand, its historical trends are used for capacity planning, and it's a core measure when calculating infrastructure spend.

Saturation measures how close to capacity a system is. It's important to note, though, that capacity is often a subjective measure that depends on the underlying service or application. Saturation is important because it indicates how full the service is, it focuses on the most constrained resources, and it's frequently tied to degrading performance as capacity is reached.

Errors are events that measure system failures or other issues. Errors are often raised when a flaw, failure, or fault in a computer program or system causes it to produce incorrect or unexpected results, or to behave in unintended ways. Errors are important because they can indicate that something is failing, that there are configuration or capacity issues, that service-level objectives are being violated, or that it's time to send an alert.

Three main concepts in site reliability engineering are service-level indicators (SLIs), service-level objectives (SLOs), and service-level agreements (SLAs). They are all types of targets set for a system's Four Golden Signal metrics. Service-level indicators are measurements that show how well a system or service is performing. They're specific metrics, like response time, error rate, or percentage uptime (the amount of time a system is available for use), that help us understand the system's behavior and performance.
Service-level objectives are the goals we set for a system's performance based on SLIs. They define what level of reliability or performance we want to achieve. For example, an SLO might state that the system should be available 99.9% of the time in a given month. Service-level agreements are agreements between a cloud service provider and its customers. They outline the promises and guarantees regarding the quality of service. SLAs include the agreed-upon SLOs, performance metrics, uptime guarantees, and any penalties or remedies if the provider fails to meet those commitments. These might include refunds or credits when the service has an outage that lasts longer than the agreement allows.
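The 99.9% example can be made tangible with a quick back-of-the-envelope calculation: an availability SLO implies a budget of allowed downtime per period. The helper below is a hypothetical illustration assuming a 30-day month, not part of any SLA tooling:

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime permitted over the period while still meeting the SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% monthly SLO leaves roughly 43.2 minutes of allowed downtime;
# adding another nine (99.99%) shrinks the budget to about 4.3 minutes.
print(f"{downtime_budget_minutes(0.999):.1f} minutes")   # 43.2
print(f"{downtime_budget_minutes(0.9999):.2f} minutes")  # 4.32
```

This is why each additional "nine" of availability is dramatically more expensive to deliver: the tolerance for outages shrinks by a factor of ten every time.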

2. Let's practice!
