Google Cloud’s Site Reliability Engineering: Measuring and Managing Reliability course teaches the theory of Service Level Objectives (SLOs), a principled way of describing and measuring the desired reliability of a service. Upon completion, learners should be able to apply these principles to develop the first SLOs for services they are familiar with in their own organizations.
Learners also learn how to use Service Level Indicators (SLIs) to quantify reliability and error budgets to drive business decisions around engineering for greater reliability. They will come to understand the components of a meaningful SLI and walk through the process of developing SLIs and SLOs for an example service.
What you will learn
- How to make systems reliable
- Understanding SLIs, SLOs and SLAs
- Quantifying risks to and consequences of SLOs
Who can attend?
- Technical Solutions Engineers
- Technical Leads
- IT Managers
- System Administrators
- Systems Analysts
Syllabus – What you will learn from this course
- Introduction to SRE: This module is intended to bring you up to speed on the concepts underpinning SRE, CRE, and SLOs. If you’re already familiar with these concepts, you may still find new information and perspectives in this module, but it is not necessary to complete it.
- Targeting Reliability: In this module we’re going to talk about how you measure the desired reliability of a service. We will address what to consider when setting SLOs for your application within your organization. We’ll look at the three principles we use to measure the desired reliability of a service.
- Operating for Reliability: In this module, we’ll start by introducing a mechanism for quantifying unreliability using something called an error budget. We’ll show how error budgets help you decide when to focus on making a service more reliable.
- Choosing a Good SLI: In this module we will start off by taking a look at some characteristics of monitoring metrics that can make them useful as SLIs and contrast these against other metrics that are less useful.
- Developing SLOs and SLIs: In this module, we introduce the fictional company that created our example mobile game, the infrastructure that we’ll be working with, and the simple user journey to which we’ll apply the four-step process.
- Quantifying Risks to SLOs: In this module we’ll be taking a critical look at the availability risks for our example service. We want to answer the question: “Are our SLO targets and error budgets realistic?”
- Consequences of SLO Misses: In this module, we’ll cover best practices for documenting your SLOs, the rationale behind a formal error budget policy, and how best to create one. Finally, we’ll look at an example error budget policy to understand the trade-offs and incentives that play out when negotiating such a policy.
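To give a flavour of the arithmetic behind the SLI and error budget modules above, here is a minimal Python sketch. It is not taken from the course: it assumes a request-based availability SLI (good requests divided by total requests), and the function names, the 99.9% target, and the request counts are all hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good (assumed request-based SLI)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measured window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    allowed_bad = error_budget(slo_target, total_events)
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 requests in the window, 500 of them failed.
    slo = 0.999
    good, total = 999_500, 1_000_000
    print(f"SLI: {availability_sli(good, total):.2%}")
    print(f"Allowed bad requests: {error_budget(slo, total):.0f}")
    print(f"Budget remaining: {budget_remaining(slo, good, total):.1%}")
```

With these made-up numbers, roughly half of the budget is still available. A team following an error budget policy might keep shipping features while budget remains and shift effort to reliability work once it is spent.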
To enroll for this course, click the link below.
Note: NoticeBard is associated with Coursera through an affiliate programme.