Service Level Objective

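Why are SLOs so important?

SLOs are set in order to meet the SLA: they translate customers' expectations of system stability into concrete targets. They matter because they keep the IT team focused directly on what customers care about and keep system stability within an acceptable range. Of course, the more reliable a service is, the more it costs.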

References

  1. Measuring and evaluating Service Level Objectives (SLOs)

Summary

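A service level objective (SLO) is the target that system monitoring works toward, set so that the commitments made in the SLA are met. An SLO is composed of an SLI, a period of time, and a target (usually expressed as a percentage), as in the following formula: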

$$ SLO = SLI + period\ of\ time + target $$
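As a rough illustration of the three parts named above (an SLI, a period of time, and a target), the sketch below models an SLO as a small Python data structure. The class name `SLO`, its field names, and the rule that a window with no traffic counts as meeting the objective are all assumptions made for this example; they do not come from the referenced article.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SLO:
    """An SLO as an SLI name, an evaluation window, and a target."""

    sli_name: str      # which indicator is measured, e.g. "availability"
    window: timedelta  # period of time the objective covers
    target: float      # required fraction of good events, e.g. 0.999

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True if the measured SLI meets the target for the window."""
        if total_events == 0:
            # Assumption: a window with no traffic counts as meeting the objective.
            return True
        return good_events / total_events >= self.target


availability_slo = SLO("availability", timedelta(days=30), 0.999)
print(availability_slo.is_met(good_events=99_950, total_events=100_000))  # True
print(availability_slo.is_met(good_events=99_800, total_events=100_000))  # False
```

Keeping the target as a fraction rather than a percentage string makes the comparison a single division, and the window travels with the objective so that a report can state which period a result refers to.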

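When establishing SLOs, the objectives must be explicit and clear. The article gives five points for measuring and evaluating SLOs. In my own experience, some of them are difficult to carry out in practice; those difficulties are listed alongside them below.

The points above are optimization work carried out in a continuous cycle, not a one-off task.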

Note

Original article: Measuring and evaluating Service Level Objectives (SLOs)

In this context, SLAs (Service Level Agreements) are likely familiar. An SLA is a written agreement between the client and the service provider to ensure a healthy level of quality. If the specified conditions aren’t met, there are consequences, and they are often financial.

However, the real world isn’t this simple. Service owners are accountable to both outside and inside stakeholders, who depend on the services to meet their business objectives. This is especially common in microservices architectures, where one service depends on another. As it doesn’t make sense to have written contracts for everything, service owners should instead be held responsible through clearly defined objectives. There are no severe penalties if those objectives aren’t met, yet this doesn’t mean they are there for nothing. There are still consequences, or rather corrective actions, needed to improve those services.

A simple equation that defines the relationship between an SLA and an SLO is:

$$ SLA = SLO + written\ and\ signed\ consequences $$

Let’s focus on 5 key steps while measuring and evaluating SLOs.

Set the right objectives

Setting the right objectives is the first important step towards building proper SLOs. There are some important things to consider at this point:

Although these points are important and seem obvious, it is really hard to identify the right metrics. Talk openly with users and be clear on what is promised.

Collect monitoring data

Once important metrics have been identified, they need to be collected. This stage depends heavily on SLOs and what the service means to others. Different things may need to be monitored depending on the level of abstraction. Often what is needed is a monitoring tool like DataDog to collect and visualize the data. These tools allow for aggregation and alerting when the metric reaches the threshold defined.
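To make the collection step concrete, here is a minimal sketch that derives an availability SLI from raw request samples over a time window. It is plain Python and does not use the API of DataDog or any other monitoring tool. The `RequestSample` type, the `availability_sli` function, and the rule that only 5xx responses count as failures are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class RequestSample(NamedTuple):
    timestamp: datetime
    status_code: int
    latency_ms: float


def availability_sli(
    samples: Sequence[RequestSample], now: datetime, window: timedelta
) -> Optional[float]:
    """Fraction of requests in the window that did not fail with a 5xx status."""
    recent = [s for s in samples if now - window <= s.timestamp <= now]
    if not recent:
        return None  # no data in the window, so the SLI is undefined
    good = sum(1 for s in recent if s.status_code < 500)
    return good / len(recent)


now = datetime(2024, 1, 31, 12, 0)
samples = [
    RequestSample(now - timedelta(minutes=5), 200, 120.0),
    RequestSample(now - timedelta(minutes=4), 503, 30.0),
    RequestSample(now - timedelta(minutes=3), 200, 95.0),
    RequestSample(now - timedelta(minutes=2), 404, 12.0),  # client error, not a failure here
    RequestSample(now - timedelta(hours=2), 500, 10.0),    # outside the one-hour window
]
print(availability_sli(samples, now, timedelta(hours=1)))  # 0.75
```

Returning `None` for an empty window keeps "no data" distinct from "0% available", which matters when an alert threshold is later applied to the value.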

Alert on collected metrics

Alerting is a critical and complex job in itself. Filtering out low-priority alerts and notifying the team about the rest is important for the health of on-call. But this is not the only place where an incident management solution such as Opsgenie helps. A proper incident management tool does a lot more than that. It centralizes all alerts from different monitoring tools in one dashboard and allows users to categorize important alerts for later analysis.
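The filtering idea can be sketched as a small triage function that splits incoming alerts into those that page on-call and those kept for later analysis. This is not Opsgenie's API. The `Alert` type, the numeric priority scale (1 is most urgent), and the default `page_threshold` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str    # monitoring tool that raised the alert
    service: str   # service the alert refers to
    priority: int  # hypothetical scale: 1 = most urgent, 5 = least urgent
    message: str


def triage(
    alerts: Iterable[Alert], page_threshold: int = 2
) -> Tuple[List[Alert], List[Alert]]:
    """Split alerts into (page on-call now, keep for later analysis)."""
    page: List[Alert] = []
    review: List[Alert] = []
    for alert in alerts:
        if alert.priority <= page_threshold:
            page.append(alert)
        else:
            review.append(alert)
    return page, review


alerts = [
    Alert("metrics", "checkout", 1, "error rate above threshold"),
    Alert("logs", "search", 4, "slow query logged"),
    Alert("metrics", "search", 3, "latency slightly elevated"),
]
page, review = triage(alerts)
print([a.service for a in page])    # ['checkout']
print([a.service for a in review])  # ['search', 'search']
```

Low-priority alerts are kept rather than dropped, so they remain available for the reporting step.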

Create reports from alerts

Once all of the alerts are in one place, it’s important to set up alert reporting, which makes it easy to see important data points in a structured view. To report on SLOs, Service and Infrastructure Health Reports are used at Opsgenie. They include key indicators that can be used to evaluate metrics and to share with customers as a team. Examples of these metrics are: mean time to resolve and close incidents per service; service health percentage (healthy/unhealthy state by outages and disruptions); the severity of incidents that arise in a service and the alerts associated with all incidents (so that insight is gained into which monitoring systems reported the incident in which way); and how stakeholders were affected by the service disruptions, including whether they were notified in a timely and proper way. The infrastructure health reports provide infrastructure-wide context by allowing stakeholders to see all alerts and incidents across an entire infrastructure in a single view.
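Two of the metrics mentioned above, mean time to resolve per service and service health percentage, can be computed from a list of incidents as in the sketch below. The `Incident` type and both function names are hypothetical, this is not how Opsgenie's reports are implemented, and the health calculation assumes that incidents for the same service do not overlap in time.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Sequence


class Incident(NamedTuple):
    service: str
    opened_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents: Sequence[Incident]) -> Dict[str, timedelta]:
    """Return the mean resolution time for each service."""
    durations: Dict[str, List[timedelta]] = defaultdict(list)
    for inc in incidents:
        durations[inc.service].append(inc.resolved_at - inc.opened_at)
    return {svc: sum(ds, timedelta()) / len(ds) for svc, ds in durations.items()}


def health_percentage(
    incidents: Sequence[Incident],
    service: str,
    period_start: datetime,
    period_end: datetime,
) -> float:
    """Percentage of the period during which the service had no open incident.

    Assumes incidents for one service do not overlap in time.
    """
    period = period_end - period_start
    unhealthy = timedelta()
    for inc in incidents:
        if inc.service != service:
            continue
        # Clip each incident to the reporting period before adding it up.
        start = max(inc.opened_at, period_start)
        end = min(inc.resolved_at, period_end)
        if end > start:
            unhealthy += end - start
    return 100.0 * ((period - unhealthy) / period)


start = datetime(2024, 1, 1)
end = datetime(2024, 1, 11)  # a 10-day (240-hour) reporting period
incidents = [
    Incident("api", datetime(2024, 1, 2, 10), datetime(2024, 1, 2, 12)),  # 2 hours
    Incident("api", datetime(2024, 1, 5, 8), datetime(2024, 1, 5, 12)),   # 4 hours
    Incident("web", datetime(2024, 1, 3, 0), datetime(2024, 1, 3, 1)),    # 1 hour
]
mttr = mean_time_to_resolve(incidents)
print(mttr["api"])                                                # 3:00:00
print(f"{health_percentage(incidents, 'api', start, end):.1f}")  # 97.5
```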

Evaluate and share the reports

Reports mean nothing if left unevaluated. They are the written proof of performance on the service level indicators defined internally, and they help show whether SLOs were met or not. Evaluation should include every team member and stakeholder. This means transparency is crucial: be open about the reports and share the results with others. To dig a little deeper with analytics tools or create more sophisticated reports for stakeholders, export the reports for easy sharing.

SLOs don’t matter if the cycle isn’t repeated

Once the cycle is completed, from creating the objectives through to evaluating the reports, the job still isn’t done. It starts all over again. Reevaluate objectives and take corrective actions either by refining the indicators or making services more robust. Examine error budgets closely to make sure that overachievement is avoided (yes, that is bad too). It is important to design objectives taking into account that tools and services will fail, because they will.
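The error budget mentioned above is the amount of failure an SLO target still permits: with a target fraction $t$ and $N$ events in the window, up to $(1 - t) \cdot N$ events may be bad. The sketch below computes that allowance and the share of it still unspent. The function name `error_budget` and the example numbers are hypothetical, and counting events is only one possible way to express a budget.

```python
from typing import Tuple


def error_budget(target: float, total_events: int, bad_events: int) -> Tuple[float, float]:
    """Return (allowed_bad_events, remaining_fraction) for one SLO window.

    target is the SLO target as a fraction, e.g. 0.999.
    remaining_fraction is 1.0 when nothing is spent and <= 0 when the budget is exhausted.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed = (1 - target) * total_events
    remaining = 1 - bad_events / allowed
    return allowed, remaining


allowed, remaining = error_budget(target=0.999, total_events=1_000_000, bad_events=250)
print(f"{allowed:.0f}")    # 1000
print(f"{remaining:.0%}")  # 75%
```

A budget that stays almost entirely unspent window after window is the overachievement case: it suggests the service is more reliable than the objective requires, which is a cue to revisit the objective in the next cycle.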