The SRE Role as Defined by Google

Evergreen Note

Question :: What is this article mainly about?

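Answer :: Google defines SRE as treating operations as a software problem. The core value of operations is keeping systems stable, and how that is achieved differs from company to company. In other words, what SRE actually does is not quite the same at any two companies.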

Summary

The original text comes from the Google Site Reliability Engineering site, which also offers many resources worth reading.

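The content is excerpted from the home page of the Google Site Reliability Engineering website and briefly introduces what SRE is. Its core idea is “SRE is what you get when you treat operations [[operations-what-it]] as a software problem”. I read this as handling operational tasks as part of software engineering: define a standard procedure for each operational task, then carry out that procedure with software or automation, so that the system stays reliable and stable.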
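To make my reading concrete, here is a minimal sketch of what "a standard procedure carried out by software" might look like. It is my own illustration, not anything taken from Google's site: the `remediate` helper, the simulated `service` dictionary, and the `is_healthy` / `restart` functions are all hypothetical names. The procedure encoded is the common manual routine "check the service, restart it if unhealthy, check again".

```python
from typing import Callable


def remediate(check: Callable[[], bool],
              fix: Callable[[], None],
              max_attempts: int = 3) -> bool:
    """Apply `fix` after each failed `check`, up to `max_attempts` times.

    Returns True as soon as `check` passes, otherwise the result of one
    final check after the last fix attempt.
    """
    for _ in range(max_attempts):
        if check():
            return True
        fix()
    return check()


# Simulated service state (hypothetical; a real check would probe an endpoint).
service = {"healthy": False, "restarts": 0}


def is_healthy() -> bool:
    return service["healthy"]


def restart() -> None:
    # Stand-in for a real restart command; here it just flips the flag.
    service["restarts"] += 1
    service["healthy"] = True


if __name__ == "__main__":
    recovered = remediate(is_healthy, restart)
    print(f"recovered={recovered}, restarts={service['restarts']}")
```

The point of the sketch is only that the procedure is written down once, is repeatable, and has a bounded number of attempts, instead of depending on a person following steps by hand.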

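In the traditional view of operations, system operations and software development are usually treated as two separate fields, which can create a gap between them. DevOps culture is advocated for this reason: it aims to foster collaboration and communication between development and operations teams. SRE, however, is different from DevOps in essence, even though the market tends to conflate the two.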

Note

Original text :: What is Site Reliability Engineering (SRE)?

What is Site Reliability Engineering (SRE)?

SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.

On top of that, in Google, we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment – not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work.

What we do as SRE

Our job is a combination not found elsewhere in the industry. Like traditional operations groups, we keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors.

How We SRE At Google

As SRE, we flip between the fine-grained detail of disk driver IO scheduling to the big picture of continental-level service capacity, across a range of systems and a user population measured in billions.