Definition
SRE is what happens when you ask a software engineer to design an ops team. We want systems that run and repair themselves: automatic, not just automated.
Topics
- On-Call
- Monitoring
- Emergency Response
- Change Management
- Capacity Planning
- Risk Management
- Eliminating Toil
- Simplicity
  - SRE teams should:
    - Push back when accidental complexity is introduced into the systems for which they are responsible.
    - Constantly strive to eliminate complexity in systems they onboard and for which they assume operational responsibility.
- Hierarchy of Needs (from most basic to most advanced)
  - Monitoring
  - Incident Response
  - Postmortem with Root Cause Analysis (RCA)
  - Testing and Release Procedures
  - Capacity Planning
  - Development
  - Product