Definition
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Balance with Engineering Work
Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the more generalized, the better. It helps your team or the SRE organization handle a large service, or more services, with the same level of staffing.
Typical SRE activities involve
- Software engineering
- Involves writing or modifying code, in addition to any associated design and documentation work.
- Systems engineering
- Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort.
- Examples include monitoring setup and updates, load balancing configuration, server configuration, tuning of OS parameters, and load balancer setup.
- Also includes consulting on architecture, design and productionization for developer teams.
- Toil
- Overhead: Administrative work not tied directly to running a service.
Toil tends to be spiky, so a steady 50% may not be realistic.