A lesson every business, of any size, eventually learns is that failure is inevitable at some point in the production cycle. Things can go haywire at critical junctures, sending teams scrambling to find the root issue and restore service. The underlying causes are often many and varied, especially in large-scale systems with complex architectures and interdependencies. The initial response of many supervisors and managers is to figure out which member of their team messed up and assign blame. The hapless team member may then be reprimanded or reassigned, depending on the severity of the issue.
Is this the best way to handle an incident with team members?
In reality, this kind of blame assignment is counterproductive and contributes to a workplace culture of fear. John Lunney and Sue Lueder, senior site reliability engineers (SREs) at Google, wrote the following in “Postmortem Culture: Learning from Failure”, a chapter of Google's Site Reliability Engineering book:
“Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.”
Team members who fear retribution for past errors will hesitate to escalate issues, which ultimately harms the company. Fear leads to stress, and stress leads to poor choices, which lead to further chaos down the road.
How do we ensure that our team is performing unrestrained by fear?
Fostering blameless postmortems in the workplace is the key to understanding problems when they arise, getting on top of them, and preventing bigger problems later. The other great advantage is that it builds trust within teams, so people get the support they need to build great projects.
There are some ground rules:
- Words have power. Instead of asking who caused an incident, ask how it happened. Write down the results of these discussions to increase visibility across the team. This will help establish workflows and root out system weaknesses.
- Build a safe space for people who made a mistake to step up and talk about it. This shows the rest of the team how to avoid making the same mistake in the future, strengthening the system as a whole.
- Make sure the entire team understands blamelessness in the workplace. Conduct meetings to help them grasp the concept and understand that failure is a unique opportunity to learn.
Fundamentally, it all comes down to creating an environment of trust and identifying areas for improvement across the board. This also promotes transparency within the organization and improves how incidents are handled. At Zenduty, we believe in creating an environment where every employee is treated with respect and incidents are viewed as opportunities to make our teams stronger.