Zenduty Blog


Zenduty — SRE Puzzle of the week — Forgotten password

December 11, 2019

Vishwa Krishnakumar

Let’s say you have a safe and you have forgotten the password. You know it is a four-digit password, and you don’t want to type all 10,000 four-digit combinations into the safe.

The safe has the property that if the password is “1234”, you could type “01234” or “85624359571234” and it would open; it only cares about the last four digits pressed.

Write a function (in the comments section below) to produce a sequence of digits to enter into the keypad. Do share the joy of puzzle-solving with your friends.
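One classic answer (not spelled out in the post; this is just a hedged sketch) is a De Bruijn sequence over the digits 0–9 with window length 4: a string of 10,000 digits in which every four-digit code appears exactly once, so about 10,003 key presses open the safe instead of 40,000.

```python
def de_bruijn(k, n):
    """Return a De Bruijn sequence B(k, n): every length-n string over the
    digits 0..k-1 appears exactly once as a (cyclic) substring."""
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(str(d) for d in sequence)


def keypad_sequence():
    # The sequence is cyclic, so append its first 3 digits to also cover the
    # combinations that wrap around the end when typed linearly.
    seq = de_bruijn(10, 4)
    return seq + seq[:3]


print(len(keypad_sequence()))  # 10003 key presses instead of 40000
```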

Incident Alert Routing — reducing noise and getting woken up only by alerts that matter

December 10, 2019

Vishwa Krishnakumar

Site reliability engineers have one of the toughest roles in any organization, if not the toughest. While dealing with incidents is one part of the job, the other is building reliable systems. Google’s SRE book sums up this approach nicely:

“At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect. We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time. The work of reducing toil and scaling up services is the “Engineering” in Site Reliability Engineering. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than either a pure Dev team or a pure Ops team.”

One of the most important challenges for an SRE in balancing firefighting against toil reduction is alert noise. For a particular engineer, an alert can be considered noisy if it conveys no information about an event, past or imminent, that significantly affects the end user or customer, and if it is not actionable on the part of that engineer. Alert noise not only distracts your engineers from their engineering duties but also causes needless stress, increases anxiety, and exacerbates mental and physical health issues that your engineers may already be facing. It will come as no surprise that organizations with the noisiest alerts also have the highest engineering churn rates in the industry.

Every monitoring or IT management tool sends a lot of alerts daily. While some alerts are important and actionable, most of them are noise. Nobody likes to wake up to an alert at 2 AM unless it’s a major incident or an outright catastrophe. We took a close look at incident management services and how they help teams reduce alert noise. We were surprised to see that all of them have implemented only bare-bones features to mitigate alert noise and, despite being in the market for 7–10 years, have failed to treat alert noise reduction as a problem that needs to be solved in a meaningful way.

How Zenduty can help you shrink a river of alerts to a trickle

When we started building Zenduty, solving the problem of alert noise was at the top of our list of priorities. And so we did, with Alert Rules. The following are the criteria on which you can route your alerts in Zenduty, along with operators like <, ≤, >, ≥, ==, !=, contains (substring), not contains, regex match, is empty, is not empty, in (multi-select), not in (multi-select), between, and not between:

  1. Payload Search — Every alert from a monitoring integration sends its own custom payload along with the alert parameters. For example, a critical alert from Jenkins will also contain a payload like {“job”: “build-prod”, “job-number”: 5, “status”: “SUCCESS”}. You can run a key search on the alert payload and apply the above operators to the value

  2. Incident Message and Incident Summary — Route if you find a keyword in the alert message or summary

  3. Alert Time of the day — Route depending on the time of the day

  4. Alert Day of the week — Route depending on the day of the week

  5. Alert Date — Route if the incident falls on or between specific dates

  6. Alert Type — Route depending on the type of the alert (critical, error, warning, acknowledged, resolved, and info)

The following are the actions you can take if the alert routing conditions are met:

  1. Suppress — Suppress the incident

  2. Change Alert Type to — Let’s say the alert type of the incoming alert is critical (which would create an incident) and you do not want the alert to create an incident; you can change the alert type to “info”.

  3. Add Note — Automatically add a note to the resultant incident that will help the on-call engineers and incident commander determine the RCA and triaging steps

  4. Route to Escalation Policy — Route to an escalation policy that is not the default on the service affected. This may be useful in cases where you might need a more aggressive or less aggressive escalation depending on the type of alert that has come in.

  5. Assign User — Assign the incident to a specific user. This is useful when a known component is affected and the fastest way to resolve the incident is to assign it to a known subject matter expert

  6. Change Incident Urgency — Change the urgency of the incident. If the alert creates a low-priority incident and your conditions detect a potentially high-priority incident, you can conditionally change the priority to high, and vice versa.

  7. Change Alert Message — Create your own custom incident title from the alert payload

  8. Change Alert Summary — Create your own custom incident summary from the alert payload

  9. Change Alert Entity ID — Change the alert entity_id to implement custom deduplication

  10. Assign Role to User — After the alert creates an incident, assign a particular role to a particular user

  11. Assign Incident Tag — Automatically add a tag to the newly created incident depending on the matching condition. For example, if a backend component is affected, add a “backend” tag

  12. Add Incident Task and Template (coming soon) — Automatically add an incident task template or a task to the incident

The Zenduty Alert Rules allow you to fine-tune your incident alerting at many levels and make sure that the right team members are alerted when specific conditions are met, add notes and tags, route to a different escalation policy than the service default policy, customize the incident title and summary, and most importantly suppress the incident.
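To make the condition-and-action model concrete, here is a minimal, purely illustrative sketch in Python; the field names, operators, and action strings below are assumptions made for the example and are not Zenduty’s actual rule syntax or API:

```python
import re

# Hypothetical rule engine sketch: a rule is a set of conditions plus a list of
# actions, evaluated against an incoming alert.
OPERATORS = {
    "==": lambda value, target: value == target,
    "contains": lambda value, target: str(target) in str(value),
    "regex": lambda value, target: re.search(target, str(value)) is not None,
}

def evaluate_rule(rule, alert):
    """Return the rule's actions if every condition matches the alert, else an empty list."""
    for cond in rule["conditions"]:
        value = alert.get(cond["field"])
        if not OPERATORS[cond["op"]](value, cond["target"]):
            return []
    return rule["actions"]

# Example: demote a noisy staging build alert to informational and add a note.
rule = {
    "conditions": [
        {"field": "message", "op": "contains", "target": "build-staging"},
        {"field": "status", "op": "==", "target": "FAILURE"},
    ],
    "actions": ["change_alert_type:info", "add_note:known flaky staging job"],
}

alert = {"message": "Jenkins job build-staging failed", "status": "FAILURE"}
print(evaluate_rule(rule, alert))  # ['change_alert_type:info', 'add_note:known flaky staging job']
```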

If you’re looking for an end-to-end incident management platform with awesome alert routing and response management capabilities, do give Zenduty a spin and leave us your feedback in the comments below.

And finally, be zen!

On-call doesn't have to be stressful

November 29, 2019

Amrit Balraj

“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided. This chapter describes the primary tenets of the approach to on-call that Google’s Site Reliability Engineers (SREs) have developed over the years, and explains how that approach has led to reliable services and sustainable workload over time.” - Google SRE book

On-call schedules are a necessary evil in maintaining the reliability of your services. Being on-call is a foreboding experience; even when things are going well, it can leave you feeling drained and exhausted. The culture of DevOps fosters rapid deployment, SRE, and continuous integration, which creates a need for quick remediation of incidents when they occur. Several organizational initiatives, such as flexible working hours and remote-friendly offices, help make the on-call experience more bearable. Beyond that, there are several measures you can take yourself to make your experience better:

Optimize your environment: Ensure that your home is a well-organized on-call environment, with all of your devices in easily accessible places. Even if you are woken from deep sleep, your brain shouldn’t have to waste time figuring out where everything is. This will vastly improve your response time when notifications are interrupting your sleep.

Clean up your notifications: Scrubbing your notifications is an effective on-call tactic, as notifications will be flying at you from one or more sources. Set up customized notification rules depending upon the severity of the incidents and your SLAs.

Urgent notifications from family members can be filtered through based on rules you’ve set, eliminating chaos during times of duress.
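As a rough illustration of what severity-based notification rules can look like (the channels and delays below are hypothetical, not a specific product’s settings):

```python
# Purely illustrative per-urgency notification rules: each entry is a list of
# (channel, delay-in-minutes) pairs.
NOTIFICATION_RULES = {
    "high": [("push", 0), ("phone_call", 1)],   # page immediately, call after 1 minute
    "medium": [("push", 0), ("email", 5)],
    "low": [("email", 30)],                     # batch low-urgency alerts
}

def channels_due(urgency, minutes_since_triggered):
    """Return the channels that should have fired by now for a given urgency."""
    return [channel for channel, delay in NOTIFICATION_RULES.get(urgency, [])
            if minutes_since_triggered >= delay]

print(channels_due("high", 2))   # ['push', 'phone_call']
print(channels_due("low", 10))   # [] -- nothing yet, the email waits 30 minutes
```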

Personal Care: On-call shifts can stress you out, even when everything is going according to plan. Prioritize your mental health: take mini-breaks whenever you can, stretch, grab a snack, or get some fresh air. Night shifts can throw a spanner into your sleep schedule, but regular sleep is vital. Without proper sleep your brain will incrementally slow down, which may cause you to make more mistakes, ultimately affecting your response efficiency.

Disrupted routines can also lead to stress eating, which will ruin your physical well-being in the long run. Make sure to get enough rest, sunlight, and exercise to help you find your zen during on-call schedules.

Have your own incident management procedure: In addition to your organization’s incident management plan, have your own plan for contingencies. An incident management plan is only as strong as the sum of all its parts.

Have a personal checklist or questionnaire to tick off as you go through the motions. Post-incident, focus on the end-user experience and work backward: quickly identify what went wrong and the time to detection and resolution for future post-mortem meetings. A big part of making the on-call experience better is building more resilient infrastructure by learning from our experiences.

Set up the right tools: Using the right tools will go a long way in helping you keep your focus on the issues that need to be prioritized. Zenduty is a state-of-the-art incident management platform that gives you a unified view of all the notifications from your infrastructure, so that every member of the team is aware of an incident no matter where they are.

The platform also provides incident analytics to help you prepare for post-mortem meetings and refine your incident management plans. With Zenduty you can improve your team’s MTTR while helping them focus on the alerts that matter.

Custom escalation routes can be defined pre-incident based on work schedules, along with pre-defined incident roles, so that everyone knows who is working on what, eliminating chaos.

Zenduty has support for 100+ services, including Sentry, Freshdesk, AWS CloudWatch, New Relic, and Zabbix.

Ultimately, it all comes down to your physical and mental well-being; prioritizing this will make on-call much easier to manage. A perfect system is not one that never fails, but one that provides a sustainable environment for response and restoration.

The importance of GameDays

November 18, 2019

Amrit Balraj

GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins, who created them with the intention of increasing reliability by purposefully triggering major failures on pre-planned dates. GameDays help put the values of chaos engineering into practice. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated, continuously changing systems, outages are inevitable. Chaos engineering ensures that IT teams learn from the mistakes that can happen, within a controlled environment.

Netflix was one of the pioneers in the field when they decided to move their services to the cloud in 2010. They developed Chaos Monkey (https://netflix.github.io/chaosmonkey/) to test their system’s stability through random disruption of services within their architecture. Buoyed by the success of the tool, they went on to develop additional tools, known as the Simian Army, for different aspects of chaos testing.

What are the benefits of GameDays?

IT ecosystems today depend on cloud engineering, with infrastructure built on microservices, which means that a disruption of service can be a costly affair, especially for up-and-coming small companies. GameDays help teams find their unique blind spots, making them better prepared when something happens in real time.

How to start running your GameDays:

Preparation: Ensure that every member of the team is aware of the date and time beforehand. All information must be shared on team calendars, and clear communication channels have to be established. GameDays typically run between two and four hours and primarily involve the engineers who work closely with the targeted systems. Install chaos engineering tools and establish processes before beginning, to ensure observability of everything that is breaking, along with methods of mitigation. Assign someone to document everything that happens, creating a knowledge base for new and existing members of the workforce.

Selection of experiments: With everyone present, take a good look at your system architecture on the whiteboard and ask yourselves what could go wrong. There are plenty of tools that can be used to inject failure into your systems; you can even design internal tools to have better control over everything. Some popular tests to run are latency tests, data center failure, faulty code injection, etc. Contain your blast radius initially by testing the least dependent system, then scale up as you gain confidence in your systems and processes. Collect all relevant data from these experiments for the next stage.
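To give a sense of how small such an experiment can start, here is a minimal, purely illustrative sketch of injecting latency and failures around a single function call; it is not one of the dedicated chaos tools mentioned above, and the names and rates are assumptions for the example:

```python
import functools
import random
import time

def chaos(failure_rate=0.2, max_latency_s=2.0):
    """Hypothetical fault-injection decorator for a small GameDay experiment:
    randomly adds latency or raises an error around the wrapped call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_latency_s))       # latency experiment
            if random.random() < failure_rate:
                raise RuntimeError("chaos: injected failure")   # failure experiment
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.3)
def fetch_user_profile(user_id):
    # Stand-in for a real downstream dependency exercised during the GameDay.
    return {"id": user_id, "name": "demo"}

if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(fetch_user_profile(attempt))
        except RuntimeError as exc:
            print(f"attempt {attempt}: {exc}")
```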

Retrospective meeting: Recap everything that happened and ask team members to come forward and explain what they learned from the experiments. Have in-depth discussions to understand what can go wrong, MTTR, required architectural updates, and how well your incident response systems kicked into action.

This is a typical workflow of what a GameDay would look like; different teams can adapt it to their own requirements. GameDays help teams stay on their toes while managing critical distributed systems. The goal is to run experiments in a way that doesn’t cause harm to users and stakeholders.

Site Reliability Engineering-Why you should adopt SRE

November 11, 2019

Amrit Balraj

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google’s services were reliable, secure, and functional. He and his team eventually wrote the book on SRE, which is available online for free to anyone interested in researching and implementing SRE best practices.

Site reliability engineering is the application of software development practices to operational problems, essentially bridging the gap between development and operations teams. Treynor himself has described SRE as follows:

“ SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.”

The primary goal of SRE is creating scalable, available, and highly reliable software systems. This applies not just to companies at Google’s scale; small companies can also use SRE to strengthen their architecture. Site reliability engineers work with production, operations, and end users to develop metrics and goals for service reliability. In other words, SREs help companies define SLIs and SLOs for uptime and availability.

At this point, you may be wondering: what is the difference between SRE and DevOps?

DevOps and SRE share similar core philosophies: unifying development and operations teams, automating processes, and saving time during deployment. However, there are key differences between the two. DevOps is a culture that emphasizes streamlining development and non-development environments, whereas SRE is more focused on the sysadmin role and the production environment. DevOps embraces failure as a learning experience to make systems stronger, while SRE prioritizes the balance between incidents and the release of new features.

Both DevOps and SRE help organizations deliver software more efficiently while bridging gaps between IT and non-IT departments. Both methodologies also advocate the use of automation tools to decrease manual toil, further increasing reliability.

Why should you implement SRE?

SLA-driven performance monitoring: Site reliability engineers help measure and analyze your performance against predefined SLA guidelines. The SRE team ensures efficient incident response, monitoring, and system performance so that there is no breach of agreements and subsequent loss of revenue. There are also allowances for error budgets, which set a clear standard for how unreliable a service is allowed to be in a single quarter. SRE helps companies find a balance between reliability and innovation.
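As a quick worked example of the error-budget idea (the SLO value and downtime figure below are hypothetical): a 99.9% availability SLO over a 90-day quarter allows roughly two hours of downtime before the budget is exhausted.

```python
# Back-of-the-envelope error budget for a quarterly availability SLO of 99.9%.
slo = 0.999
minutes_in_quarter = 90 * 24 * 60                     # 129,600 minutes

error_budget_minutes = (1 - slo) * minutes_in_quarter
print(round(error_budget_minutes, 1))                 # 129.6 minutes, roughly 2.2 hours

downtime_so_far = 45                                  # hypothetical downtime this quarter
print(f"{downtime_so_far / error_budget_minutes:.0%} of the budget spent")  # 35%
```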

Embrace risk: No service can be reliable 100% of the time; that is an unrealistic and expensive goal. SRE culture understands that risk is part of the game, and the challenge is identifying what can go wrong and protecting systems from it.

Continuous delivery: Successful SREs practice something known as “chaos engineering”, in which faulty code is deliberately injected into healthy working systems in a controlled manner to test response times, MTTR, and incident management. This is crucial in a continuous delivery environment, where new features are pushed every few hours, to safeguard against catastrophic failure.

Cross-team skills: As already mentioned, SRE leverages the skills of developers and sysadmins to build stronger, more balanced systems. Collaboration between these two disciplines is integral to developing high-quality software. SREs focus mainly on improving reliability, while engineers can focus on new features, fostering innovation.

Common areas to focus on while adopting site reliability engineering:

Make sure that all teams understand and believe in SRE while introducing it. If not enough people get on board during implementation, your organization may end up with ineffective results. Communication is key to adoption: make sure all teams are on the same page when you are making changes in operations, to ensure cross-team implementation. Integration should be a seamless process, with tasks kept as simple as possible. Develop customized checklists for different roles to ease the transition. “Blameless post mortems” are another important aspect of SRE: meetings in which there are no accusations, only productive learning from mistakes so that they are not repeated.

Documentation is another key aspect, as memory can be a fickle thing. Make sure that post mortem meetings are documented and published internally, and that the documentation is easy to search.

These are the basic outlines of the advantages of adopting SRE, which is a never-ending process. SRE teams are constantly learning, evolving, and fine-tuning their skills to improve the reliability of modern systems. Once SRE is successfully implemented, review your team every six months to ensure it is not repeating the mistakes it initially made.

Relationships between Operations and Development Teams

October 16, 2019

Amrit Balraj

Modern businesses are evolving rapidly with the advent of cloud, CI/CD, and microservices. However, there is still an extensive and obvious divide between principal business stakeholders and development teams. Development teams are often unaware of the challenges faced by operations teams, and vice versa.

This is where the need to adopt DevOps principles comes into the picture. DevOps came into existence as the natural successor to Agile practices in software development. For a versatile delivery process that adapts to continuous change and delivery, the agile concept of continuous feedback needs to be introduced. Continuous feedback is a practice in which negative and positive feedback is constantly gathered and fed back into production.

For instance, daily standup meetings allow members of the development team to share status updates and highlight challenges in their way. Additionally, the sprint review meeting is an opportunity to present changes to a broader group, including executives and end users. In addition to evaluating goals for the project, the group can provide feedback on the current solution and surface unforeseen needs before the next sprint.

Feedback loops have to be a combination of agile best practices, automation tools, and efficient documentation. Testing is also an integral part of developing fast feedback loops. Implementing DevOps best practices will add to the overall worth of an organization, including the following:

  • Efficient code delivery
  • Quicker deployment to market
  • Higher-quality output
  • Improved collaboration between developers and operations
  • A culture that unites business, development, and operations together for excellent responsiveness to market demands

To stay valuable to customers, operations and developers need to act as one collaborative unit. Development teams are required to design, deploy, test, and troubleshoot their software swiftly and reliably, while in parallel operations teams are compelled to monitor systems and identify issues before they happen. The adoption of DevOps culture is an organizational change that primarily requires a shift in mindset. When successful communication channels are established between managers, developers, and operations through DevOps practices, there will be fewer conflicts of interest within teams and an increase in the overall quality and speed of delivery for customers.


Zenduty is a cutting edge incident management platform designed by developers keeping the well-being of engineers in mind. Sign up for free here.

ChatOps-The future of collaboration

October 7, 2019

Amrit Balraj

ChatOps is the use of chatbots to unify communication and collaboration. Through ChatOps, every member of a team is aware of what the other members are working on. It is the logical next step in the evolution of team communication after email and IM.

Projects today are developed at a global scale with millions of potential users, which means that teams are larger and often work in shifts or even remotely. ChatOps tools can also be integrated with monitoring, diagnosis, notification, and data mining tools to make any team more efficient and focused.

GitHub opened the door to the adoption of this concept back in 2011 when they introduced Hubot. Initially, it was developed as a way to automate their internal company chat room; today it is an open-source, highly customizable chatbot with custom scripts and support for a host of plugins. Jesse Newland, principal product engineer at GitHub, described Hubot as:

“Putting tools directly in the middle of conversation”

The major advantages of ChatOps can be summarized as:

Automation: Repetitive manual tasks can become tedious and consequently prone to error. Automating tasks frees up a lot of precious engineering hours, which increases the overall productivity of the team. This is especially valuable in incident monitoring and initial response. ChatOps provides a seamless communication channel where everyone can be brought up to speed on what updates are happening and where they can help.

Visibility: With ChatOps, every team member’s work is in full view. If your team is distributed across different shift timings and/or locations, chatbots make the workflow transparent. This benefits CI, as people can focus their time on day-to-day work without trying to find out who introduced what and when.

There are plenty of powerful bots on the market that can be customized to how your team functions as a whole. Some of the more popular ones are:

  • Hubot
  • Err, developed in Python.
  • Lita, developed in Ruby.

Teams are always evolving to become faster, more productive, and more dynamic, and ChatOps gives them tools to leverage their time. Through the integration of ChatOps platforms, you can have a unified communication channel where everyone can see the workflow. This is especially crucial in incident management, as everyone can see changes, fixes, and updates across the board.
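As a purely illustrative sketch of “putting tools in the middle of conversation” (the command syntax and handler below are assumptions for the example, not Hubot’s or any bot framework’s actual API):

```python
import re

# Hypothetical command registry mapping chat commands to operational actions.
COMMANDS = {}

def command(pattern):
    """Register a handler for chat messages matching a regular expression."""
    def register(func):
        COMMANDS[re.compile(pattern)] = func
        return func
    return register

@command(r"^!ack (?P<incident_id>\d+)$")
def acknowledge(incident_id):
    # A real bot would call your incident management API here.
    return f"Incident {incident_id} acknowledged; the channel has been notified."

def handle_message(text):
    for pattern, handler in COMMANDS.items():
        match = pattern.match(text)
        if match:
            return handler(**match.groupdict())
    return None

print(handle_message("!ack 42"))
```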

Post Mortems- Bringing clarity to incident reviews

October 3, 2019

Amrit Balraj

An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what do they entail? A post mortem is a post-incident activity that helps organizations understand how an incident happened and learn from it. Service incidents are an unavoidable hurdle for any company; when they do happen, the teams involved will be wholly focused on restoring service as quickly as possible.

Most of the time they don’t have the time to find the root of the problem, as they are focused on fixing the problem at hand. Organizations generally hold post-mortem meetings to assess what happened once service has been restored. An incident’s resolution only truly ends once the entire team has participated in the post-mortem meeting.

The entire team needs to participate in post mortems, as the people who worked on the front lines of incident management can share insight from their experience and increase the collective knowledge of the whole team. Post mortems tend to be structured differently depending on the organization, and there is no single playbook for running them efficiently. Teams need to view post mortems as a learning exercise to understand what they have to do to avoid such issues in the future.

Here are some guidelines to help your company conduct streamlined post mortems:

Document everything: Post mortems are not just meetings where people talk; they are a record of everything that was done during an incident. The documentation from these meetings becomes reference material for future teams, helping them resolve downtime when it happens. If all the steps taken during an incident are documented immediately afterward, they serve as a game plan for engineers facing a similar incident down the line.

Devise a system to ensure that this documentation is available to the entire team, for transparency and preparedness. This also serves stakeholder communication well.

Foster blamelessness: Industry leaders like Google talk extensively in their Site Reliability Engineering handbook about the importance of blamelessness during post mortems. Post mortems are time-consuming processes and need to be free of unnecessary blame assigned to individuals who may or may not be at fault. Blame creates an environment of fear, with people unwilling to come forward and explain where they went wrong. Studies have shown that this negatively impacts the growth of an organization in the long run.

Blamelessness encourages team members to talk about the actions they took, assumptions they made and establish timeframes. This also builds trust among team members where they can openly talk about their concerns without fearing repercussions. A trusting team contributes to the strength of the system as a whole.

Track post mortems: As companies refine their incident reviews, they build up a database of valuable documentation for future incident responders. This can benefit current teams as well, especially if the tracking system has been set up for quick access during times of crisis.

Over time, they can be reviewed by executives to spot patterns and potential weaknesses.

Every technology worker knows that downtime is unavoidable; it comes when least expected, leaving behind a trail of chaos. Post mortems help teams view incidents calmly, without the stress of an ongoing incident. There are dozens of publications and handbooks by industry leaders to help organizations with best practices in post-incident reviews.


Zenduty is a cutting edge incident management platform designed by developers keeping the well-being of engineers in mind. Sign up for free here.

The importance of Incident Roles

September 30, 2019

Amrit Balraj

Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even when all members have unified communication channels, an interruption in the service is bound to create chaos. The frontline response team has to be on its toes to get to the root issues at the first signs of trouble.

According to IBM’s Cost of a Data Breach Report 2019, the average cost of a breach globally was $3.92 million, with the US holding the highest average at $8.19 million. Any disruption in service is a potential breach of SLAs and puts a company’s profitability at stake. The report further highlights that the long-term costs of an incident can linger for up to two years afterward.

An integral part of any good response plan is preset incident roles. Incident roles ensure that responsibilities are made clear pre-event, eliminating noise and the chaos that follows. Team leads who manage day-to-day operations of the service will know which team members will be effective in frontline response roles.

The guidelines for developing a framework for incident response are:

  • Encourage on-call engineers to take the initiative to delegate important tasks to team members based on their core strengths.
  • Establish roles pre-incident and test out scenarios with different team members playing roles they are comfortable with.
  • Keep your stakeholders informed with regular status updates. They need to know whether an issue is ongoing or has been resolved.

With incident roles predefined, your IRPs can be adaptive and iterated on as you face new challenges and learn from experience. With Zenduty, you can automate your IRP by defining incident roles before an incident occurs, which puts an end to chaos when an outage happens. You can also create task templates of your choice and assign them to your incident roles, making incident mitigation a seamless procedure.
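To make the idea of pre-defined roles and task templates concrete, here is a small, purely illustrative sketch; the role names, descriptions, and tasks are assumptions for the example, not Zenduty’s actual configuration format:

```python
# Hypothetical incident-role and task-template definitions.
INCIDENT_ROLES = {
    "incident_commander": "Owns coordination and major decisions",
    "communications_lead": "Posts regular status updates to stakeholders",
    "subject_matter_expert": "Drives the technical investigation and fix",
}

TASK_TEMPLATES = {
    "incident_commander": [
        "Confirm severity and declare the incident",
        "Assign the remaining incident roles",
        "Schedule the post-mortem meeting",
    ],
    "communications_lead": [
        "Post an initial status update within 15 minutes",
        "Update stakeholders every 30 minutes until resolution",
    ],
}

def kickoff_checklist(role):
    """Return the pre-defined task list for a given incident role."""
    return TASK_TEMPLATES.get(role, [])

print(kickoff_checklist("incident_commander"))
```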

In the long run, these steps will make your IRP more transparent and open to evaluation, strengthening your system as a whole. You can safely make decisions to streamline your service delivery and reduce the effects of an incident or outage.


Zenduty is a cutting edge incident management platform designed by developers keeping the well-being of engineers in mind. Sign up for free here.

Fostering blamelessness at the workplace

September 20, 2019

Amrit Balraj

An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There will be times when things go haywire at critical junctures, sending teams scrambling to rectify the root issue and restore service. The underlying causes are often many and varied, especially in large-scale systems with complex architecture and interdependence. The initial response of many supervisors or managers is to figure out which individual on their team messed up and assign blame. Consequently, the hapless team member may be reprimanded or reassigned based on the severity of the issue.

Is this the best way to handle an incident with team members?

In reality, this type of blame assignment is counterproductive and contributes to a workplace culture of fear. John Lunney and Sue Lueder, senior site reliability engineers (SREs) at Google, wrote the following in their chapter “Postmortem Culture: Learning from Failure”:

“ Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.”

Team members fearing retribution for past errors will hesitate to escalate issues, which ultimately harms the company in the long run. Fear leads to stress, stress leads to poor choices, and poor choices lead to further chaos down the road.

How do we ensure that our team is performing unrestrained by fear?

Fostering blameless post mortems in the workplace is the key to understanding problems when they arise, getting on top of them, and preventing bigger problems from arising. The other great advantage is that it builds trust within teams, which ensures they get the support they need to build great projects.

There are some ground rules:

  • Words have inherent power: instead of asking who, why, or when, ask how an incident happened. Write down the results of discussions to increase visibility across the team. This will help establish workflows and root out system weaknesses.
  • Build a safe space for the people who made a mistake to step up and talk about it. This will show the rest of the team how to avoid making the same mistake in the future, strengthening the system as a whole.
  • Make sure the entire team understands blamelessness in the workplace. Conduct meetings to help them grasp the concept and understand that failure is a unique opportunity to learn.

Fundamentally, it all comes down to creating an environment of trust and identifying areas for improvement across the board. This also promotes transparency within the organization and greatly improves incident handling. At Zenduty, we believe in creating an environment where every employee is treated with respect and incidents are viewed as opportunities to make our teams stronger.


Zenduty is a cutting edge incident management platform designed by developers keeping the well-being of engineers in mind. Sign up for free here.
