Zenduty Blog
Product updates and our thoughts and ideas on incident management and SRE best practices
Category incident-management
Sprint planning: How to prioritize urgent production issues?
It is with a great deal of anticipation and excitement that I’m announcing our $1.9M round of funding, led by StartupXSeed Ventures along with participation from marquee enterprise SaaS investors...
In the past year of COVID-related working from home, we are seeing increasing burnout in engineering teams worldwide. More and more devs are partially checked out and may...
Recommended reads on Resilience engineering and SRE
Prometheus is by far one of the most popular open-source monitoring tools, used by millions of engineering teams globally, with a robust community and continued adoption and evolution.
As companies today race to build site reliability engineering (SRE) practices within their engineering teams, site reliability engineering has become one of the hottest and highest-paying jobs in tech....
Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes...
One of the primary things you need to figure out whenever your team is formulating your incident management process is describing in words what a Sev0 (your highest incident priority) looks...
Nagios is one of the most widely used open-source network monitoring tools, used by thousands of NOC teams globally to monitor the health of a vast array of their hosts...
One of the first things incident managers do when they get an alert page from Zenduty is to check the “Context” tab of the incident. Incident context is extremely critical...
Recently, one of our customers, a 20-member NOC team of a large B2C company, had set up Zabbix to monitor a network of over 1,000 servers, routers, and switches. The...
Zendesk is one of the most popular ticketing support and customer service platforms available in the market. Two metrics that measure the effectiveness of your customer support are the response...
Google Cloud Platform (GCP) is a collection of Google’s computing resources, made available via services to the general public as a public cloud offering.
Microsoft Azure is a cloud computing service providing infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS), supporting multiple Microsoft-specific and third-party...
Grafana is one of the most popular open-source visualization tools that can be used on top of a variety of different data stores but is most commonly used together with...
As businesses close more deals and add more accounts, it is still imperative for businesses to maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a...
Task management best practices for remote/work-from-home teams
Incident management for remote/WFH teams
Teams is Microsoft’s versatile chat and collaboration solution for enterprise communication. Teams comes bundled with Office 365, offering chat, file sharing, and a host of other collaborative features. The platform also...
An incident response plan is a pre-devised action stratagem for IT teams on how to respond to critical IT events efficiently. As modern applications continue to grow in scale and...
DevOps is an organizational philosophy that enables continuous delivery and continuous deployment with a focus on continuous testing, automation and collaboration among dev teams, business, and operations teams. Consequently, continuous...
Grafana is an open-source platform for data visualization, monitoring, and analysis. It’s designed around providing context-rich visualizations, mainly through graphs, but also supports other ways to present data through pluggable...
As we head into 2020, it’s clear that DevOps has finally crossed the divide and gone mainstream. With DevOps firmly ingrained as a standard practice, we now look at how...
Incident Alert Routing — Getting woken up only by alerts that matter to you
Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service...
We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). This was many months in the making...
Incident Alert Routing — Reducing Alert Noise
GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins when he created them intending to increase reliability by purposefully creating major failures on pre-planned dates. Game Days help...
An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to...
Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if...
An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There might be times where things go haywire at...
Every organization is different in the way it functions as a whole, with different approaches to operations management, functionality and legal structure. However, every company, whether big or small, faces...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. HetrixTools is a blacklist check and monitoring software that monitors if your...
Organizations try to be proactive about incidents and downtime. An outage or downtime can be devastating to the bottom line of businesses, not to mention a poor experience for their...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Ghost Inspector helps you build or record automated website tests in your...
An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Incident management...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. AppOptics helps you monitor applications, infrastructure, and servers in one platform.
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Server Density is a hosted server monitoring service that provides server and...
Incident management is a process in which a team of engineers working in a firm identify, analyse and correct such incidents to prevent recurrences. Incidents can make or break organisations....
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Logzio’s services enable one to manage logs and get log analysis services....
Incident-respondents are like superheroes. They get distress-calls at all times of the day, and they try their best to resolve the problem before the fire spreads. Like any superhero, they...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Humio is a log management software that provides instant monitoring, analysis and...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Hosted Graphite is a graphite monitoring, alerting and Grafana dashboard platform for...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Scout’s ATS Integration and Machine Learning pairs the right search firms with...
Our guide to making on-call and managing incidents Zen. Inspired by Tim Peters’ The Zen of Python
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. With Wavefront’s services, cloud monitoring and analytics will reduce downtime and boost...
An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Management of...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Site24x7 offers comprehensive monitoring for critical network devices such as routers, switches...
Agile systems and the adoption of DevOps have posed new challenges for ITSM teams. Agile development, with its focus on continuous deployment, often poses a challenge to the ITSM teams. Incident...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. LogDNA is an advanced dashboard that allows you to instantly centralize, monitor,...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Bugsnag monitors application stability, so you can make data-driven decisions on whether...
Whether you’re a small, medium or large company with a mobile strategy, it is critical that you monitor your app’s performance constantly. Uptime and reliability can make or break your...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...
According to NENA, an estimated 240 million calls are made to 9–1–1 in the U.S. each year. These calls are handled by approximately 6000 public safety answering points (PSAP) located...
Looking for an inexpensive way to keep yourself and your SRE team updated on all alerts and collaborate with them faster? Look no further than Zenduty Slack Alerts!
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...
Teams responsible for incident management have so much in common with professional sports teams. Members of both have to be reactive, responsive, strategic and agile! Since they are so similar,...
This article talks about why you should spend the next 1 hour setting up alerting and on-call management systems for your startups.
So you are starting a new job as an SRE, and expect to be on call anytime now. This is your first time in the role, and you want to...
Category best-practices
Sprint planning: How to prioritize urgent production issues?
Recommended reads on Resilience engineering and SRE
Prometheus is by far one of the most popular open-source monitoring tools, used by millions of engineering teams globally, with a robust community and continued adoption and evolution.
As companies today race to build site reliability engineering (SRE) practices within their engineering teams, site reliability engineering has become one of the hottest and highest-paying jobs in tech....
One of the primary things you need to figure out whenever your team is formulating your incident management process is describing in words what a Sev0 (your highest incident priority) looks...
Nagios is one of the most widely used open-source network monitoring tools, used by thousands of NOC teams globally to monitor the health of a vast array of their hosts...
One of the first things incident managers do when they get an alert page from Zenduty is to check the “Context” tab of the incident. Incident context is extremely critical...
Zendesk is one of the most popular ticketing support and customer service platforms available in the market. Two metrics that measure the effectiveness of your customer support are the response...
Grafana is one of the most popular open-source visualization tools that can be used on top of a variety of different data stores but is most commonly used together with...
As businesses close more deals and add more accounts, it is still imperative for businesses to maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a...
Task management best practices for remote/work-from-home teams
Incident management for remote/WFH teams
Datadog is an application performance monitoring and analytical SaaS for cloud infrastructure. Datadog enables DevOps teams, SREs and IT operation teams to optimize their systems for uptime and availability. Modern...
An incident response plan is a pre-devised action stratagem for IT teams on how to respond to critical IT events efficiently. As modern applications continue to grow in scale and...
Grafana is an open-source platform for data visualization, monitoring, and analysis. It’s designed around providing context-rich visualizations, mainly through graphs, but also supports other ways to present data through pluggable...
Incident Alert Routing — Getting woken up only by alerts that matter to you
Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service...
We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). This was many months in the making...
Incident Alert Routing — Reducing Alert Noise
“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of...
GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins when he created them intending to increase reliability by purposefully creating major failures on pre-planned dates. Game Days help...
Modern businesses are evolving rapidly with the advent of cloud, CI/CD and microservices. However, there still exists an extensive and obvious divide between principal business stakeholders and development teams. Development...
An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to...
Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if...
An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There might be times where things go haywire at...
Organizations try to be proactive about incidents and downtime. An outage or downtime can be devastating to the bottom line of businesses, not to mention a poor experience for their...
An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Incident management...
Incident management is a process in which a team of engineers working in a firm identify, analyse and correct such incidents to prevent recurrences. Incidents can make or break organisations....
Incident-respondents are like superheroes. They get distress-calls at all times of the day, and they try their best to resolve the problem before the fire spreads. Like any superhero, they...
Our guide to making on-call and managing incidents Zen. Inspired by Tim Peters’ The Zen of Python
Agile systems and the adoption of DevOps have posed new challenges for ITSM teams. Agile development, with its focus on continuous deployment, often poses a challenge to the ITSM teams. Incident...
According to NENA, an estimated 240 million calls are made to 9–1–1 in the U.S. each year. These calls are handled by approximately 6000 public safety answering points (PSAP) located...
Looking for an inexpensive way to keep yourself and your SRE team updated on all alerts and collaborate with them faster? Look no further than Zenduty Slack Alerts!
Teams responsible for incident management have so much in common with professional sports teams. Members of both have to be reactive, responsive, strategic and agile! Since they are so similar,...
This article talks about why you should spend the next 1 hour setting up alerting and on-call management systems for your startups.
So you are starting a new job as an SRE, and expect to be on call anytime now. This is your first time in the role, and you want to...
Category mental-health
Zenduty — SRE Puzzle of the week — Forgotten password
“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of...
An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There might be times where things go haywire at...
Scientists have discovered a direct link between foods consumed by people and their productivity levels. This forces us to rethink our dietary habits and consume superfoods that are rich in...
Neglecting yourself and your family and your work is still not getting better even though you work the extra hours and have no free time? Your solution might not be...
Meditation is an immediate stress-buster and can calm your nerves in minutes. You can find your balance and energy to work easily by practising meditation for just a while.
Our guide to making on-call and managing incidents Zen. Inspired by Tim Peters’ The Zen of Python
Do you travel for hours together to get to your workplace? Does the travel exhaust you even before your day at work begins? Do you get stressed because of being...
Wellness is a choice that we make every day to lead a better, healthier and fulfilling life. Wellness is more than just being disease free; it is a dynamic operation...
_“Life has a tendency to fall apart when you’re awake, so stay in bed!”_
When your meeting begins in 5 minutes and you are panicking or you have an important assignment due by the end of the day, heading out to relax and cool...
A sound mind and body is the only way ahead to achieve one’s full potential at the work place. There are diseases and irregularities now affecting workers at their workplaces...
Believe it or not, being mindful of your breathing can help improve your immune system and reduce stress. According to a study at the University of South Carolina, adults who...
The term SRE, or Site Reliability Engineering, has been around for over a decade. SRE aims to create ultra-scalable and highly reliable software systems by incorporating principles from software...
We spend a large part of our lives working, and how we feel often affects how we do at work. Here is a list of small steps you can take...
Mental health problems are more common than you may think: at least 1 in 5 people of working age suffer from mental health problems of some degree. Surprisingly, though, the...
Category business
Highlights of the Redesign
Highlights of the New Release
It is with a great deal of anticipation and excitement that I’m announcing our $1.9M round of funding, led by StartupXSeed Ventures along with participation from marquee enterprise SaaS investors...
In the past year of COVID-related working from home, we are seeing increasing burnout in engineering teams worldwide. More and more devs are partially checked out and may...
As companies today race to build site reliability engineering (SRE) practices within their engineering teams, site reliability engineering has become one of the hottest and highest-paying jobs in tech....
Zendesk is one of the most popular ticketing support and customer service platforms available in the market. Two metrics that measure the effectiveness of your customer support are the response...
As businesses close more deals and add more accounts, it is still imperative for businesses to maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a...
Task management best practices for remote/work-from-home teams
Incident management for remote/WFH teams
Datadog is an application performance monitoring and analytical SaaS for cloud infrastructure. Datadog enables DevOps teams, SREs and IT operation teams to optimize their systems for uptime and availability. Modern...
As we head into 2020, it’s clear that DevOps has finally crossed the divide and gone mainstream. With DevOps firmly ingrained as a standard practice, we now look at how...
Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service...
“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of...
GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins when he created them intending to increase reliability by purposefully creating major failures on pre-planned dates. Game Days help...
Modern businesses are evolving rapidly with the advent of cloud, CI/CD and microservices. However, there still exists an extensive and obvious divide between principal business stakeholders and development teams. Development...
ChatOps is the implementation of chatbots to unify communication and collaboration. Through ChatOps every single member of a team will be aware of what the other members are working on....
An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to...
Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if...
Every organization is different in the way it functions as a whole, with different approaches to operations management, functionality and legal structure. However, every company, whether big or small, faces...
Organizations try to be proactive about incidents and downtime. An outage or downtime can be devastating to the bottom line of businesses, not to mention a poor experience for their...
An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Incident management...
Incident management is a process in which a team of engineers working in a firm identify, analyse and correct such incidents to prevent recurrences. Incidents can make or break organisations....
Do you travel for hours together to get to your workplace? Does the travel exhaust you even before your day at work begins? Do you get stressed because of being...
Agile systems and the adoption of DevOps have posed new challenges for ITSM teams. Agile development, with its focus on continuous deployment, often poses a challenge to the ITSM teams. Incident...
Retail is a sector that increasingly requires cutting edge technology. Whether shoppers purchase online or in-store, a good, organised point-of-sale system keeps your store functioning seamlessly. Ideally, all aspects of...
According to NENA, an estimated 240 million calls are made to 9–1–1 in the U.S. each year. These calls are handled by approximately 6000 public safety answering points (PSAP) located...
Teams responsible for incident management have so much in common with professional sports teams. Members of both have to be reactive, responsive, strategic and agile! Since they are so similar,...
This article talks about why you should spend the next 1 hour setting up alerting and on-call management systems for your startups.
Category use-cases
It is with a great deal of anticipation and excitement that I’m announcing our $1.9M round of funding, led by StartupXSeed Ventures along with participation from marquee enterprise SaaS investors...
In the past year of COVID-related working from home, we are seeing increasing burnout in engineering teams worldwide. More and more devs are partially checked out and may...
Recommended reads on Resilience engineering and SRE
Prometheus is by far one of the most popular open-source monitoring tools, used by millions of engineering teams globally, with a robust community and continued adoption and evolution.
One of the primary things you need to figure out whenever your team is formulating your incident management process is describing in words what a Sev0 (your highest incident priority) looks...
Nagios is one of the most widely used open-source network monitoring tools, used by thousands of NOC teams globally to monitor the health of a vast array of their hosts...
One of the first things incident managers do when they get an alert page from Zenduty is to check the “Context” tab of the incident. Incident context is extremely critical...
Recently, one of our customers, a 20-member NOC team of a large B2C company, had set up Zabbix to monitor a network of over 1,000 servers, routers, and switches. The...
Zendesk is one of the most popular ticketing support and customer service platforms available in the market. Two metrics that measure the effectiveness of your customer support are the response...
Google Cloud Platform (GCP) is a collection of Google’s computing resources, made available via services to the general public as a public cloud offering.
Microsoft Azure is a cloud computing service providing infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS), supporting multiple Microsoft-specific and third-party...
Grafana is one of the most popular open-source visualization tools that can be used on top of a variety of different data stores but is most commonly used together with...
As businesses close more deals and add more accounts, it is still imperative for businesses to maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a...
Task management best practices for remote/work-from-home teams
Teams is Microsoft’s versatile chat and collaboration solution for enterprise communication. Teams comes bundled with Office 365, offering chat, file sharing, and a host of other collaborative features. The platform also...
DevOps is an organizational philosophy that enables continuous delivery and continuous deployment with a focus on continuous testing, automation and collaboration among dev teams, business, and operations teams. Consequently, continuous...
Grafana is an open-source platform for data visualization, monitoring, and analysis. It’s designed around providing context-rich visualizations, mainly through graphs, but also supports other ways to present data through pluggable...
Incident Alert Routing — Getting woken up only by alerts that matter to you
We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). This was many months in the making...
Incident Alert Routing — Reducing Alert Noise
ChatOps is the implementation of chatbots to unify communication and collaboration. Through ChatOps every single member of a team will be aware of what the other members are working on....
An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to...
Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. HetrixTools is a blacklist check and monitoring software that monitors if your...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Ghost Inspector helps you build or record automated website tests in your...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. AppOptics helps you monitor applications, infrastructure, and servers in one platform.
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Server Density is a hosted server monitoring service that provides server and...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Logzio’s services enable one to manage logs and get log analysis services....
Incident-respondents are like superheroes. They get distress-calls at all times of the day, and they try their best to resolve the problem before the fire spreads. Like any superhero, they...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Humio is a log management software that provides instant monitoring, analysis and...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Hosted Graphite is a graphite monitoring, alerting and Grafana dashboard platform for...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Scout’s ATS Integration and Machine Learning pairs the right search firms with...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. With Wavefront’s services, cloud monitoring and analytics will reduce downtime and boost...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Site24x7 offers comprehensive monitoring for critical network devices such as routers, switches...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. LogDNA is an advanced dashboard that allows you to instantly centralize, monitor,...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Bugsnag monitors application stability, so you can make data-driven decisions on whether...
Whether you’re a small, medium or large company with a mobile strategy, it is critical that you monitor your app’s performance constantly. Uptime and reliability can make or break your...
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...
Looking for an inexpensive way to keep yourself and your SRE team updated on all alerts and collaborate with them faster? Look no further than Zenduty Slack Alerts!
Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...
Category chatops
Development and DevOps teams increasingly focus on collaboration for faster CI/CD and keeping their systems agile. But collaboration is easier said than done when teams are distributed across geography and...
Category microsoft-teams
One of the primary things you need to figure out whenever your team is formulating your incident management process is describing in words what a Sev0 (your highest incident priority) looks...
Nagios is one of the most widely used open-source network monitoring tools, used by thousands of NOC teams globally to monitor the health of a vast array of their hosts...
Category nagios
One of the primary things you need to figure out whenever your team is formulating your incident management process is describing in words what a Sev0 (your highest incident priority) looks...
Nagios is one of the most widely used open-source network monitoring tools, used by thousands of NOC teams globally to monitor the health of a vast array of their hosts...
Category on-call
Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes...
Category team-management
Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes...
Category product-management
Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes...
Category product update
Highlights of the Redesign
Highlights of the New Release
Category android
Highlights of the New Release
Category ios
Highlights of the New Release
Category redesign
Highlights of the Redesign
Category web
Highlights of the Redesign