Archive of posts with category 'incident-management'

On-call compensation models

Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes...

Defining your Sev-1s

One of the first things your team needs to figure out when formulating its incident management process is how to describe in words what a Sev0 (your highest incident priority) looks...

Sending Nagios alerts to Microsoft Teams and rapid incident response with Zenduty

Nagios is one of the most widely used open-source network monitoring tools, relied on by thousands of NOC teams globally to monitor the health of a vast array of their hosts...

Using context to triage change-triggered incidents

One of the first things incident managers do when they get an alert page from Zenduty is to check the “Context” tab of the incident. Incident context is extremely critical...

Real-time alerts from Zabbix and escalation with Zenduty

Recently, one of our customers, a 20-member NOC team at a large B2C company, set up Zabbix to monitor a network of over 1000 servers, routers, and switches. The...

Accelerating your Zendesk customer support response times by 50% and meeting support SLAs

Zendesk is one of the most popular ticketing support and customer service platforms available in the market. Two metrics that measure the effectiveness of your customer support are the response...

Monitoring service health and downtime events within your Google Cloud with Zenduty

Google Cloud Platform (GCP) is a collection of Google’s computing resources, made available via services to the general public as a public cloud offering.

Sending Azure Monitor outage notifications to Microsoft Teams

Microsoft Azure is a cloud computing service providing infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) supporting multiple Microsoft-specific and third-party...

Azure service health alerts and escalation with Zenduty

Microsoft Azure is a cloud computing service providing infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) supporting multiple Microsoft-specific and third-party...

Grafana alerts to SMS/Phone/Slack/Microsoft-Teams and incident escalation with Zenduty

Grafana is one of the most popular open-source visualization tools that can be used on top of a variety of different data stores but is most commonly used together with...

Meeting customer support SLAs on Freshdesk using proactive alerting and escalations with Zenduty

As businesses close more deals and add more accounts, it remains imperative that they maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a...

Efficient task management for remote/work-from-home teams

Task management best practices for remote/work-from-home teams

Incident management for remote/WFH teams

Incident management with Microsoft Teams and Zenduty

Teams is Microsoft’s versatile chat and collaboration solution for enterprise communication. Teams comes bundled with Office365, offering chat, file sharing, and a host of other collaborative features. The platform also...

Incident Response - how great companies do it

An incident response plan is a pre-devised action stratagem for IT teams on how to respond to critical IT events efficiently. As modern applications continue to grow in scale and...

Monitoring with New Relic - Everything you need to get started

DevOps is an organizational philosophy that enables continuous delivery and continuous deployment with a focus on continuous testing, automation and collaboration among dev teams, business, and operations teams. Consequently, continuous...

Grafana - Everything you need to know

Grafana is an open-source platform for data visualization, monitoring, and analysis. It’s designed around providing context-rich visualizations, mainly through graphs, but it also supports other ways to present data through pluggable...

Site reliability engineering - Predictions for 2020

As we head into 2020, it’s clear that DevOps has finally crossed the divide and gone mainstream. With DevOps firmly ingrained as a standard practice, we now look at how...

Incident Alert Routing — Getting woken up only by alerts that matter to you

Making on-call superheroes

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service...

Incident Response 2.0 — The Zenduty Incident Command System(ICS)

We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). This was many months in the making...

Incident Alerts — Reducing Alert Noise

Incident Alert Routing — Reducing Alert Noise

The importance of GameDays

The term GameDay was coined by Amazon’s “Master of Disaster” Jesse Robbins, who created GameDays to increase reliability by purposefully creating major failures on pre-planned dates. GameDays help...

Site Reliability Engineering - Why you should adopt SRE

Site reliability engineering was a term coined by Google engineer Benjamin Treynor in 2003 when he was tasked with making sure that Google services were reliable, secure and functional. He...

Post Mortems - Bringing clarity to incident reviews

An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity to...

The importance of Incident Roles

Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if...

Fostering blamelessness at the workplace

An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There might be times where things go haywire at...

The true cost of unreliability

Every organization is different in the way it functions as a whole, with different approaches to operations management, functionality and legal structure. However, every company, whether big or small, faces...

Real-time incident management with HetrixTools

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. HetrixTools is a blacklist check and monitoring software that monitors if your...

How to manage Incident Response Efficiently

Organizations try to be proactive about incidents and downtime. An outage or downtime can be devastating to the bottom line of businesses, not to mention a poor experience for their...

Real-time incident management with Ghost Inspector

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Ghost Inspector helps you build or record automated website tests in your...

Tackling the challenges of Incident management

An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Incident management...

Real-time incident management with AppOptics

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. AppOptics helps you monitor applications, infrastructure, and servers in one platform.

Real-time incident management with Server Density

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Server Density is a hosted server monitoring service that provides server and...

Nailing the Incident Management Process

Incident management is a process in which a team of engineers identifies, analyses and corrects incidents to prevent recurrences. Incidents can make or break organisations....

Real-time incident management with Logzio

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Logzio’s services enable one to manage logs and get log analysis services....

Responders make Incident Resolution Better

Incident responders are like superheroes. They get distress calls at all times of the day, and they try their best to resolve the problem before the fire spreads. Like any superhero, they...

Real-time incident management with Humio

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Humio is log management software that provides instant monitoring, analysis and...

Real-time incident management with Hosted Graphite

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Hosted Graphite is a Graphite monitoring, alerting and Grafana dashboard platform for...

Real-time incident management with Scout

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Scout’s ATS Integration and Machine Learning pairs the right search firms with...

The Zen of SRE

Our guide to making on-call and managing incidents Zen. Inspired by Tim Peters’ The Zen of Python

Real-time incident management with Wavefront

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. With Wavefront’s services, cloud monitoring and analytics reduce downtime and boost...

A beginner's guide to Incident Management

An incident is an event or an occurrence that could lead to the disruption of services and operations of an organisation or could lead to them making losses. Management of...

Real-time incident management with Site24x7

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Site24x7 offers comprehensive monitoring for critical network devices such as routers, switches...

Incident Management in an Agile Setup

Agile systems and the adoption of DevOps have posed new challenges for ITSM teams. Agile development, with its focus on continuous deployment, often poses a challenge to the ITSM teams. Incident...

Real-time incident management with LogDNA

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. LogDNA is an advanced dashboard that allows you to instantly centralize, monitor,...

Real-time incident management with Bugsnag

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. Bugsnag monitors application stability, so you can make data-driven decisions on whether...

Real-time incident management with Firebase Crashlytics

Whether you’re a small, medium or large company with a mobile strategy, it is critical that you monitor your app’s performance constantly. Uptime and reliability can make or break your...

Real-time incident management with Atatus

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...

Incident Management for Emergency Services!

According to NENA, an estimated 240 million calls are made to 9–1–1 in the U.S. each year. These calls are handled by approximately 6000 public safety answering points (PSAP) located...

Zenduty Now Alerts your Slack!

Looking for an inexpensive way to keep yourself and your SRE team updated on all alerts and collaborate with them faster? Look no further than Zenduty Slack Alerts!

Real-time incident management with Freshdesk

Incident management works best when all of your incidents and alerts can be tracked from a centralized hub. When these incidents come attached with problems, changes, releases or assets for...

What ITOps Teams Can Learn From Sports Teams

Teams responsible for incident management have so much in common with professional sports teams. Members of both have to be reactive, responsive, strategic and agile! Since they are so similar,...

Why Young Startups Should Invest in On-Call Right Now

This article explains why you should spend the next hour setting up alerting and on-call management systems for your startup.

Preparing for On-Call

So you are starting a new job as an SRE, and expect to be on call anytime now. This is your first time in the role, and you want to...