TL;DR: RCAs, Postmortems and Task Prioritization

Learn how to prioritize tasks, get things moving by tackling unblocked tasks first, write effective postmortems, perform RCAs faster, and keep your high-priority (P0) dashboard from becoming overburdened. The article below should help you plan your product or feature launch faster without compromising the reliability of existing services.

“All P0s are equal. But some P0s are more equal than others” - Napoleon

I had a rather spirited conversation this week with one of our customers - an engineering leader at an e-commerce company that is the market leader in its category by a long shot. One of the things that kept him up at night was that every postmortem or RCA meeting after a major incident produced a set of action items, many of them marked as high-priority (P0).

But due to the sheer pace of innovation and deployments, the engineering team was already pushed to its limits and a lot of those “high-priority” tasks stayed in the Jira backlog, with resolution dates being pushed forward every month.

He conceded that the major “business-impacting” items were resolved quickly, but at the same time lamented that he would need to wait a quarter or two for that pristine, all-green SLO dashboard. I have never seen anyone so focused on eliminating tech debt and toil - he knew the long-term ramifications of technical debt like the back of his hand.

“Move fast and break things”

At some point in your journey, you’ll have to deal with the shards of glass on the floor that will make it awfully difficult for you to even move. After the call, we sat in our conference room on a cool Friday afternoon and tried to make sense of his predicament, which is all too common, especially in early-stage engineering teams. A mental “word-cloud” of the call would look something like this:

The conversation helped me break down the enigma of prioritizing tasks and delegating effectively while P0 incidents are lurking.

Chapter 1: Priorities and Dependencies

Are we choosing the right metrics here? Is assigning a “priority” enough to rally engineers from multiple teams to come together to solve a problem? It was abundantly clear that business impact, specifically revenue and churn impact, was an instant motivator. But why?

Because growing the revenue and reducing churn was an organization-wide objective, and not just a high-priority line item. Objectives are motivation, not priorities. That is not to say we can ditch priorities altogether. Priorities can be a tool to achieve objectives in the fastest time possible.

The priority of something can be anything depending on who you ask (unless that priority is set by the top cats; that one is real).

A task can be defined as “high-priority” if and only if another “high-priority” task depends on it.

Most of us have a long list of high-priority “to-do” items. Some like to keep them the old-school way by writing everything down, some just keep a notepad open, and some use tools like Asana. But the real challenge that most of us run into is, “Where to start?”

One way we go about it is by creating a priority list, P0, P1, P2, and so on… but the conversation around these priorities keeps moving back and forth. The reason is dependency!

THE CONUNDRUM: Even when everyone unanimously agrees that x is a P0, x can’t be done without y, which is a P2 task. But how can you do a P2 task before a P0, given the defined priorities?

Does the above sound familiar?

Let’s take a team that is trying to work on toil reduction for a service as an example. And for whatever reason, let’s say they can’t break out of this vicious cycle.

So how does this hypothetical team break out of it?

To avoid repeating the word “team” too often, let’s call the team “The Boys”. It’s a popular culture reference; for those who haven’t watched it yet, give it a shot after reading this article and thank me later.

We understand the problem at hand, now let’s look at what we can do about it.

First, let’s put all the tasks that The Boys think are important on a blank canvas. Once we have a bunch of items, let’s ask The Boys whether any of the items listed on the canvas depend on each other, and connect the dots between those tasks!

Once the process is complete, The Boys are amused, to say the least, to see that a bunch of line items have no dependency whatsoever. The answer to a simple question will now help The Boys break the cycle.

What are you trying to achieve? An enthu cutlet said, “Improving reliability for a service.” And how are you going to achieve this? With a smile on his face: “By making progress.”

That’s it! Get started on the tasks that have no dependencies to make progress, and throw the P0 and P1 labels out for now. As these tasks get completed, the progress will unlock other tasks, and The Boys will be able to achieve improved reliability!

To put that into perspective: to build demo_project, you need to build httpclient.o and logging.o. To build httpclient.o, you need to have libcurl on the system. You don’t have libcurl on the system yet, so you can’t build httpclient.o yet. But you can build logging.o, since it doesn’t depend on anything. So you can either work on logging.o or get libcurl going. Or, if you have two workers, do both at the same time. Voila!
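The build example above can be sketched with the graphlib module from Python's standard library (Python 3.9+). This is only an illustration: the dictionary restates the dependencies from the paragraph, and treating "libcurl" as a task that someone has to "get going" is an assumption.

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set (taken from the example above).
deps = {
    "demo_project": {"httpclient.o", "logging.o"},
    "httpclient.o": {"libcurl"},
    "logging.o": set(),
    "libcurl": set(),
}

def build_in_waves(deps):
    """Return a list of waves; everything in one wave can be worked on in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # finishing these may unblock others
    return waves

waves = build_in_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
# wave 1: libcurl, logging.o
# wave 2: httpclient.o
# wave 3: demo_project
```

With two workers, the first wave is exactly the "do both at the same time" case: one person gets libcurl going while the other builds logging.o.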

The best way to use priorities is to decide which tasks you want to get done first. If you have two chains of tasks, one blocking a P0 task and the other blocking a P3 task, do the first chain first. Stop treating priority lists as mandates; use them as a guideline for comparison instead.
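One way to read this is that a task inherits the urgency of the most important thing waiting on it. Here is a rough sketch of that idea, not a prescribed method. The task names (x, y, z, w) and the integer encoding of priorities (0 for P0, 3 for P3) are hypothetical; x and y mirror the conundrum described earlier.

```python
# Hypothetical tasks: declared priority (0 = P0) and what each task is blocked by.
declared = {"x": 0, "y": 2, "z": 3, "w": 1}
blocked_by = {"x": {"y"}, "z": {"w"}, "y": set(), "w": set()}

def effective_priorities(declared, blocked_by):
    """A blocker is at least as urgent as anything that waits on it."""
    effective = dict(declared)
    changed = True
    while changed:  # repeat until no priority can be raised any further
        changed = False
        for task, blockers in blocked_by.items():
            for blocker in blockers:
                if effective[task] < effective[blocker]:
                    effective[blocker] = effective[task]
                    changed = True
    return effective

effective = effective_priorities(declared, blocked_by)

# Unblocked tasks, most urgent chain first.
ready = sorted(
    (task for task, blockers in blocked_by.items() if not blockers),
    key=lambda task: effective[task],
)
print(ready)  # ['y', 'w']
```

Here y is only a P2 on paper, but because the P0 task x waits on it, it sorts ahead of w, which merely unblocks a P3.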

Chapter 2: The No Roadmap

After completing the first mission successfully, three members of The Boys are now out to do something extraordinary: build something from scratch. But how does one do that without a roadmap?

One sane way to approach this problem is to break this down into tasks and start executing these tasks manually while keeping notes.

To keep it interesting, let’s say The Boys have a task to build the “death star”. They know what a successful output would look like, but they also need to be ready to adapt to the conditions they encounter along the way. For that reason, The Boys start off by manually building a prototype and keeping notes on which parts of the process were awful.

Once they have a handle on what those awful parts were and how each could be made better, they can automate that step. Then they move one level up the process and do it again. This continues until the thing pretty much runs itself, and creates a half-decent list of “stuff needed to build a death star sanely”.

But, during this process, The Boys always had the “ooh, what if we tried this thing” idea every now and then so they needed a way to handle those in a humane way without having to make any promises that couldn’t be kept.

Let’s say Hughie, Queen Maeve and Starlight were the ones working on this. So how did they make the system work?

The three of them working on the project have different initials (H, Q, S), which the team uses to its advantage. During a meeting, when someone has an idea, they list it as part of a set: Hughie’s first idea would be H1, then H2, then H3, while Starlight’s ideas would go S1, S2, S3, and Queen Maeve’s ideas would be Q1, Q2, Q3 and so on.

This lets all our heroes pitch ideas while giving each idea a unique identifier so it doesn’t get lost, and without having to coordinate with each other to avoid collisions while incrementing the counter. All of the ideas go onto their Wiki page, and that’s about as concrete as they get for a while. They are then all able to see what everyone else is thinking about, look for duplicated ideas and overlap, and generally look for things that would help or hurt other things.
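The numbering scheme can be sketched as one counter per initial, so nobody has to check what anyone else has already numbered. The class name, method name and idea texts below are hypothetical. The sketch assumes, as the team does, that no two people share an initial.

```python
from collections import defaultdict
from itertools import count

class IdeaBoard:
    """Hands out IDs like H1, S2, Q1 using one independent counter per initial."""

    def __init__(self):
        self._counters = defaultdict(lambda: count(1))
        self.ideas = {}  # idea ID -> (author, text)

    def pitch(self, author, text):
        initial = author[0].upper()
        idea_id = f"{initial}{next(self._counters[initial])}"
        self.ideas[idea_id] = (author, text)  # remember who to ask for clarification
        return idea_id

board = IdeaBoard()
ids = [
    board.pitch("Hughie", "build the prototype by hand first"),
    board.pitch("Starlight", "decide where the superlaser goes"),
    board.pitch("Hughie", "automate the most painful manual step"),
    board.pitch("Queen Maeve", "write down what was awful"),
]
print(ids)  # ['H1', 'S1', 'H2', 'Q1']
```

Because each counter only ever belongs to one person, two people pitching at the same moment can never produce the same identifier.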

These are merely ideas, not commitments. Also, just because they had the idea didn’t mean they had to do anything with it. The only reason to tag them the way The Boys did is so they could be uniquely identified, assigned in a non-blocking fashion and so they could know who to ask for clarification if the gist of the idea was not clear from the wiki page.

“Hey Starlight, where were you thinking about placing the superlaser?”

“I wanted it at the center of the Imperial battle station”

“Oh, ok, that makes much more sense to me now, thanks!”

…you get the idea.

At some point, they start seeing the ideas as a part of a bigger picture, such as H1 might depend on H5, but then H5 depends on Q2. After doing that for a while, they ended up with several different types of list items.

Some items, like those in Chapter 1, didn’t have any blockers on them. They could be picked up and hacked on at any time, but doing them wouldn’t unblock anything or anyone else. Each would make for a nice afternoon of work on a self-contained project.

Then there were items that were blocked by one or more other items. They tended to be rather large.

And there were items that themselves blocked one or more other items.

The ones which are blockers but which are not themselves blocked are the obvious places to start working on stuff if they are looking for something to do. This is because if they started on something that was blocked, they would hit the blocker! Pretty obvious if you think about it, right?
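The item types above can be sorted out mechanically once the dependencies are written down. This is a minimal sketch, not a description of the team's actual tooling. Only the H1, H5, Q2 chain comes from the example earlier; the S items are made up for illustration.

```python
# Each idea maps to the ideas that block it.
# H1 <- H5 <- Q2 is the chain from the text; the S items are hypothetical.
blocked_by = {
    "H1": {"H5"},
    "H5": {"Q2"},
    "Q2": set(),
    "S1": set(),
    "S2": set(),
    "S3": {"Q2", "S2"},
}

def classify(blocked_by):
    """Split ideas into standalone, start-here (unblocked blockers) and blocked."""
    blockers = set().union(*blocked_by.values())  # ideas that something waits on
    groups = {"standalone": [], "start_here": [], "blocked": []}
    for idea in sorted(blocked_by):
        if blocked_by[idea]:
            groups["blocked"].append(idea)
        elif idea in blockers:
            groups["start_here"].append(idea)  # blocks others, not blocked itself
        else:
            groups["standalone"].append(idea)  # self-contained afternoon project
    return groups

groups = classify(blocked_by)
print(groups["start_here"])  # ['Q2', 'S2']
```

Q2 and S2 are the obvious places to start; S1 can be picked up whenever someone wants a self-contained afternoon project.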

By drawing this out as a bunch of chains (think labeled circles with arrows pointing at other circles) courtesy of the Wiki’s built-in “dot”/graphviz renderer, they can just look at the ends of them to figure out what was possible to start on right now. They can also see which items would have the most impact in terms of unblocking future work.

At the point when someone decided they were going to do something, they’d create a task for it. A “task” was the unit of currency in this empire’s bug tracking software. That meant it got a number and a nice little shortcut URL, so the wiki could be updated with a link to it. Anyone who was curious could follow that link into the tasks tool to see what was up.

A side note: the graph had been arranged to show things with different colors and rendering styles based on whether they were blocked or not, and whether they had been completed or not.
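A graph like this can be produced by generating Graphviz DOT text from the dependency data. The sketch below is only an approximation of what such a wiki page might contain. The specific colors and styles are assumptions, since the exact rendering is not described here.

```python
def to_dot(blocked_by, done=frozenset()):
    """Render the idea graph as DOT text; an arrow A -> B means A unblocks B."""
    lines = ["digraph ideas {"]
    for idea in sorted(blocked_by):
        open_blockers = blocked_by[idea] - done
        if idea in done:
            style = "style=dashed, color=gray"  # completed
        elif open_blockers:
            style = "style=filled, fillcolor=lightcoral"  # still blocked
        else:
            style = "style=filled, fillcolor=lightgreen"  # can be started now
        lines.append(f'  "{idea}" [{style}];')
    for idea in sorted(blocked_by):
        for blocker in sorted(blocked_by[idea]):
            lines.append(f'  "{blocker}" -> "{idea}";')
    lines.append("}")
    return "\n".join(lines)

# The H1 <- H5 <- Q2 chain from earlier, with Q2 already completed.
dot = to_dot({"H1": {"H5"}, "H5": {"Q2"}, "Q2": set()}, done={"Q2"})
print(dot)
```

Pasting the printed text into any Graphviz renderer should show H5 in green (ready to start), H1 in red (still blocked) and Q2 dashed (done).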

From time to time, they would go through and prune back the graph in order to remove the nodes which had been completed and were just taking up space. Watching things get struck through and then disappear from the list and the graph quite obviously felt great.
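Pruning can be sketched as dropping closed items from the dependency data and removing them from everyone else's blocker lists. The function name, the S4 item and the note text are hypothetical.

```python
def prune(blocked_by, finished):
    """Drop closed ideas and stop listing them as blockers of other ideas."""
    return {
        idea: blockers - finished
        for idea, blockers in blocked_by.items()
        if idea not in finished
    }

# Hypothetical state: Q2 was completed, S4 was struck through with a note.
blocked_by = {"H1": {"H5"}, "H5": {"Q2"}, "Q2": set(), "S4": set()}
closed = {"Q2": "done", "S4": "dropped: the prototype showed it was not needed"}

remaining = prune(blocked_by, set(closed))
print(remaining)  # {'H1': {'H5'}, 'H5': set()}
```

After pruning, H5 has no open blockers left, so it becomes the next thing that can be started.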

One important bit here: some of those items never left the list. Those items were just ideas that someone had one day, but as the death star went into production, it turned out they didn’t need it that badly or had the wrong idea altogether. Then, that was the end of that. Some of them were actually struck through on the list without being completed along with a note explaining why, just so they didn’t have to keep thinking about them.

This way of doing things can make you happier and more organized because it doesn’t leave you holding a bag full of promises. Disconnecting the talk from the commitment gives you the freedom to talk about all kinds of stuff without the worry of somehow ending up with tons of stupid and useless items down the road. You will still make commitments, of course - that’s what the tasks are for, just as they were for The Boys while building the Death Star.

How do you keep from having a soul-crushing load of incident tasks, Jira tickets or whatever that will never go away?

You don’t create one until you are damn sure that it needs to exist.

Author’s note: if you have so many things going on that you run the risk of “losing” something due to it only being a line item on a wiki page somewhere, you probably don’t want to try this on your project. You probably also want to run away from that project, since there’s far too much going on there.