Breaking Production the Right Way


Here are two completely opposite situations I’ve experienced. You might be surprised to learn that the situation that seemed easier was actually the worse of the two.

Situation 1 - The Wrong Way

Context

I joined an early-stage startup founded by two former colleagues as a founding engineer. At the time, we had a few customers who were barely using our app.

I was working on a feature, and although I don’t remember the exact reason, I needed to implement an adapter pattern to hide some messy code behind a clean interface.

The idea was to create a sexier interface while keeping the existing code as it was.

  • The existing code was doing its job, so there was no need to change it, even though it was messy.
  • The code had no tests, so changing it would have made it easy to introduce bugs.
  • It saved time. Focusing on the right tasks is critical in an early-stage startup.
  • But most importantly, it avoided a complex and tricky data migration. Basically, the migration involved JSON columns in a PostgreSQL table with an ill-defined schema.
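To make the idea concrete, here is a minimal sketch of such an adapter. It is not the original code: `legacy_fetch_settings`, `UserSettings`, `SettingsAdapter`, and every field name are hypothetical, chosen only to show the shape of the approach. The messy function stays untouched, and the rest of the code talks to a small typed interface.

```python
from dataclasses import dataclass
from typing import Any


def legacy_fetch_settings(user_id: int) -> dict[str, Any]:
    """Hypothetical stand-in for the existing, untested code; left as it is."""
    # Loosely structured blob, as it might come out of a JSON column.
    return {"usr": user_id, "prefs": {"lang": "fr", "notif": "1"}}


@dataclass(frozen=True)
class UserSettings:
    """Clean shape that the rest of the application works with."""
    user_id: int
    language: str
    notifications_enabled: bool


class SettingsAdapter:
    """Hides the legacy blob behind a typed interface without modifying it."""

    def __init__(self, fetch=legacy_fetch_settings):
        self._fetch = fetch

    def get(self, user_id: int) -> UserSettings:
        raw = self._fetch(user_id)
        prefs = raw.get("prefs") or {}  # tolerate a missing or null "prefs"
        return UserSettings(
            user_id=raw["usr"],
            language=prefs.get("lang", "en"),
            notifications_enabled=str(prefs.get("notif", "0")) == "1",
        )


settings = SettingsAdapter().get(7)
```

Callers only ever see `UserSettings`, so the stored data and the legacy function can stay exactly as they are, and no migration is needed.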

What Happened

I was pleased with this solution, and the feature worked as expected.

However, my CEO was not happy with the adapter pattern. He was an excellent data engineer and incredibly good at quickly shipping features, but he wasn’t a strong software engineer.

He insisted that the entities in the code must directly reflect the database tables. This made no sense, particularly when the adapter pattern was already working so well.

I told him why I was against this migration:

  • It was a waste of time.
  • It was too risky.

But in the end, he had the final say and insisted on the migration.

So, I went ahead with it, and as you might expect, it broke production.

Basically, I missed a few edge cases in the JSON schema that my tests didn’t help me detect. As mentioned earlier, this wasn’t an easy migration.
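One way to surface this kind of edge case earlier is a dry run: apply the conversion to every existing row without writing anything, and collect the rows that fail. The sketch below is only an illustration under assumptions. `convert_row`, the field names, and the sample rows are hypothetical and do not reflect the real schema or the actual migration.

```python
import json


def convert_row(raw: str) -> dict:
    """Hypothetical conversion from the old JSON shape to the new one."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # "tags" may be missing or null in old rows; normalize both to an empty list.
    return {"name": str(data["name"]), "tags": list(data.get("tags") or [])}


def dry_run(rows):
    """Try the conversion on every row; return (row_id, error) pairs for failures."""
    failures = []
    for row_id, raw in rows:
        try:
            convert_row(raw)
        except (ValueError, KeyError, TypeError) as exc:
            failures.append((row_id, f"{type(exc).__name__}: {exc}"))
    return failures


# Made-up rows standing in for the contents of the JSON column.
sample_rows = [
    (1, '{"name": "a", "tags": ["x"]}'),
    (2, '{"name": "b", "tags": null}'),
    (3, '{"title": "c"}'),   # missing "name"
    (4, '[1, 2, 3]'),        # not an object at all
]
failures = dry_run(sample_rows)
```

Run against a copy of the production data, a non-empty `failures` list points at the rows whose shape the migration does not handle yet.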

Fortunately, I only broke a part of the app, not the entire thing.

I had made a backup of the table before starting the migration, so I was able to restore the data, fix the issues, and complete the migration properly.

In the end, we didn’t lose any data, and the customers didn’t even notice that the feature had been broken for a while.

When I realized production was broken, I immediately informed my CEO and asked for his help in debugging and fixing the issue.

I expected him to jump on a call with me to work through it together. Instead, he simply sent a message saying it was unacceptable to break production—and that was it.

He remained upset with me for the rest of the day.

We had a team dinner at a restaurant planned for that evening. Our third colleague made a little joke about the production incident we had during the day. Suddenly, my CEO’s face turned red, and he started yelling at me.

At that moment, I knew I was going to leave the company. It wasn’t the kind of atmosphere I wanted to work in, especially since I had worked at another company where something like this would never have happened.

Situation 2 - The Right Way

Context

I worked for several years at a finance startup that became a unicorn. When I joined, the company had fewer than 40 employees, and by the time I left, it had grown to over 500.

Despite having thousands of customers, we still experienced production issues a few times a year.

How We Handled Issues in Production

Usually, everything started with a message on Slack—either an automatic alert from the monitoring system or a message from someone who discovered the problem.

From there, a group of people would quickly jump on a call to address it together. When production was down, it became the top priority for the entire team to resolve it. However, it wasn’t necessary for all engineers to be involved.

Once the issue was resolved, we tried to understand what went wrong.

Then, we wrote a postmortem that detailed the problem, the steps taken to fix it, how it could have been prevented, and any suggested improvements to the system.

Basically, we learned from our mistakes and grew as a team.

We might have teased the developer responsible a bit over a beer after work, but then we moved on.

Key Takeaways

Situation 1 occurred in an early-stage startup. Although no customer noticed the issue, my CEO’s reaction (along with similar reactions over the previous months) ultimately led me to leave the company.

Situation 2 took place in a scaling finance company with thousands of customers.

Despite breaking production several times, we:

  • Became a unicorn.
  • Didn’t have anyone leave the company due to a production issue.
  • Saw people grow a lot within the company.

Production should not be broken. But it’s important to accept that it will happen. The key is to respond appropriately when it does.

When production is down, it becomes the team’s top priority to resolve the issue.

Afterward, it’s crucial to gather and learn from our mistakes so they don’t happen again.

Pointing fingers at the responsible engineer is counterproductive. They are likely already aware of their mistake and may feel bad about it, so they are unlikely to repeat the same error.