New KPI: What's Your Deployment Pullback Ratio?

The larger your deployment to production, the more time you will spend fixing it

May 08, 2023

It’s that time of the week again. Time for a big deployment. Everyone is excited and nervous. The development team stands in a circle, all staring at each other while asking, “Are we sure we are going to do this?”

At some point, you have done all the testing and planning you can. You pray to the clouds and kick off the deployment. Here. We. Go!

You hope that Newton’s Third Law doesn’t come true:

Every action has an equal and opposite reaction

Most deployments don’t have an equal reaction, but they definitely have a reaction!

We immediately get sent into reaction mode, looking for problems and bugs. We are watching our application monitoring systems and looking for potential errors. Or at least the good teams are. The bad teams push code and just go to bed, hoping to not wake up to a nightmare.

With every deployment and code push forward, there is always a pullback.

Release, hotfix, hotfix

If you aren't familiar with AMC theaters, they are one of the largest movie theater chains in the world, and they are based in my hometown of Kansas City.

One day their engineering leader told me a funny story I will never forget.

“Every Monday, we do a production release. On Tuesday, we do 2 hotfixes, on Wednesday 1 hotfix, and if we are lucky, none on Thursday.”

That sounds terrible, but this sort of reality is quite common.

Production releases always have some sort of fallout. Even with perfect planning and QA, nothing is perfect. All the unit testing and QA in the world can’t fix human error, config changes, missing requirements, and other common problems. Sometimes the code deployment is just the start of the work for getting code changes live and thoroughly tested.

Introducing the pullback ratio

The larger a deployment is, the more effort it takes to fix bugs, correct issues, and apply hotfixes. I am going to call this the pullback ratio.

I call it the pullback ratio because it reminds me of the technical analysis of stocks or cryptos. During a bull run, or upward movement, there is a pullback to retest previous highs before going higher again. It looks very similar to how much effort it takes to fix issues from a deployment.

If you have done software development for any length of time, you know what I am talking about. After every release, we spend time fixing bugs from the release.

After 1-2 weeks of development time, it isn’t unusual to spend 1-2 days fixing bugs from the release.

As engineering leaders, we always try to improve how software releases are done and reduce the “pullbacks.” Tracking them as a KPI might be an excellent way to measure the quality of releases: record how much work was shipped and how much time was spent fixing post-release issues.
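
As a minimal sketch of that bookkeeping, the pullback ratio could be computed as post-release fix effort divided by the effort that was shipped. The `Release` class, its field names, and the person-day units are my assumptions for illustration, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """One production release (hypothetical record, not tied to any tool)."""
    name: str
    dev_days: float       # effort shipped in the release, in person-days
    pullback_days: float  # effort spent on post-release fixes and hotfixes


def pullback_ratio(release: Release) -> float:
    """Return post-release fix effort as a fraction of the effort shipped."""
    if release.dev_days <= 0:
        raise ValueError("dev_days must be positive")
    return release.pullback_days / release.dev_days


# Made-up sample data: two weeks of work with two days of fixes,
# and one week of work with half a day of fixes.
releases = [
    Release("release-a", dev_days=10, pullback_days=2),
    Release("release-b", dev_days=5, pullback_days=0.5),
]

for r in releases:
    print(f"{r.name}: pullback ratio {pullback_ratio(r):.0%}")
```

Any consistent unit works (hours, story points, tickets), as long as the same unit is used for both the work shipped and the fixes.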

Smaller releases, more often

Nothing is scarier than doing a release after weeks or months of work. You know the number of issues that come from it will be huge. Those sorts of big bang releases are a nightmare. You always want to do smaller, more incremental releases. There are a lot of pros and cons to this, though.

The con is that every release takes time. A certain amount of planning and “ceremony” is required to do a release properly. Ceremony time includes planning, communication, training, change control, release notes, SQL changes, QA, maintenance windows, telling impacted customers, release monitoring, etc. Don’t discount how much time those things take!

If it takes hours to do a release, doing more of them is a giant time suck. Granted, hotfixes versus true feature releases are also different. Also, if you push stuff right to prod and don’t do any of the things I mentioned… it doesn’t take too long. In bigger organizations and more mature companies, there are commonly a lot of processes.

The pro is that it lessens collateral damage and reduces risk. The odds of there being a problem are smaller. If there is a problem, it is much easier to know the cause since the changeset is smaller.

The trick is finding the right delta size of changes to ship at your company. Weekly feature releases are a great goal, but every 2 weeks isn't bad. Daily feature releases to production sound like pure chaos to me. Every company is different.

Improve releases with feature flags

Another way to control bugs and the pullback from releases is by using feature flags.

Feature flags, also known as feature toggles or feature switches, are a software development technique that allows developers to turn certain features of an application on or off without deploying a new version of the software.

These flags are conditional statements that control the visibility of specific features in the code. By using feature flags, developers can selectively enable or disable specific features in the codebase, making it easier to manage the release process and test new features before rolling them out to all users.

Feature flags can also be used to control the availability of certain features to different groups of users. For example, a feature flag could be used to make a new feature available only to beta testers or to roll out a new feature to a small percentage of users to test its performance before making it available to everyone.
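
Here is a rough sketch of how a percentage rollout with a beta-tester override might work. It is not how any particular vendor implements it. The function names, the flag name, and the user IDs are hypothetical. The idea is to hash the flag and user ID into a stable bucket from 0 to 99, so the same user always gets the same answer for a given flag.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    flag: str,
    user_id: str,
    rollout_percent: int,
    beta_testers: frozenset = frozenset(),
) -> bool:
    """Decide whether a feature is on for this user.

    Beta testers always get the feature. Everyone else gets it only if
    their bucket falls below the rollout percentage.
    """
    if user_id in beta_testers:
        return True
    return bucket(flag, user_id) < rollout_percent


# Hypothetical flag and users: roll out to 10% of users plus one beta tester.
testers = frozenset({"user-7"})
for uid in ("user-7", "user-42", "user-99"):
    state = is_enabled("new-checkout", uid, rollout_percent=10, beta_testers=testers)
    print(uid, "sees new checkout:", state)
```

Raising `rollout_percent` over time widens the audience without a new deployment, and setting it to 0 acts as a kill switch for everyone except the beta testers.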

Overall, feature flags provide a powerful tool for developers to manage the release process of new features and improve the stability and reliability of their software applications. You can check out a list of vendors in this article.

How to observe potential problems

I highly recommend that all companies have excellent application performance monitoring (APM) for production environments, including application errors, logs, performance data, and code-level traces. This is also known as "observability."

By having excellent monitoring in place, your team can move much faster and take on more risk, knowing that you have the eyes and the ears to immediately find problems. With good CI/CD pipelines, you can hotfix bugs very fast.

You can also use observability tools in QA environments to try and find exceptions, slow queries, and other problems before they get to production.

After doing a deployment, you need to use an APM/observability product like Stackify to monitor these things:

Error rates
New exceptions being captured
Transaction volumes
Transaction performance
Top SQL queries
Other external dependencies

Application errors and major performance or traffic changes via APM are usually the first signs of problems or major changes. It still surprises me how many developers are unfamiliar with these invaluable tools. Too many developers are blind at all times without these insights.
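
To make a few of those checks concrete, here is a small sketch that compares a window of metrics from before a deployment with one from after it. It does not use any real APM product's API. The `Window` class, the thresholds, and the sample numbers are all assumptions for illustration. In practice, you would pull these numbers from your monitoring tool.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """Metrics for one time window (hypothetical shape, made-up fields)."""
    requests: int
    errors: int
    exception_types: set = field(default_factory=set)

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def post_deploy_findings(
    before: Window,
    after: Window,
    max_rate_increase: float = 0.01,  # assumed threshold: +1 percentage point
    max_volume_change: float = 0.5,   # assumed threshold: 50% traffic swing
) -> list[str]:
    """Return human-readable warnings about changes since the deployment."""
    findings = []
    if after.error_rate - before.error_rate > max_rate_increase:
        findings.append(
            f"error rate rose from {before.error_rate:.2%} to {after.error_rate:.2%}"
        )
    new_types = after.exception_types - before.exception_types
    if new_types:
        findings.append("new exceptions: " + ", ".join(sorted(new_types)))
    if before.requests:
        change = abs(after.requests - before.requests) / before.requests
        if change > max_volume_change:
            findings.append(f"transaction volume changed by {change:.0%}")
    return findings


# Made-up sample data for one hour before and one hour after a deployment.
before = Window(requests=1000, errors=5, exception_types={"TimeoutError"})
after = Window(requests=1000, errors=40, exception_types={"TimeoutError", "KeyError"})
for finding in post_deploy_findings(before, after):
    print("WARNING:", finding)
```

The same comparison could be extended to transaction performance, top SQL queries, and external dependencies, given per-window timings for each.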

If you aren’t familiar with APM, check out this What is APM? article to learn more.

The pullback ratio is a potential new KPI

No matter what you do, you will always be a little nervous about production releases. By deploying smaller changesets more often, you can reduce the risk. Feature flags can help you control when code changes are activated and for which customers.

Observability tools can help you find and fix problems in QA before they get to production, or find them quickly if they do make it to production.

A potentially great new KPI to track for your team is how big a release was and how big the pullback was, meaning the hotfix effort needed to correct its issues. A similar metric is the defect escape rate, which measures how many defects made it to production.

The pullback ratio could be the measure of the time spent fixing bugs in relation to the size of the release. It is another way to potentially track and improve your releases.

Visionary CTO Newsletter