Emergency handling protocol
You rushed a feature out under a tight deadline, deployed it to live, and the alerts went crazy. Sound familiar? It sometimes happens even to the best of us. The most important thing here is to stop panicking, stop running in circles and start taking measures. But what measures, specifically? To answer this, an emergency handling protocol is needed.
When things go pear-shaped, it ultimately boils down to two options: make a hotfix, or do a rollback. A rollback isn't always possible to pull off, and a hotfix may take a while to prepare. It's of utmost importance to make the right choice.
The rollback
A rollback seems to be the most reasonable thing to do. However, before committing to it, consider the compatibility risks. The database model, the contracts with other services and the front-end (if any) must be compatible with, and tolerant of, the backend downgrade.
The hotfix
Instead of taking a step back in time, there is always the option to boldly go ahead and make a hotfix. However, this comes with trade-offs too:
- Before you can even make a fix, the problem must be diagnosed, and a solution (even a temporary one) must be found and implemented.
- You cannot just build a new image and push it to production, bypassing the CI/CD pipeline and all its checks, as you can easily make things even worse.
Making a hotfix by the book can easily take an hour or more, depending on the number of unit tests the CI/CD pipeline runs, the severity of the problem, etc. The question is: "Do you really have time for this?" Your clients can't work and the SLA is breached.
Ideally, there should be a document, prepared in advance, that lists the mission-critical scenarios. These scenarios form the core of your application, the value it brings to the users. It could be the happy path for placing an order, or a way to see the list of clients. When any of these is affected, you don't have time for hesitation.
The choice between a hotfix and a rollback should be made only after answering the question "Is any mission-critical functionality affected?" If the answer is yes, the preferred course of action is a rollback.
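The decision rule above can be sketched in a few lines of TypeScript. This is only an illustration: the scenario names, the `Incident` shape and the `chooseAction` function are hypothetical, and the real list should come from your own document of mission-critical scenarios. Treating "no critical scenario affected" as "there is time for a hotfix" is also an assumption of this sketch.

```typescript
// Hypothetical list of mission-critical scenarios; in practice it comes
// from the document prepared in advance.
const missionCriticalScenarios: ReadonlySet<string> = new Set([
  "place-order",
  "list-clients",
]);

type Action = "rollback" | "hotfix";

interface Incident {
  // Scenarios observed to be broken after the deployment.
  affectedScenarios: string[];
}

// Returns "rollback" when any mission-critical scenario is affected,
// otherwise "hotfix" (an assumption of this sketch).
function chooseAction(incident: Incident): Action {
  const critical = incident.affectedScenarios.some((scenario) =>
    missionCriticalScenarios.has(scenario),
  );
  return critical ? "rollback" : "hotfix";
}

console.log(chooseAction({ affectedScenarios: ["place-order"] })); // rollback
console.log(chooseAction({ affectedScenarios: ["export-report"] })); // hotfix
```

The sketch assumes the rollback itself is feasible, which is exactly what the compatibility measures are for.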
As I've mentioned above, the compatibility issues must be addressed. The following is a typical list of measures that make a rollback possible:
- Backward-compatible and re-applicable migrations. This means every migration only adds fields, doesn't change the existing data, and uses things like IF NOT EXISTS or inspects the schema before creating objects such as enums.
- Rejection of down migrations as a concept. Down migrations sometimes imply the removal of newly introduced database columns, etc. You don't want this, as you don't want to lose user data even after the rollback has happened.
- Backward-compatible contracts. Same idea: you only add fields, or you introduce new versions of messages whilst keeping the previous ones in place too.
- Backward-compatible front-end. The front-end should not collapse if a field that was there before is no longer found. It should degrade gracefully.
- Replayability of events. If your service sends information asynchronously, or consumes that kind of information, it should tolerate repeated messages.
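Two of these rules, graceful degradation on a missing field and tolerance to repeated messages, can be sketched in TypeScript as follows. Everything here is hypothetical and made up for illustration: the `OrderCreated` event, its fields, and the in-memory bookkeeping of processed event IDs (a real service would more likely keep that in durable storage).

```typescript
// Hypothetical event shape. `discountCode` is assumed to have been added by
// a newer backend version, so after a rollback it may be absent again.
interface OrderCreated {
  eventId: string;
  orderId: string;
  amount: number;
  discountCode?: string;
}

// Graceful degradation: fall back to a default when the optional field
// is missing, instead of crashing.
function describeOrder(order: OrderCreated): string {
  const discount = order.discountCode ?? "none";
  return `order ${order.orderId}: ${order.amount}, discount: ${discount}`;
}

// Replay tolerance: remember the IDs of events already handled and skip
// duplicates. In-memory only for this sketch.
const processedEventIds = new Set<string>();
const orderTotals = new Map<string, number>();

// Returns true if the event was applied, false if it was a duplicate.
function handleOrderCreated(order: OrderCreated): boolean {
  if (processedEventIds.has(order.eventId)) {
    return false;
  }
  orderTotals.set(order.orderId, order.amount);
  processedEventIds.add(order.eventId);
  return true;
}

const sample: OrderCreated = { eventId: "e-1", orderId: "o-42", amount: 100 };
console.log(describeOrder(sample)); // order o-42: 100, discount: none
console.log(handleOrderCreated(sample)); // true
console.log(handleOrderCreated(sample)); // false, the replay is ignored
```

Keying the duplicate check on an event ID rather than on the payload means that a replayed message is ignored even if the producer was rolled back in between.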
As long as you stick to these rules, you should be able to make a rollback safely and breathe out. Detailed written instructions on how to do the rollback won't hurt either, so that even a PM or a QA engineer could potentially do it.
I hope this blazingly short article was helpful. Till next time!
Sergei Gannochenko
Golang, React, TypeScript, Docker, AWS, Jamstack.
20+ years in dev.