Software Deployment Best Practices
Maximize customer value by deploying your software safely and quickly
Microservices
A Microservices architecture enables medium to large organizations to deliver customer value quickly and in small increments. Startups are usually better off starting with a monolith, because they are still searching for product-market fit and are not yet ready to scale. Modern capabilities such as Serverless Functions in the cloud (e.g. AWS Lambda) enable product teams to focus on customer value and pay per use.
Because of their distributed design, Microservices and Serverless Functions bring several operational complexities. One of them is the trade-off between deployment safety and speed.
Business growth
Business growth depends on product teams delivering the software products that customers need. Most companies operate in a competitive environment, where an unreliable customer experience can reduce product usage and slow business growth. Providing a consistent and reliable customer experience is therefore critical for any organization, and software deployments directly affect product reliability.
Learnings from past production incidents
Each organization is different, with its own tech stack and budget constraints. An organization can improve service reliability by learning from its past incidents. The best practices listed below are some of those derived from incident post-mortems at my current and previous organizations (AWS, Microsoft, and now Airbnb). You should add best practices for your own organization, grounded in what you learned from past incidents, to make your applications more reliable and enhance the customer experience.
Disclaimer: All content and opinions are my own. No organization, including my current and previous organizations listed above, has endorsed this content.
Best practices for software deployments
Use an automated and codified deploy pipeline: A deployment pipeline tool such as Spinnaker enables you to automate and codify your release workflow, making software releases safe, repeatable, and consistent.
Use a canary with a rolling or Blue-Green strategy: A canary release is a technique for introducing a new software version in production by rolling out the change to a small subset of users before rolling it out to the entire fleet. It limits your deployment blast radius (the number of users exposed to a poor-quality release), giving you confidence to roll out the new code to all customers. To deploy the new code to the rest of your fleet after the canary, you can use a rolling deployment or a Blue-Green deployment.
A rolling deployment applies a new software version by incrementally replacing instances in the production fleet, a few at a time, so that part of the fleet keeps serving traffic throughout.
A Blue-Green deployment minimizes downtime by making it easy to roll back a new change. It transfers user traffic from the previous version of an app to the new version while both are running in production. Compared with a rolling deployment, a Blue-Green deployment is safer, because rolling back only requires shifting the traffic back to the previous version, but it’s more expensive, because you need two production environments: one for the previous version (blue) and one for the new version (green).
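The Blue-Green cut-over and rollback described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class BlueGreenRouter and its methods are hypothetical names, and in a real system the switch would happen at a load balancer or DNS layer rather than in application code.

```python
from typing import Dict, Optional


class BlueGreenRouter:
    """Hypothetical model of a router in front of two production environments."""

    def __init__(self, current_version: str) -> None:
        # Blue starts as the live environment; green is idle and empty.
        self.versions: Dict[str, Optional[str]] = {"blue": current_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not receiving user traffic."""
        return "green" if self.live == "blue" else "blue"

    def stage(self, new_version: str) -> None:
        """Deploy the new version to the idle environment; users are unaffected."""
        self.versions[self.idle] = new_version

    def cut_over(self) -> None:
        """Shift user traffic to the idle environment."""
        if self.versions[self.idle] is None:
            raise RuntimeError("nothing is staged in the idle environment")
        self.live = self.idle

    def rollback(self) -> None:
        """Shift traffic back; the previous version is still running, so this is one switch."""
        self.cut_over()

    def serving(self) -> Optional[str]:
        """Version that users currently see."""
        return self.versions[self.live]


router = BlueGreenRouter("v1")
router.stage("v2")   # green now runs v2, users still see v1
router.cut_over()    # users see v2
router.rollback()    # users see v1 again
```

The cost mentioned above is visible in the model: both entries in versions stay populated after the cut-over, which is what makes the rollback a single switch.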
Run automated integration tests as part of your release workflow: Integration tests provide good coverage for your new code changes and can run in both pre-production and production environments. Prioritize the user flows that are critical, and invest only in high-quality, non-flaky tests for them, including backward-compatibility tests.
Use modern cloud-native infrastructure: Once your apps are running in containers, Kubernetes can continuously monitor the environment to maintain the desired state of your application. Kubernetes also simplifies zero-downtime deployments: with a strategy such as a rolling update with health checks, it gradually replaces old pods with new ones while continuing to serve clients.
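The idea behind a rolling update with health checks can be sketched in a few lines. This is a simplified model, not the Kubernetes API or its actual controller logic: the function rolling_update, the is_healthy callback, and the choice to restore a failing batch are all assumptions made for illustration.

```python
from typing import Callable, List


def rolling_update(
    fleet: List[str],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[int, str], bool],
) -> List[str]:
    """Replace instances batch by batch, stopping at the first unhealthy batch.

    fleet holds the version running on each instance. is_healthy(index, version)
    is a hypothetical health check supplied by the caller.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated = list(fleet)
    for start in range(0, len(updated), batch_size):
        batch = range(start, min(start + batch_size, len(updated)))
        previous = {i: updated[i] for i in batch}

        # Only this batch changes; the rest of the fleet keeps serving traffic.
        for i in batch:
            updated[i] = new_version

        if not all(is_healthy(i, updated[i]) for i in batch):
            # Assumption: restore the failing batch and halt the rollout.
            for i, version in previous.items():
                updated[i] = version
            break

    return updated


# Example: instance 2 fails its health check on v2, so the rollout halts there.
result = rolling_update(["v1"] * 4, "v2", 2, lambda i, v: not (i == 2 and v == "v2"))
# result == ["v2", "v2", "v1", "v1"]
```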
Gradual deploys to multiple regions (e.g. AWS regions): To build confidence in the release, deploy your code in multiple waves of increasing size. For example, in Wave-1, deploy the new change to just one Availability Zone (AZ) in a region using a canary release with a rolling deployment. In Wave-2, expand the deployment to multiple AZs in the same region with a rolling or Blue-Green deployment. In Wave-3, expand the deployment to a few regions. In Wave-4, deploy to all the remaining regions.
Make configuration changes go through a change management tool: Changes such as schema updates or major infrastructure changes are a big risk to your system’s reliability. Ensure that all configuration changes and their rollback plans are reviewed and signed off by the team and operational experts.
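The wave structure above could be written down as data plus a loop that stops at the first wave that fails verification. This is a minimal sketch: the Wave type, the deploy and verify callbacks, and the specific AZ and region names are illustrative assumptions, not part of any real deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Wave:
    name: str
    targets: Tuple[str, ...]  # AZs or regions covered by this wave
    strategy: str             # free-text label passed through to deploy()


# Example plan; the AZ and region names are placeholders.
WAVES = [
    Wave("wave-1", ("us-east-1a",), "canary + rolling"),
    Wave("wave-2", ("us-east-1b", "us-east-1c"), "rolling"),
    Wave("wave-3", ("us-west-2", "eu-west-1"), "rolling"),
    Wave("wave-4", ("ap-southeast-1", "sa-east-1"), "rolling"),
]


def deploy_in_waves(
    waves: Sequence[Wave],
    deploy: Callable[[str, str], None],
    verify: Callable[[Wave], bool],
) -> List[str]:
    """Deploy wave by wave and return the names of the waves that passed.

    deploy(target, strategy) and verify(wave) are hypothetical hooks into
    your own pipeline and monitoring. Later waves never start once a wave
    fails verification.
    """
    completed: List[str] = []
    for wave in waves:
        for target in wave.targets:
            deploy(target, wave.strategy)
        if not verify(wave):
            break
        completed.append(wave.name)
    return completed
```

Keeping the plan as data makes it easy to review alongside the code change, and the early break keeps a bad release confined to the smallest wave that exposed it.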
Use feature flags for a controlled rollout and as a kill switch: For new features or high-risk changes, you can use a feature flag as a dial to gradually increase exposure to production traffic as confidence in the feature grows. If any issues are found, you can turn the feature off instantly by toggling the flag. The main drawback of feature flags is that they can make the code fragile, harder to test, harder to maintain, and less secure. If flags are not short-lived (removed after the release), or if they are used for all changes rather than only risky ones, the number of flags in the code explodes and tech debt increases significantly.
Use auto rollbacks: Automate your deployment pipeline so that it continuously monitors key metrics and can perform rollbacks automatically. Monitoring and rolling back can be done manually, but doing so is time-consuming and error-prone, which increases the risk of an outage. If you don’t have auto rollbacks, enable alerts on core metrics and keep a playbook that clearly directs anyone on the team through a quick rollback.
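A feature flag that works both as a dial and as a kill switch could look roughly like the sketch below. The FeatureFlag class and its fields are hypothetical; real flag systems usually store this state in a remote service so it can change without a deploy.

```python
import zlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = True      # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0  # dial: share of users (0-100) who get the feature

    def is_on(self, user_id: str) -> bool:
        """Decide whether this user sees the feature."""
        if not self.enabled:
            return False
        # Hash the flag name with the user id so that each user gets a stable
        # bucket in 0..99, and different flags bucket users independently.
        bucket = zlib.crc32(f"{self.name}:{user_id}".encode("utf-8")) % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new-checkout", rollout_percent=10)  # hypothetical flag name
flag.rollout_percent = 50   # turn the dial up as confidence grows
flag.enabled = False        # kill switch: off for all users immediately
```

Because the bucket for a given user is stable, raising rollout_percent only adds users; nobody who already has the feature loses it until the kill switch is used.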
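The decision an automated rollback has to make can be sketched as a comparison between pre-release and post-release metrics. Everything here is an assumption for illustration: the Metrics fields, the threshold defaults, and the roll_back callback would all come from your own monitoring and pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def should_roll_back(
    baseline: Metrics,
    current: Metrics,
    max_error_rate_increase: float = 0.01,  # illustrative threshold
    max_latency_ratio: float = 1.5,         # illustrative threshold
) -> bool:
    """Return True when the release has regressed past either threshold."""
    error_regressed = current.error_rate - baseline.error_rate > max_error_rate_increase
    latency_regressed = current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    return error_regressed or latency_regressed


def monitor_release(
    baseline: Metrics,
    samples: Iterable[Metrics],
    roll_back: Callable[[], None],
) -> str:
    """Check each metrics sample after a deploy; roll back on the first regression."""
    for index, current in enumerate(samples):
        if should_roll_back(baseline, current):
            roll_back()
            return f"rolled back at sample {index}"
    return "healthy"
```

Comparing against a baseline taken just before the release, rather than against fixed absolute limits, keeps the check meaningful for services whose normal error rate or latency differs.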