When you push to production multiple times a day, the stakes are high, and manual rollbacks are slow.
Here's my experience from deploying an important backend service for a large global streaming service using Canary Deployments.
Why Canary Deployments?
We adopted Canary Deployments for a few reasons. We'd had a couple of incidents where an automatic rollback would have given us a faster resolution, and we had pressure from management to adopt Canary Deployments if we wanted to keep doing Continuous Deployment and avoid Release Trains. An important goal was automatic rollback if errors spiked in a new release.
Our Canary Setup
We're using the Canary deployment strategy in Argo Rollouts, which of course means we're running on Kubernetes. The service most of my experience comes from is deployed quite often; it's not unusual to have several deployments to production per day.
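For reference, here's roughly what a canary strategy in a Rollout spec can look like. This is a minimal sketch, not our actual configuration: the service names, weights and durations are placeholders.

strategy:
  canary:
    canaryService: my-service-canary      # hypothetical Service names, not our real ones
    stableService: my-service-stable
    trafficRouting:
      istio:
        virtualService:
          name: my-service-vsvc           # Istio VirtualService used to split traffic
    steps:
    - setWeight: 10                       # start with a small slice of traffic
    - pause: {duration: 2m}
    - setWeight: 50
    - pause: {duration: 2m}
    analysis:                             # background analysis that can abort the rollout
      templates:
      - templateName: response-code-500
      startingStep: 1
      args:
      - name: canary-service-name
        value: my-service-canary
      - name: service-namespace
        value: my-namespace

The analysis section is what wires in the automatic rollback: if the referenced AnalysisTemplate fails, Argo Rollouts aborts the rollout and shifts traffic back to the stable version.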
Our canary analysis is based on a single, simple metric: the rate of 5xx HTTP status codes, and that's what I would recommend you start with as well. If our error rate goes above 1% for 3 analysis runs, we fail the canary and trigger a rollback. What's your go-to canary metric?
analysis:
- name: response-code-500
  interval: 30s
  successCondition: len(result) == 0 || result[0] <= 0.01 || isNaN(result[0])
  failureLimit: 3
  query: |
    sum(irate(
      istio_requests_total{destination_service="{{args.canary-service-name}}.{{args.service-namespace}}.svc.cluster.local",kubernetes_namespace="{{args.service-namespace}}",response_code=~"5.*"}[5m]
    )) /
    sum(irate(
      istio_requests_total{destination_service="{{args.canary-service-name}}.{{args.service-namespace}}.svc.cluster.local",kubernetes_namespace="{{args.service-namespace}}"}[5m]
    ))
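For completeness: the snippet above is an excerpt. In Argo Rollouts, a metric like this lives in an AnalysisTemplate with a Prometheus provider, roughly like the sketch below (the Prometheus address is a placeholder, and the query is the same 5xx ratio as above).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: response-code-500
spec:
  args:
  - name: canary-service-name
  - name: service-namespace
  metrics:
  - name: response-code-500
    interval: 30s
    successCondition: len(result) == 0 || result[0] <= 0.01 || isNaN(result[0])
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090   # placeholder address
        query: |
          # the same 5xx ratio query as above goes here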
Key learnings
It’s been a couple of years of canary deployments now. So what have we learnt?
Gradual deployment has been beneficial
A slow rollout (over minutes) gives new pods traffic gradually, so they have a chance to warm up. (The service has in-memory caches, among other things.)
Good to be able to filter metrics and logs on canary/stable
We have metrics in Prometheus and logs in the ELK stack, and having both tagged with canary/stable has helped us manually spot differences in behaviour during deployments.
Good to catch configuration errors
Have you ever seen a bad config slip through tests? What's unique to each environment is usually the configuration: we connect to different endpoints and databases in different environments. When introducing something new, it's easy to get an endpoint configuration wrong. That sort of thing will not be detected by unit or integration tests, but it might be detected by the canary analysis.
Metrics lag
In our system, it takes a while before we get metrics from the canary pods. The metrics endpoints on our pods are scraped every 30s, and our analysis is run every 30s, so there is some delay between errors happening and the metric moving. If you don’t run your analysis for long enough, you risk passing your canary without the analysis even seeing any canary traffic.
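If you have the same kind of lag, it can help to delay the first measurement and make sure the analysis runs for long enough. Here's a sketch of what that could look like on the metric spec; the numbers are illustrative, not our actual settings.

metrics:
- name: response-code-500
  initialDelay: 60s          # wait for at least one scrape before the first measurement
  interval: 30s
  count: 10                  # keep measuring long enough to actually see canary traffic
  failureLimit: 3
  successCondition: len(result) == 0 || result[0] <= 0.01 || isNaN(result[0])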
That’s what I could think of tonight! What are your experiences?