Robust Istio Upgrades in 7 easy steps
How do you safely upgrade Istio?

There are many articles that describe how to progressively roll out applications with Istio, but few about upgrading Istio itself. This article presents some shortcomings of the methods described in the official documentation and proposes an alternative solution.
Istio consists of multiple components divided into the control plane and the data plane. Those components can be installed in multiple ways; in our case we were using Helm. The documentation presents two upgrade methods, in-place and canary, and doesn't recommend using in-place upgrades in production.
Canary upgrade and its problems
Istio supports running two versions of the control plane in the same cluster, each managing different namespaces, with the help of revisions.

Istio documentation suggests:
deploying another canary control plane
deploying a new version of Istio Ingress Gateway
migrating existing workloads to the new revision by relabelling their namespaces
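To make the relabelling step concrete, here is a minimal Python sketch that only builds the kubectl commands for moving one namespace to a revision and restarting its workloads. The namespace test-ns and the revision name canary are placeholders I made up, and the commands are returned as argument lists rather than executed, so treat it as an illustration and not a drop-in script.

```python
from typing import List


def relabel_namespace_cmd(namespace: str, revision: str) -> List[str]:
    """Build the kubectl command that points a namespace at a control-plane revision.

    "istio-injection-" removes the legacy injection label, and
    "istio.io/rev=<revision>" selects the revision that injects sidecars.
    """
    return [
        "kubectl", "label", "namespace", namespace,
        "istio-injection-",
        f"istio.io/rev={revision}",
        "--overwrite",  # needed when the namespace already carries an istio.io/rev label
    ]


def restart_deployments_cmd(namespace: str) -> List[str]:
    """Build the command that restarts Deployments so pods get re-injected."""
    return ["kubectl", "rollout", "restart", "deployment", "-n", namespace]


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    for cmd in (relabel_namespace_cmd("test-ns", "canary"),
                restart_deployments_cmd("test-ns")):
        print(" ".join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try outside a cluster; passing each list to subprocess.run would be the obvious next step.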
The docs also admit that you probably won't be able to use Helm if you don't want to create a new load balancer for your Ingress Gateway:
"Because other installation methods bundle the gateway Service, which controls its external IP address, with the gateway Deployment, only the Kubernetes YAML method is supported for this upgrade method."
https://istio.io/v1.22/docs/setup/additional-setup/gateway/#canary-upgrade-advanced
This is, of course, an issue, because with a new load balancer you would need to move all the traffic to it. In highly secured environments this can be further complicated by security policies tightly coupled to fixed IP addresses.
To summarise, the officially proposed upgrade process involves multiple manual steps that require direct access to the cluster. In highly secured environments, such access is sometimes not possible.
Semi-canary upgrade
Here I present an alternative solution that doesn't require as many manual steps. The idea is taken from Pratima Nambiar's talk at the Istio meetup demo. The process is the following:
Upgrade the CRDs
Install a canary control plane (skip this step if you have it running)
Have sample services managed by the canary control plane, and validate that everything is OK, e.g.:
Deploy a new service with new versions of Istio CRDs
Run smoke tests on the services, checking that the traffic is flowing as expected
Observe control-plane metrics for apparent issues, paying special attention to metrics like pilot_proxy_convergence_time, pilot_proxy_queue_time, pilot_xds_push_time, pilot_inbound_updates, and pilot_total_xds_rejects. Read more about the metrics in the documentation.
In-place upgrade the main control plane
Run the validations and observe the metrics
Rolling-restart production workloads for them to pick up new side-cars
Upgrade Ingress Gateway
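The metric checks in the validation steps can be partly automated. The sketch below, written under my own assumptions, sums a counter from a Prometheus text-format scrape of istiod and compares the totals taken before and after an upgrade step. The sample payloads are invented for illustration, and fetching the scrape from the control plane is deliberately left out.

```python
def sum_metric(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric in a Prometheus text-format payload."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blanks, "# HELP"/"# TYPE" comments, and other metrics.
        if not line or line.startswith("#") or not line.startswith(name):
            continue
        rest = line[len(name):]
        # Require an exact name match: the name must be followed by labels or a value.
        if not rest or rest[0] not in "{ \t":
            continue
        if rest[0] == "{":
            end = rest.rfind("}")
            if end == -1:
                continue  # malformed line, ignore it
            rest = rest[end + 1:]
        fields = rest.split()
        if fields:
            total += float(fields[0])  # first field is the value, an optional timestamp follows
    return total


def rejects_increased(before: str, after: str) -> bool:
    """Return True if proxies rejected more xDS pushes after the upgrade step."""
    name = "pilot_total_xds_rejects"
    return sum_metric(after, name) > sum_metric(before, name)


# Invented sample scrapes, for illustration only.
BEFORE = """\
# TYPE pilot_total_xds_rejects counter
pilot_total_xds_rejects{type="cds"} 2
pilot_total_xds_rejects{type="lds"} 1
pilot_xds_pushes{type="cds"} 40
"""
AFTER = BEFORE.replace('{type="lds"} 1', '{type="lds"} 5')

if __name__ == "__main__":
    print(sum_metric(BEFORE, "pilot_total_xds_rejects"))
    print(rejects_increased(BEFORE, AFTER))
```

A growing reject count is only one signal; the timing metrics listed above are histograms and would need a different comparison, for instance on their quantiles in your monitoring system.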
A crucial part of this method's reliability is performing it on lower-tier environments first, which reduces the likelihood of issues in production. Of course, upgrades on the lower-tier environments must be performed with the same versions of the components. It would not make sense to upgrade istiod from 1.15.x to 1.16.x on dev and expect no issues when upgrading from 1.21.x to 1.22.x in production.
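A small guard can catch a mismatch between what was rehearsed and what is about to be rolled out. The following sketch compares only the major.minor part of the versions, which is my own simplification; you may prefer to compare the patch level as well.

```python
from typing import Tuple


def minor_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version such as "1.22.1" or "1.22.x"."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_upgrade_path(rehearsed_from: str, rehearsed_to: str,
                      planned_from: str, planned_to: str) -> bool:
    """Check that the rehearsed upgrade covers the same minor-version hop as the planned one."""
    return (minor_version(rehearsed_from) == minor_version(planned_from)
            and minor_version(rehearsed_to) == minor_version(planned_to))


if __name__ == "__main__":
    # The mismatch described above: dev rehearsed 1.15 -> 1.16, production plans 1.21 -> 1.22.
    print(same_upgrade_path("1.15.x", "1.16.x", "1.21.x", "1.22.x"))
    # A matching rehearsal.
    print(same_upgrade_path("1.21.0", "1.22.1", "1.21.3", "1.22.1"))
```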
Why does this work and why is it robust?
When we upgrade Istio CRDs, we don't expect incompatible changes within the same API version. Thus, if we have a DestinationRule with apiVersion: networking.istio.io/v1alpha3, we expect it to continue working, with any incompatibilities potentially introduced only in another version such as v1beta1. To make sure that this is the case, we should carefully read the release notes and test on lower-tier environments.
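Before reading the release notes, it helps to know which Istio API versions your manifests actually use. Here is a rough sketch that scans multi-document YAML text for top-level apiVersion and kind fields. It uses regular expressions instead of a YAML parser to stay dependency-free, so it is an approximation, and the sample manifest is made up.

```python
import re
from typing import Set, Tuple

_API = re.compile(r"^apiVersion:\s*(\S+)", re.MULTILINE)
_KIND = re.compile(r"^kind:\s*(\S+)", re.MULTILINE)
_DOC_SEPARATOR = re.compile(r"^---\s*$", re.MULTILINE)


def istio_api_versions(manifest_text: str) -> Set[Tuple[str, str]]:
    """Return (apiVersion, kind) pairs for resources in an *.istio.io API group."""
    found = set()
    for doc in _DOC_SEPARATOR.split(manifest_text):
        api = _API.search(doc)
        kind = _KIND.search(doc)
        if not api or not kind:
            continue
        api_version = api.group(1).strip("'\"")
        group = api_version.split("/")[0]
        if group.endswith(".istio.io"):
            found.add((api_version, kind.group(1).strip("'\"")))
    return found


# Made-up manifest, for illustration only.
SAMPLE = """\
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
"""

if __name__ == "__main__":
    for api_version, kind in sorted(istio_api_versions(SAMPLE)):
        print(api_version, kind)
```

The resulting list tells you which sections of the release notes deserve the closest reading.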
When we upgrade the canary control plane, we don't risk anything because it only manages the sample workloads we use for smoke testing. Istio can run multiple versions of the control plane in the cluster simultaneously.
After testing the sample workloads, we can be confident that the control plane works with the new CRDs and that traffic flows correctly. If we use an existing gateway for the tests, we also verify that traffic flows between two versions of the proxies: the gateway is still running the old version, while the sample workloads run the new one (in the sidecars). Now it's time to upgrade the main control plane and the ingress gateway. We observe the metrics again and are prepared to roll back swiftly if we see issues.
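Putting it together, the seven steps behave like a gated pipeline: each step runs only if the previous one succeeded, and the first failure tells us where to stop and consider a rollback. The sketch below models just that control flow. The step actions are hypothetical stand-ins that always succeed, not real upgrade commands.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_gated(steps: List[Step]) -> Optional[str]:
    """Run steps in order and return the name of the first failing step, or None."""
    for name, action in steps:
        if not action():
            return name  # stop here; later steps are never started
    return None


def ok() -> bool:
    """Hypothetical stand-in for a step that succeeded."""
    return True


def failing() -> bool:
    """Hypothetical stand-in for a step whose validation failed."""
    return False


PLAN: List[Step] = [
    ("upgrade CRDs", ok),
    ("install canary control plane", ok),
    ("validate sample services on canary", ok),
    ("in-place upgrade main control plane", ok),
    ("validate and observe metrics", ok),
    ("rolling-restart production workloads", ok),
    ("upgrade ingress gateway", ok),
]

if __name__ == "__main__":
    failed = run_gated(PLAN)
    print("all steps passed" if failed is None else f"stopped at: {failed}")
```

In a real pipeline each action would wrap the corresponding Helm or kubectl call plus its checks, and a failure would trigger whatever rollback procedure you have rehearsed on the lower-tier environments.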



