<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[alpa.sh]]></title><description><![CDATA[10x Software Development, Architecture and the Cloud]]></description><link>https://alpa.sh</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1724160830520/d8e9670b-c050-45ef-b120-26bed99994b6.png</url><title>alpa.sh</title><link>https://alpa.sh</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 21:56:58 GMT</lastBuildDate><atom:link href="https://alpa.sh/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[A Strange Death of DevOps]]></title><description><![CDATA[About 15 years ago DevOps was born: a movement that promised unseen improvements in time to market and software quality. It aimed to remove the silos between the development and operations teams, thus DevOps, and make the process of software delivery...]]></description><link>https://alpa.sh/a-strange-death-of-devops</link><guid isPermaLink="true">https://alpa.sh/a-strange-death-of-devops</guid><category><![CDATA[Devops]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[System administration]]></category><dc:creator><![CDATA[Alexander Pashkov]]></dc:creator><pubDate>Sun, 08 Dec 2024 10:45:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/6luVOchQ934/upload/8007cd31486510bddba0997ddbb40615.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About 15 years ago DevOps was born: a movement that promised unseen improvements in time to market and software quality. It aimed to remove the silos between the development and operations teams, thus DevOps, and make the process of software delivery to the end user, which involved both, much more efficient.</p>
<p>Nowadays, “DevOps Engineer” (and some newer derivatives like DevSecOps and MLOps) jobs are an industry standard, and the required skills usually include knowledge of cloud platforms like AWS, infrastructure-as-code tools, container orchestrators like Kubernetes, build servers like Jenkins, and monitoring solutions like Prometheus and Grafana. The actual job responsibilities usually include provisioning, configuring, and maintaining cloud infrastructure, build pipelines, and monitoring. Many roles require only minimal interaction with the developers, some close to none. But wait, we were supposed to get rid of the division between development and operations, weren’t we?</p>
<p>I call such roles “Cloud System Administration”, inspired by a book with almost the same name. Let’s see how Wikipedia defines “System administrator”:</p>
<p><em>An IT administrator, system administrator, sysadmin, or admin is a person who is responsible for the upkeep, configuration, and reliable operation of computer systems, especially multi-user computers, such as servers. The system administrator seeks to ensure that the uptime, performance, resources, and security of the computers they manage meet the needs of the users, without exceeding a set budget when doing so.</em></p>
<p>If you add the word “cloud” a few times here and there, you will get an accurate description of what the vast majority of freshly minted DevOps engineers do, albeit in the cloud.</p>
<p>So how can we revive DevOps? Let’s remember a few things.</p>
<h3 id="heading-devops-is-not-about-tools-its-about-the-process"><strong>DevOps is not about tools, it’s about the process</strong></h3>
<p>DevOps is about making the delivery of features to the end users more efficient. It’s about increasing the development speed and decreasing the number of bugs in production. It’s about decreasing the lead time (how much time elapses between committing code and deploying it to production), increasing MTBF, the mean time between failures (the elapsed time between inherent failures of a system during normal operation), and decreasing MTTR, the mean time to restore (the time it takes to restore service after a production failure). As you work on improving these metrics, you will likely have to use the cloud and all the other aforementioned tools, but tools are just tools; they are not the end goal.</p>
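<p>To make these definitions concrete, here is a minimal shell sketch that computes MTBF and MTTR from a hypothetical incident log (all timestamps are made up for the example):</p>

```shell
# Hypothetical incident log: epoch seconds of each failure and the
# matching recovery (made-up numbers, for illustration only).
failures="1000 4600 9400"
recoveries="1060 4720 9500"

set -- $failures
f1=$1; f2=$2; f3=$3
# MTBF: average gap between consecutive failures.
mtbf=$(( ((f2 - f1) + (f3 - f2)) / 2 ))

set -- $recoveries
r1=$1; r2=$2; r3=$3
# MTTR: average time from each failure to its recovery.
mttr=$(( ((r1 - f1) + (r2 - f2) + (r3 - f3)) / 3 ))

echo "MTBF: ${mtbf}s, MTTR: ${mttr}s"   # prints: MTBF: 4200s, MTTR: 93s
```

<p>Improving MTBF means pushing the failure timestamps further apart; improving MTTR means shrinking the failure-to-recovery gaps. The tools only matter insofar as they move these numbers.</p>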
<p>For example, if your application suffers from a memory leak that your development team doesn’t have time to fix, you can increase the mean time between failures (MTBF) observed by the user by deploying a set of replicas to a container orchestrator (e.g. Kubernetes or AWS ECS) with configured memory limits. When a replica exceeds the limit, it is restarted and remains operational until the limit is reached again. By running more than one replica, we increase the chances of the app staying available while one of the replicas is restarted. However, deploying to a container orchestrator is just a tool to solve a problem, not a goal in itself. To do DevOps you don’t have to be in the cloud or use any of the tools in the CNCF landscape.</p>
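<p>As a sketch of this mitigation (the deployment name, image, and limit value are all hypothetical), the orchestrator side could look like this on Kubernetes:</p>

```shell
# Hypothetical example: 3 replicas with a memory limit, so a leaking
# replica gets OOM-killed and restarted while the others keep serving.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaky-app                # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: leaky-app
  template:
    metadata:
      labels:
        app: leaky-app
    spec:
      containers:
        - name: leaky-app
          image: example.com/leaky-app:1.0   # hypothetical image
          resources:
            limits:
              memory: "512Mi"    # the restart threshold for the leak
EOF
```

<p>When a container exceeds the limit, the kubelet OOM-kills it and the default <code>restartPolicy: Always</code> brings it back; with three replicas, the service stays available during each restart.</p>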
<h3 id="heading-you-cannot-outsource-devops">You cannot outsource DevOps</h3>
<p>DevOps is about improving your end-to-end process, and if you are not interested in that, no one is, least of all the contractors you hire. Contractors can be an extremely valuable asset, bringing expertise you may lack in-house, but the reason you need them has to be driven from the inside and continuously reconciled to make sure that the contractors’ work stays aligned with your initial goals. I’ve seen atrocious cases of ignorance that cost the company a lot: the contractors come, do their work, get paid, and the company ends up with a hefty bill and a complex, useless system that has to be thrown out.</p>
<p>Thus, understand your goals, focus on simplicity, and try to develop or hire the expertise in-house.</p>
<h3 id="heading-summary">Summary</h3>
<p>To sum up, there is no such role as a DevOps engineer: you are either a cloud system administrator or a software engineer practicing DevOps, and it’s more interesting to be the latter than the former.</p>
]]></content:encoded></item><item><title><![CDATA[Robust Istio Upgrades in 7 easy steps]]></title><description><![CDATA[There are many articles that describe how to progressively roll out applications with Istio, but not about upgrading it itself. This article presents some shortcomings of the methods described in official documentation and proposes an alternative sol...]]></description><link>https://alpa.sh/robust-istio-upgrades-in-7-easy-steps</link><guid isPermaLink="true">https://alpa.sh/robust-istio-upgrades-in-7-easy-steps</guid><category><![CDATA[#istio]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Alexander Pashkov]]></dc:creator><pubDate>Sun, 18 Aug 2024 11:05:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/lbOllm0HCfc/upload/052730f46f644d480b3e94d7fbc24927.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Many articles describe how to progressively roll out applications with Istio, but few cover upgrading Istio itself. This article presents some shortcomings of the methods described in the official documentation and proposes an alternative solution.</em></p>
<p>Istio consists of multiple components divided into the control plane and the data plane. Those components can be installed in multiple ways; in our case we were using Helm. The documentation presents two upgrade methods, in-place and canary, and doesn't recommend using in-place upgrades for production.</p>
<h2 id="heading-canary-upgrade-and-its-problems">Canary upgrade and its problems</h2>
<p>Istio supports running two versions of the control plane in the same cluster, each managing different namespaces, with the help of revisions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723974805405/d680d892-8cb4-46b6-96ce-d3f6b5bb7cb3.png" alt class="image--center mx-auto" /></p>
<p>Istio documentation suggests:</p>
<ol>
<li><p>deploying another canary control plane</p>
</li>
<li><p>deploying a new version of Istio Ingress Gateway</p>
</li>
<li><p>migrating existing workloads to the new revision by relabelling their namespaces</p>
</li>
</ol>
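<p>Sketched as commands, assuming the Helm-based install from the Istio docs (release names, namespaces, and the <code>canary</code> revision name are placeholders), the official flow looks roughly like this:</p>

```shell
# 1. Deploy a second, canary control plane alongside the existing one.
helm install istiod-canary istio/istiod -n istio-system \
  --set revision=canary

# 2. Deploy a new ingress gateway for the canary revision.
helm install istio-ingress-canary istio/gateway -n istio-ingress \
  --set revision=canary

# 3. Move a workload namespace to the canary revision and restart it.
kubectl label namespace my-app istio-injection- istio.io/rev=canary
kubectl rollout restart deployment -n my-app
```

<p>Note that step 2 installs a second gateway chart, and therefore a second <code>Service</code>, which is what provisions a new load-balancer.</p>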
<p>The docs also admit that you probably won't be able to use Helm if you don't want to create a new load-balancer for your Ingress Gateway:</p>
<blockquote>
<p>Because other installation methods bundle the gateway <code>Service</code>, which controls its external IP address, with the gateway <code>Deployment</code>, only the <a target="_blank" href="https://istio.io/latest/docs/setup/additional-setup/gateway/#tabset-docs-setup-additional-setup-gateway-1-2-tab">Kubernetes YAML method</a> is supported for this upgrade method.</p>
<p><a target="_blank" href="https://istio.io/v1.22/docs/setup/additional-setup/gateway/#canary-upgrade-advanced">https://istio.io/v1.22/docs/setup/additional-setup/gateway/#canary-upgrade-advanced</a></p>
</blockquote>
<p>This is, of course, an issue, because with a new load-balancer you would need to move all the traffic to it. In highly secured environments this can be further complicated by security policies tightly coupled to fixed IP addresses.</p>
<p>To summarise, the officially proposed upgrade process involves multiple manual steps with direct access to the cluster, which, in those same highly secured environments, is sometimes not possible.</p>
<h2 id="heading-semi-canary-upgrade">Semi-canary upgrade</h2>
<p>Here I present an alternative solution that doesn't require as many manual steps. The idea is taken from Pratima Nambiar's demo at an Istio meetup talk. The process is the following:</p>
<ol>
<li><p>Upgrade the CRDs</p>
</li>
<li><p>Install a canary control plane (skip this step if you have it running)</p>
</li>
<li><p>Have sample services managed by the canary control plane, and validate that everything is OK, e.g.:</p>
<ol>
<li><p>Deploy a new service with new versions of Istio CRDs</p>
</li>
<li><p>Run smoke tests on the services, checking that the traffic is flowing as expected</p>
</li>
<li><p>Observe control-plane metrics for apparent issues, pay special attention to metrics like <code>pilot_proxy_convergence_time</code>, <code>pilot_proxy_queue_time</code>, <code>pilot_xds_push_time</code>, <code>pilot_inbound_updates</code>, and <code>pilot_total_xds_rejects</code>. Read more about the metrics in the <a target="_blank" href="https://istio.io/v1.22/docs/reference/commands/pilot-agent/">documentation</a>.</p>
</li>
</ol>
</li>
<li><p>In-place upgrade the main control plane</p>
</li>
<li><p>Run the validations and observe the metrics</p>
</li>
<li><p>Rolling-restart production workloads so they pick up the new sidecars</p>
</li>
<li><p>Upgrade Ingress Gateway</p>
</li>
</ol>
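<p>Assuming the same Helm-based installation (chart releases, namespaces, and the revision name are placeholders), the seven steps above can be sketched as:</p>

```shell
# 1. Upgrade the CRDs (shipped in the istio/base chart).
helm upgrade istio-base istio/base -n istio-system

# 2. Canary control plane for the sample workloads (skip if already present).
helm install istiod-canary istio/istiod -n istio-system --set revision=canary

# 3. Point the sample namespace at the canary and smoke-test it.
kubectl label namespace smoke-tests istio.io/rev=canary --overwrite
kubectl rollout restart deployment -n smoke-tests

# 4. In-place upgrade of the main control plane.
helm upgrade istiod istio/istiod -n istio-system

# 5. (re-run the validations and watch the pilot_* metrics)

# 6. Rolling-restart production workloads to pick up the new sidecars.
kubectl rollout restart deployment -n production

# 7. Upgrade the ingress gateway.
helm upgrade istio-ingress istio/gateway -n istio-ingress
```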
<p>One of the crucial components of the reliability of this method is performing it on lower-tier environments first, thus reducing the likelihood of issues to a minimum. Of course, upgrades on the lower-tier environments must be performed with the same versions of the components: it would not make sense to upgrade <strong>istiod</strong> from 1.15.x to 1.16.x on dev and expect no issues upgrading from 1.21.x to 1.22.x in production.</p>
<h3 id="heading-why-this-works-and-why-its-robust">Why does this work, and why is it robust?</h3>
<p>When we upgrade Istio CRDs we don't expect incompatible changes within the same version. Thus, if we have a <code>DestinationRule</code> with <code>apiVersion: networking.istio.io/v1alpha3</code>, we expect it to continue working, with incompatibilities potentially introduced only in <code>v1beta1</code>. To make sure that this is the case, we should carefully read the release notes and do the testing on lower-tier environments.</p>
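<p>For instance, a rule like the following (a hypothetical example; the name and host are placeholders) should survive the CRD upgrade unchanged:</p>

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3   # older API version, still served
kind: DestinationRule
metadata:
  name: reviews-rule                       # hypothetical name
spec:
  host: reviews.prod.svc.cluster.local     # hypothetical host
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
EOF
```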
<p>When we upgrade the canary control plane we don't risk anything, because it only manages the sample workloads we use for smoke-testing. Istio can have multiple versions of the control plane working in the cluster simultaneously.</p>
<p>After running the tests on the sample workloads, we are confident that the control plane can work with the new CRDs and that traffic is flowing correctly. If we use an existing gateway for the tests, we also verify that traffic flows between two versions of the proxies: the Gateway is still running the old version, while the sample workloads run the new one (in their sidecars). Now it's time to upgrade the main control plane and the ingress. We observe the metrics again and are prepared to do a swift rollback if we see issues.</p>
]]></content:encoded></item><item><title><![CDATA[Unquantifiable Value Propositions Flood the Market: Kubernetes This, AI That]]></title><description><![CDATA[Nowadays, advertisements bombard us with promises to revolutionize our lives: claims of unseen scalability, cost optimizations, agility, and enhanced developer productivity. The pitch is always the same: buy our service, and our magical software will...]]></description><link>https://alpa.sh/unquantifiable-value-propositions-flood-the-market-kubernetes-this-ai-that</link><guid isPermaLink="true">https://alpa.sh/unquantifiable-value-propositions-flood-the-market-kubernetes-this-ai-that</guid><category><![CDATA[software development]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Values & Purpose]]></category><dc:creator><![CDATA[Alexander Pashkov]]></dc:creator><pubDate>Wed, 10 Jul 2024 07:11:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nrnd1-fTsdQ/upload/806af5cf11e5ca50edbfb3565c7706bb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Nowadays, advertisements bombard us with promises to revolutionize our lives: claims of unseen scalability, cost optimizations, agility, and enhanced developer productivity. The pitch is always the same: buy our service, and our magical software will solve all your problems.</p>
<p>Here lies the problem. Almost fifty years ago Fred Brooks published his seminal book The Mythical Man-Month, and in the follow-up essay “No Silver Bullet” he asserted that there is no silver bullet: no single software development technique, tool, or methodology can increase development speed by an order of magnitude. His observation was that most time (and money) is spent dealing with the essential complexity of a project: domain models, data flow, and the like. Thus, the magical promises often fail to materialize. At best, these tools might reduce some accidental complexity, but it’s crucial to ensure that the cost of the tool or service is less than the benefit it brings.</p>
<p>Consider this example: a small company decides to “modernize” its infrastructure because it feels outdated, running a few services on mostly manually configured VMs. They see everyone moving to Kubernetes and feel the pressure to follow suit.</p>
<p>They embark on the journey, but the internal team lacks the skills and isn’t eager to acquire them. The company needs help, so they hire external contractors who swear to deliver.</p>
<p>And deliver they do. The applications are containerized, Kubernetes clusters are spun up, and the GitOps methodology is strictly followed (God bless Flux and ArgoCD).</p>
<p>And what does the company get? An enormous amount of new accidental complexity on top of the essential one, and a hefty bill for the services delivered.</p>
<p>The developers need to learn how to deal with containers, services, scheduling rules, new pipelines, etc. And remember: this is not what brings in the revenue; this is new accidental complexity.</p>
<p>Thus arises the question: where is the money, Lebowski?</p>
<p>When embarking on such journeys, make sure you know what you’re doing and aren’t blindly following the latest trend. Remember, there’s no silver bullet.</p>
<p>Real value must be delivered for the greater benefit of everyone, whether you are the company that seeks the service or a contractor that aims to provide it.</p>
]]></content:encoded></item></channel></rss>