Infrastructure 3.0: Load balancing, observability, and deployments
In our last blog about our infrastructure, we described how we moved from VMs to Kubernetes. This is a continuation of that journey.
We’ll also go over some of the challenges we’ve met along the way, how we dealt with them, and what’s coming up in the next iteration.
Load balancing
Unleash uses Traefik as a reverse proxy and load balancer.
Previously, we exposed Traefik through Network Load Balancers (NLBs) that were provisioned automatically via Kubernetes service annotations. Now, Traefik is exposed by NLBs that we provision directly with Pulumi.
Here’s how we got there: After some time, we noticed that we were failing to serve some requests. Specifically, the requests timed out without reaching our Unleash instances.
We found that this only happened when a) we rolled out new versions of Traefik, or b) Traefik pods were scaled down. The root cause was that Traefik didn’t have enough time to finish ongoing requests before shutting down.
Here’s what I mean: when a Traefik pod stopped, the AWS NLB didn’t have enough time to mark the host running that Traefik instance as unhealthy and stop sending traffic to it.
Part of the fix was to leverage the preStop lifecycle hook in Kubernetes:
lifecycle: {
  preStop: {
    exec: {
      // Send SIGTERM to Traefik, then wait until the process has actually
      // exited before letting Kubernetes consider the container stopped.
      command: ["/bin/sh", "-c", "pkill -TERM traefik; while killall -0 traefik; do sleep 1; done"]
    }
  }
},
We also set a request accept grace timeout on each Traefik entrypoint:
'--entrypoints.websecure.transport.lifecycle.requestacceptgracetimeout=60s',
'--entrypoints.traefik.transport.lifecycle.requestacceptgracetimeout=60s',
'--entrypoints.web.transport.lifecycle.requestacceptgracetimeout=60s',
'--entrypoints.metrics.transport.lifecycle.requestacceptgracetimeout=60s',
Together, these settings make a Traefik pod keep serving requests on all entrypoints for 60 seconds after it’s told to shut down, excluding the health endpoint /ping, which gives the NLB time to mark the host as unhealthy and stop sending it new traffic.
However, the result was that we had only made Traefik pod termination slower; requests were still timing out.
We discovered an issue with the Kubernetes service the NLB was using to reach Traefik: it cut off traffic the moment a pod entered the terminating state.
Obviously, this wasn’t great. Both the Kubernetes service and the NLB were deciding when, and where, to route traffic, and Kubernetes made those decisions faster than the NLB could react, so the two ended up out of sync.
So we removed the Kubernetes service layer by exposing Traefik through host ports on the pods, and pointed the NLB at those host ports instead of the service.
This meant setting up the NLBs ourselves rather than auto-provisioning them through annotations on the service object. This solved our issues.
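To make this more concrete, here’s a minimal Pulumi sketch of the shape of this setup. The ports, IDs, and resource names are placeholders rather than our actual configuration: Traefik is published on the node through hostPort, and the NLB targets those host ports directly, with no Kubernetes service in between.

import * as aws from "@pulumi/aws";

// Placeholders for networking details that would come from elsewhere in the stack.
const vpcId = "vpc-0123456789abcdef0";
const subnetIds = ["subnet-aaaa1111", "subnet-bbbb2222"];

// Fragment of the Traefik container spec: the entrypoints are published
// directly on the node via hostPort, so no Kubernetes service sits in the path.
const traefikPorts = [
    { name: "websecure", containerPort: 8443, hostPort: 8443 },
    { name: "ping", containerPort: 9000, hostPort: 9000 },
];

// NLB target group that points at the nodes' host ports and uses
// Traefik's /ping endpoint for health checks.
const targetGroup = new aws.lb.TargetGroup("traefik", {
    port: 8443,
    protocol: "TCP",
    targetType: "instance",
    vpcId: vpcId,
    healthCheck: { protocol: "HTTP", path: "/ping", port: "9000" },
});

const nlb = new aws.lb.LoadBalancer("traefik", {
    loadBalancerType: "network",
    subnets: subnetIds,
});

new aws.lb.Listener("websecure", {
    loadBalancerArn: nlb.arn,
    port: 443,
    protocol: "TCP",
    defaultActions: [{ type: "forward", targetGroupArn: targetGroup.arn }],
});

// In practice, the cluster nodes are registered with the target group as well,
// for example through an autoscaling group attachment.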
Observability
Logging
We have transitioned from sending logs directly to CloudWatch from our applications to simply logging to standard output. This follows the recommendations of The Twelve-Factor App.
Inside our clusters, we run Fluent Bit as a DaemonSet that pushes application logs to Loki and S3.
We set up Loki with a 7-day retention period in each Kubernetes cluster and use Grafana as the interface for querying logs. For longer retention, we rely on S3.
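As a rough sketch of how this can be wired up with Pulumi, the following deploys Fluent Bit with the upstream Helm chart and configures one output that ships logs to Loki and one that archives them to S3. The namespace, hostnames, labels, and bucket name are illustrative, not our actual values, and the chart values assume the standard fluent-bit chart layout.

import * as k8s from "@pulumi/kubernetes";

// Fluent Bit DaemonSet via the upstream Helm chart, with two outputs:
// one pushing to Loki for short-term querying, one archiving to S3.
const fluentBit = new k8s.helm.v3.Release("fluent-bit", {
    chart: "fluent-bit",
    repositoryOpts: { repo: "https://fluent.github.io/helm-charts" },
    namespace: "logging",
    values: {
        config: {
            outputs: `
[OUTPUT]
    Name   loki
    Match  kube.*
    Host   loki-gateway.logging.svc
    Port   3100
    Labels job=fluent-bit

[OUTPUT]
    Name   s3
    Match  kube.*
    bucket example-log-archive
    region eu-north-1
    total_file_size 50M
    upload_timeout  10m
`,
        },
    },
});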
Metrics
We’ve migrated from Prometheus to VictoriaMetrics.
We now use VMAgent to scrape metrics, and we use remote write to ship them to both a central VictoriaMetrics instance and a local instance in each cluster.
This means we can now surface metrics from VictoriaMetrics inside our Unleash instances. For example, the network traffic view in our SaaS offering is powered by VictoriaMetrics.
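For illustration, here’s what the scraping and double remote write can look like with the VictoriaMetrics operator’s VMAgent resource, declared from Pulumi. The namespace and URLs are placeholders; the remoteWrite list is what sends every scraped sample to both destinations.

import * as k8s from "@pulumi/kubernetes";

// VMAgent custom resource (requires the VictoriaMetrics operator) that scrapes
// the cluster and remote writes to a local instance and a central one.
const vmagent = new k8s.apiextensions.CustomResource("vmagent", {
    apiVersion: "operator.victoriametrics.com/v1beta1",
    kind: "VMAgent",
    metadata: { name: "vmagent", namespace: "monitoring" },
    spec: {
        selectAllByDefault: true, // scrape everything the operator discovers
        remoteWrite: [
            // Local VictoriaMetrics instance in the cluster.
            { url: "http://vmsingle-local.monitoring.svc:8428/api/v1/write" },
            // Central VictoriaMetrics instance shared across clusters.
            { url: "https://metrics.example.com/api/v1/write" },
        ],
    },
});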
Deployments
As of this writing, we host roughly 450 customer instances.
Earlier, we used Pulumi to handle all deployments to our hosted instances. It took a lot of time.
Because of how it was set up, stacks had to be updated constantly. It was fragile, nerve-wracking, and pretty terrible for our internal developers.
To address this, we’ve split Pulumi’s deployment responsibility into three parts:
- Pulumi for customer configuration
- Unleash Cloud Operator for managing how Unleash runs in Kubernetes
- Release Channel Operator for keeping the Unleash version updated
This means that when we deploy new versions to production, Pulumi only has to update a custom resource. With this new setup, we can deploy a new version to 200 customers in less than 10 minutes.
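To give a feel for what this looks like, here’s a hypothetical sketch of the Pulumi side. The API group, kind, and fields below are invented for illustration (the real CRDs are internal); the point is that a production rollout is just an update to a custom resource, which the Release Channel Operator and Unleash Cloud Operator then reconcile across customer instances.

import * as k8s from "@pulumi/kubernetes";

// Hypothetical release channel resource: bumping the version here is all
// Pulumi has to do; the operators roll the change out to customer instances.
const stableChannel = new k8s.apiextensions.CustomResource("stable-channel", {
    apiVersion: "example.getunleash.io/v1", // hypothetical API group
    kind: "ReleaseChannel",                 // hypothetical kind
    metadata: { name: "stable" },
    spec: {
        unleashVersion: "5.7.0", // hypothetical version, bumped on each rollout
    },
});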
What’s next in Unleash’s infrastructure?
Some things we’re working on for the next iteration of our infrastructure:
- We’ve begun running Spot instances in our clusters.
- We found that some of our workloads don’t handle node draining well. To take care of those workloads, we’ve written a Kubernetes drain assistant.
- Together, Spot instances and reliable draining enable us to introduce Karpenter to our clusters. You’ll remember that we mentioned Karpenter in our last blog. While we’re not quite there yet, we’re super close to leveraging Karpenter’s ability to autoscale, rightsize, and balance availability zones.