Just over a year ago, Coolshop.com set out on our cloud journey. One of the things we were excited about was being able to use Kubernetes, and before long, we had our first deployment up and running in Google Kubernetes Engine (GKE).
With that first deployment quickly came the need to make it available to the internet. Initially confused by the seemingly wide array of options, we soon turned to the Kubernetes Ingress resource, even though it was (and still is) a feature in beta. We had been using NGINX for the prior 8 years, so it seemed natural to look into the NGINX Ingress Controller.
Installation and Initial Impression
As often happens when trying out new software, we scoured the internet for advice on how to install and start using NGINX Ingress Controller. We found advice that Helm was the best practice, read everything we could, and followed the guide that seemed most appropriate. We quickly had our Ingress running and were able to expose our deployments to the internet.
Basking in our success, we started migrating more and more of our apps, and before long, the time came to migrate our main application: the primary e-commerce software (which unfortunately is still a bit of a monolith).
Getting the Remote IP Address From Our Users
We soon found that our default installation of NGINX Ingress gave us a cluster-internal IP address instead of the remote IP of the end users, making a bunch of operations difficult for us. It turned out that this was indeed the expected behavior. In the default setup, a request sent to the Kubernetes cluster can land on any node. That node then forwards the request to a node that runs an Ingress Controller pod, and in this process the source address is rewritten, so the application sees a cluster IP rather than the client's address. The Kubernetes docs have a really great article describing the details of how source IP works.
Fortunately, this is easily fixable by setting externalTrafficPolicy to Local in the spec of the Service that exposes the NGINX Ingress Controller. Just be mindful that, in our experience, changing this setting on a controller that already exists can lead to a very unstable cluster, so to be safe, remove the controller and create a new one with the new setting.
With this new setting, traffic is sent directly to a node that runs an Ingress Controller pod, and the source IP is preserved as the remote IP of the end user.
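As a rough illustration of where this setting lives, here is a minimal sketch of a LoadBalancer Service in front of the controller pods. The name, namespace, and pod label are hypothetical placeholders rather than the values from our cluster, and a Helm-based installation would normally set the same field through the chart's values instead of a hand-written manifest.

```yaml
# Sketch only: metadata and labels below are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller   # hypothetical name
  namespace: ingress-nginx         # hypothetical namespace
spec:
  type: LoadBalancer
  # Send external traffic only to nodes that run a controller pod.
  # This skips the extra node-to-node hop and keeps the client source IP.
  externalTrafficPolicy: Local
  selector:
    app: nginx-ingress             # hypothetical pod label
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```

One consequence of Local is that nodes without a controller pod fail the load balancer's health check and receive no traffic, so how the controller pods are spread across nodes starts to matter.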
Random Timeouts in All Ingress Resources
About 9 months into our deployment, we had migrated all our applications to Google Kubernetes Engine. At this point, we'd had a few cases where we seemingly lost our Ingress controller for a few seconds, but they had been few and far between, and usually coincided with maintenance (a Kubernetes upgrade, for example). But the issues started to appear more frequently as we neared peak holiday season, and we knew we had to look into what was actually going on.
The symptom was simply that every request ended in a connection timeout, and an episode typically lasted no more than 1 or 2 minutes.
As it turns out, all of our issues traced back to the (seemingly) reliable guide we had followed to install and set up NGINX Ingress Controller in our cluster. While the guide was perfect for getting started, it omitted information on how to set up the controller for high availability.
Our NGINX Ingress Controller ran as a deployment with only 1 replica. Sometimes this replica became overwhelmed by the traffic we were getting and restarted. While the pod was restarting, we of course had no pods to receive traffic, which resulted in the timeouts we were seeing. This also explained our issues during Kubernetes upgrades: as nodes went down, we had to wait until a new pod started on another node and the load balancer in Google Cloud updated its status.
Lesson learned: make sure to read the high availability portion of the documentation before you start using your system in production!
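To make the fix concrete, here is a sketch of the relevant part of a controller Deployment with more than one replica. It is a fragment, not a complete manifest: the name, namespace, label, and replica count are hypothetical, and the container spec is left out because it depends on how the controller was installed. The anti-affinity rule is an addition we are assuming here to keep replicas on separate nodes, so that a single node going down cannot take every controller pod with it.

```yaml
# Fragment only: names, labels, and the replica count are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller   # hypothetical name
  namespace: ingress-nginx         # hypothetical namespace
spec:
  replicas: 3                      # more than one, so a restart leaves pods serving traffic
  selector:
    matchLabels:
      app: nginx-ingress           # hypothetical pod label
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      affinity:
        podAntiAffinity:
          # Never schedule two controller pods on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: nginx-ingress
              topologyKey: kubernetes.io/hostname
      # containers: unchanged from the existing installation (omitted here)
```

With a required anti-affinity rule, the cluster needs at least as many schedulable nodes as replicas; otherwise the extra pods stay pending.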