Over the past few weeks, we’ve had a few service interruptions that can all be traced back to one cause: instability in our high-availability load balancer setup. Here’s a brief summary of our existing load balancing setup, a look at its problems, and what we’re doing to fix them.
To handle all of GitHub’s incoming HTTP, SSH and Git traffic, we run quite a few frontend servers. In front of these servers, we run an IPVS load balancer to distribute incoming traffic, while sending reply traffic via direct routing. When this server fails, GitHub is down.
IPVS doesn’t require beefy hardware, so in our initial deployment, we set it up on a small Xen virtual server. We run a handful of Xen hosts inside our private network to power utility servers that don’t require the power of a dedicated server: SMTP, monitoring, serving HTML for GitHub Pages, etc. A few of these services, like Pages, require high availability.
To achieve high availability for the virtual servers that require it, we combine LVM, DRBD, Pacemaker, and Heartbeat to run a pair of Xen virtual servers. For example, right now GitHub Pages is being served from a virtual server running on xen3. This server has a DRBD mirror on xen1. If Heartbeat detects that the pages virtual server on xen3 isn’t responding, it automatically shuts it down, adjusts the DRBD config to make the LVM volume on xen1 the primary device, then starts the virtual server on xen1.
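The order of operations in that failover matters, so here is a minimal sketch of it in Python. It is only an illustration of the three steps described above, not how Heartbeat or Pacemaker are actually implemented; the `Host` class and `fail_over` function are hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of one Xen host in the HA pair."""
    name: str
    drbd_role: str  # "primary" or "secondary"
    running: set = field(default_factory=set)


def fail_over(vm: str, active: Host, standby: Host) -> list[str]:
    """Move `vm` from `active` to `standby` and return the steps taken."""
    steps = []
    # 1. Shut down the unresponsive virtual server on the active host.
    active.running.discard(vm)
    steps.append(f"stop {vm} on {active.name}")
    # 2. Swap DRBD roles so the standby's LVM volume becomes the primary device.
    active.drbd_role, standby.drbd_role = "secondary", "primary"
    steps.append(f"promote DRBD volume on {standby.name}")
    # 3. Start the virtual server on the newly promoted host.
    standby.running.add(vm)
    steps.append(f"start {vm} on {standby.name}")
    return steps
```

Calling `fail_over("pages", xen3, xen1)` with xen3 as the primary returns the three steps in order and leaves xen1 holding the primary DRBD role. Note that step 1 has to succeed before anything else can happen, which is exactly where things go wrong under load.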
However, in recent weeks, we’ve had a few occurrences where load on our Xen servers spiked significantly, causing these heartbeat checks to time out repeatedly even when the services they were checking were working correctly. In some of these cases, the repeated timeouts caused a dramatic downward spiral in service as the following sequence of events unfolded:
1. Pacemaker begins moving the virtual server to xen1. It starts by attempting to stop the virtual server on xen3, but this times out due to high load.
2. Pacemaker concludes that xen3 is dead since a management command has failed, and decides that the only way to regain control of the cluster is to remove it completely. The node is STONITH’d via an IPMI command that powers down the server via its out-of-band management card.
3. Once xen3 is confirmed powered off, Pacemaker starts the virtual servers previously residing on the now-dead xen3 on xen1, and notifies us we’ll need to manually intervene to get xen3 back up and running.
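To make the spiral easier to follow, here is a toy model of it in Python. This is a sketch under stated assumptions, not a reproduction of our cluster's behavior: the function name, the baseline durations (1 second for a check, 10 seconds for a stop on an idle host), and both timeout values are hypothetical numbers chosen only to show how load alone can push a healthy node all the way to being powered off.

```python
from __future__ import annotations


def simulate_check(load_factor: float,
                   check_timeout_s: float = 5.0,
                   stop_timeout_s: float = 60.0) -> list[str]:
    """Return the events triggered by one heartbeat check at a given load."""
    # Hypothetical assumption: both operations slow down linearly with load.
    check_s = 1.0 * load_factor
    stop_s = 10.0 * load_factor

    if check_s <= check_timeout_s:
        return ["heartbeat check passed: no action"]

    events = ["heartbeat check timed out: begin failover from xen3 to xen1"]
    if stop_s <= stop_timeout_s:
        # The stop command finished in time, so this is an ordinary failover.
        events.append("virtual server stopped cleanly on xen3")
        events.append("start the virtual server on xen1")
        return events

    # The stop command also timed out, so the whole node is presumed dead.
    events.append("stop command on xen3 timed out: node presumed dead")
    events.append("STONITH: power off xen3 via IPMI")
    events.append("start all of xen3's virtual servers on xen1")
    events.append("notify operators: xen3 needs manual recovery")
    return events
```

In this model, `simulate_check(8.0)` walks the full path to STONITH even though the service itself never failed; the only thing that changed was the load on the host.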
If the Xen server that is killed was running our load balancer at the time, HTTPS and Git traffic to GitHub stays down until it comes back up. To make matters worse, our load balancers occasionally require manual intervention to get into their proper state after reboot due to a bug in their init scripts.
After recovering from the outage early Saturday morning, we realized that our current HA configuration was causing more downtime than it was preventing. We needed to make it less aggressive, and isolate the load balancers from any impact in the event of another service’s failure.
Over the weekend we made the following changes to make our HA setup less aggressive:
These changes alone have reduced the average load and load variance across our Xen cluster by a good bit.
More importantly, there hasn’t been a single false heartbeat alert since the change, and we don’t anticipate any more soon.
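As a generic illustration of why a less aggressive check produces fewer false alerts (and not a description of the specific changes we made), here is a minimal sketch that only declares a node failed after several consecutive timeouts. The `DebouncedCheck` class and its threshold of 3 are hypothetical.

```python
class DebouncedCheck:
    """Report failure only after N consecutive failed checks (hypothetical)."""

    def __init__(self, failures_required: int = 3):
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True if the node should be declared failed."""
        if check_passed:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required
```

With a check like this, an isolated timeout during a load spike is absorbed, while a node that really is down still trips the alert after a few checks. The trade-off is slower detection of genuine failures.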
We’re also ordering a pair of servers on which we’ll run a dedicated HA pair for our load balancers. Once these are in place, our load balancers will be completely isolated from any HA Xen virtual server failure, legitimate or not.
Of course, we’re also working on improving the configuration of the load balancers to reduce the MTTR in the event of any legitimate load balancer failure.
We’ve just recently brought on a few new sysadmins (myself included), and are doubling down on stability and infrastructure improvements in the coming months. Thanks for your patience as we work to improve the GitHub experience as we grow!