# Recent Load Balancer Problems

Over the past few weeks, we’ve had a few service interruptions that can all be traced back to a single cause: instability in our high-availability load balancer setup. Here’s a brief summary of our existing load balancing setup, a look at its problems, and what we’re doing to fix it.

![](https://img.skitch.com/20111003-tjgih3up6j8rr3b6gunjn4agjt.jpg)

## Load Balancing at GitHub

To handle all of GitHub’s incoming HTTP, SSH and Git traffic, we run quite a few frontend servers. In front of these servers, we run an IPVS load balancer to distribute incoming traffic, while sending reply traffic via direct routing. When this server fails, GitHub is down.
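As a rough sketch of what this looks like, an IPVS virtual service with direct-routing real servers is configured via `ipvsadm` along these lines. The VIP and real-server addresses below are made-up placeholders, not our actual topology:

```shell
# Define a virtual service on the VIP, round-robin scheduling.
ipvsadm -A -t 203.0.113.10:443 -s rr

# Add frontend real servers in direct-routing mode (-g, "gatewaying"),
# so reply traffic bypasses the director and goes straight to the client.
ipvsadm -a -t 203.0.113.10:443 -r 10.0.0.21:443 -g
ipvsadm -a -t 203.0.113.10:443 -r 10.0.0.22:443 -g

# Inspect the current virtual server table.
ipvsadm -L -n
```

The direct-routing mode is what keeps the director cheap to run: it only sees inbound packets, never the (much larger) responses.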

IPVS doesn’t require beefy hardware, so in our initial deployment, we set it up on a small Xen virtual server. We run a handful of Xen hosts inside our private network to power utility servers that don’t require the power of a dedicated server - SMTP, monitoring, serving HTML for GitHub Pages, and so on. A few of these services - Pages among them - require high availability.

## HA Virtual Servers using Linux-HA, DRBD, and Xen

To achieve high availability of virtual servers that require it, we combine LVM, DRBD, Pacemaker, and Heartbeat to run a pair of Xen virtual servers. For example, right now GitHub Pages are being served from a virtual server running on xen3. This server has a DRBD mirror on xen1. If Heartbeat detects that the Pages virtual server on xen3 isn’t responding, it automatically shuts it down, adjusts the DRBD config to make the LVM volume on xen1 the primary device, then starts the virtual server on xen1.
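In Pacemaker’s `crm` shell, a resource pair like this is typically expressed as a DRBD master/slave set plus a Xen domain constrained to run where DRBD is primary. The resource names and file paths below are hypothetical, and a real config carries more constraints than shown:

```
# DRBD resource mirroring the Pages LVM volume between the two hosts.
primitive drbd_pages ocf:linbit:drbd \
    params drbd_resource="pages" \
    op monitor interval="30s"

# Master/slave set: exactly one node holds the primary copy at a time.
ms ms_drbd_pages drbd_pages \
    meta master-max="1" clone-max="2" notify="true"

# The Xen domain itself, managed through its domain config file.
primitive pages_vm ocf:heartbeat:Xen \
    params xmfile="/etc/xen/pages.cfg" \
    op monitor interval="60s" timeout="120s"

# The VM may only run where DRBD is primary, and only after promotion.
colocation pages_on_drbd inf: pages_vm ms_drbd_pages:Master
order pages_after_drbd inf: ms_drbd_pages:promote pages_vm:start
```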

## Enter STONITH

However, in recent weeks, we’ve had a few occurrences where load on our Xen servers spiked significantly, causing these Heartbeat checks to time out repeatedly even when the services they were checking were working correctly. On a few occasions, the repeated timeouts triggered a dramatic downward spiral in service as the following sequence of events unfolded:

  1. Heartbeat checks time out for the pages virtual server on xen3.
  2. Pacemaker starts the process of transitioning the virtual server to xen1. It starts by attempting to stop the virtual server on xen3, but this times out due to high load.
  3. Pacemaker now determines that xen3 is dead since a management command has failed, and decides that the only way to regain control of the cluster is to remove it completely. The node is STONITH‘d via an IPMI command that powers down the server via its out-of-band management card.
  4. Once xen3 is confirmed powered-off, Pacemaker starts the virtual servers previously residing on the now-dead xen3 on xen1, and notifies us we’ll need to manually intervene to get xen3 back up and running.
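The fencing action in step 3 ultimately boils down to an out-of-band power command. With `ipmitool`, it looks roughly like this; the management hostname and credentials here are placeholders, not our actual setup:

```shell
# Power the node off via its out-of-band management card.
ipmitool -I lanplus -H xen3-mgmt.example -U fenceuser -P "$IPMI_PASS" \
    chassis power off

# Pacemaker confirms the fence succeeded before failing resources over.
ipmitool -I lanplus -H xen3-mgmt.example -U fenceuser -P "$IPMI_PASS" \
    chassis power status
```

The key property of STONITH is that it is unconditional: the cluster doesn’t ask the node to shut down cleanly, it cuts power, which is exactly why a false positive is so expensive.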

If the Xen server that gets killed happens to be running our load balancer at the time, HTTPS and Git traffic to GitHub stays down until it comes back up. To make matters worse, our load balancers occasionally require manual intervention after a reboot to get into their proper state, due to a bug in their init scripts.

## A Path to Stability

After recovering from the outage early Saturday morning, we came to the realization that our current HA configuration was causing more downtime than it was preventing. We needed to make it less aggressive, and to isolate the load balancers from any impact in the event of another service’s failure.

Over the weekend we made the following changes to make our HA setup less aggressive:

  • Significantly reduce the frequency of Heartbeat checks between virtual server pairs
  • Significantly increase the timeouts of these Heartbeat checks
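In Heartbeat’s `ha.cf`, the knobs for both of these changes are the keepalive interval and the dead-node timeouts. The values below are illustrative only, not our production settings:

```
# Send heartbeats less often (seconds between keepalive packets).
keepalive 5

# Warn, but take no action, if heartbeats are merely late.
warntime 30

# Only declare a node dead after a much longer silence,
# so a transient load spike can't trigger a failover.
deadtime 60

# Be extra patient right after boot, when load is at its highest.
initdead 120
```

The trade-off is straightforward: a longer `deadtime` means a genuinely dead node takes longer to fail over, but a spurious STONITH of a healthy node costs far more than the extra seconds of detection time.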

These changes alone have reduced the average load and load variance across our Xen cluster by a good bit:

![](https://img.skitch.com/20111003-x699hhu5afhkb3kypf7ri8pkf5.jpg)

More importantly, there hasn’t been a single false heartbeat alert since the change, and we don’t anticipate any more soon.

We’re also ordering a pair of servers on which we’ll run a dedicated HA pair for our load balancers. Once these are in place, our load balancers will be completely isolated from any HA Xen virtual server failure, legitimate or not.

Of course, we’re also working on improving the configuration of the load balancers to reduce the MTTR in the event of any legitimate load balancer failure.

We’ve just recently brought on a few new sysadmins (myself included), and are doubling down on stability and infrastructure improvements in the coming months. Thanks for your patience as we work to improve the GitHub experience as we grow!
