Quorum and Gerrymandering: Thoughts on managing failover in a distributed HA system

Defining Quorum

Quorum is defined as the number of voters necessary to carry an election or vote. We would like all of our services to carry on functioning under various failure modes, which means our high-availability computer systems need access to valid, authoritative data at all times.

The Problem

We have found some failure cases where everything still works except the Raft-based HA services. The reason is that there are not enough members of the electorate available to form a consensus.

Raft Consensus Protocol

We are using Consul to generate and publish some DNS records.   Some of those DNS records are within data-centres, some are private between data-centres, and some are public facing external records.  Consul uses Raft as its mechanism for managing the quorum and assuring that a master is elected.

If we have n masters, Raft gives a quorum only if at least int(n/2)+1 of them are available, e.g. for 5 masters, 3 need to be able to communicate with each other to form a quorum. Ref: https://www.consul.io/docs/internals/consensus.html
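The majority rule is easy to get wrong by one, so here is a minimal sketch of the arithmetic. The helper names `quorum_size` and `failures_tolerated` are ours, purely for illustration; they are not part of Consul.

```python
def quorum_size(n: int) -> int:
    """Smallest number of voters that is a strict majority of n."""
    return n // 2 + 1


def failures_tolerated(n: int) -> int:
    """How many voters can be lost while a quorum is still possible."""
    return n - quorum_size(n)


for n in range(1, 8):
    print(f"{n} masters: quorum {quorum_size(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Note that an even number of masters buys nothing: 4 masters need a quorum of 3 and tolerate 1 failure, exactly the same as 3 masters.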

Failure Mechanisms

But there are a couple of scenarios where this Raft quorum can guarantee that our service goes down:

  1. A problem occurs when we try to use quorum between masters in 3 different data-centres.  If the connection between a data-centre and the other data-centres disappears then the local master becomes isolated.  It cannot process any local requests.  In one failure scenario an upstream router is misconfigured. Our customers and suppliers are still connected within LINX and all our servers are still active… but any of our services relying on a quorum of masters will stop working.  This risk is mitigated by configuring Consul as an independent cluster in each DC.  We can then use the multi-data-centre capabilities of Consul to distribute only the necessary shared states and Key-Value pairs across the WAN links.
  2. We found another, slightly less obvious gotcha.  If we build our services as a cluster of voters, and each voter is dependent on one of two PDUs, then it is impossible to build a cluster that will survive the failure of either PDU! If half or more of your voters are down it is impossible to form a majority.

Quorum and Leader in each DC: High-level view of Consul architecture.

Consul treats WAN and LAN differently in order to avoid trying to form a quorum over the WAN.
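The first scenario can be sketched with a few lines of arithmetic. This is a toy model, not Consul itself: we assume one master per data-centre for the stretched cluster and three servers per data-centre for the independent clusters, and the names `has_quorum` and `isolated_dc_has_quorum` are made up for the example.

```python
def has_quorum(reachable_voters: int, cluster_size: int) -> bool:
    """True when the reachable voters are a strict majority of the cluster."""
    return reachable_voters >= cluster_size // 2 + 1


def isolated_dc_has_quorum(voters_per_dc: dict, isolated_dc: str,
                           one_cluster_over_wan: bool) -> bool:
    """Can a DC that has lost its WAN links still form a quorum locally?"""
    local_voters = voters_per_dc[isolated_dc]
    if one_cluster_over_wan:
        # Every master in every DC belongs to the same electorate.
        cluster_size = sum(voters_per_dc.values())
    else:
        # Each DC runs its own electorate; only local voters count.
        cluster_size = local_voters
    return has_quorum(local_voters, cluster_size)


stretched = {"dc1": 1, "dc2": 1, "dc3": 1}  # assumed: one master per DC
per_dc = {"dc1": 3, "dc2": 3, "dc3": 3}     # assumed: three servers per DC

print(isolated_dc_has_quorum(stretched, "dc1", one_cluster_over_wan=True))   # False
print(isolated_dc_has_quorum(per_dc, "dc1", one_cluster_over_wan=False))     # True
```

In the stretched case the isolated master holds 1 vote out of 3 and can never win; with a cluster per DC the local electorate is untouched by the WAN failure.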

Rigging the ballot

This second scenario led us into the political minefield of how to effectively rig the ballot!  I think it is entirely possible that my life in Northern Ireland has influenced my views on voting…

Quoting Wikipedia: https://en.wikipedia.org/wiki/Gerrymandering

In the process of setting electoral districts, gerrymandering is a practice intended to establish a political advantage for a particular party or group by manipulating district boundaries. The resulting district is known as a gerrymander (/ˈdʒɛriˌmændər/); however, that word can also refer to the process. The term gerrymandering has negative connotations. Two principal tactics are used in gerrymandering: “cracking” (i.e. diluting the voting power of the opposing party’s supporters across many districts) and “packing” (concentrating the opposing party’s voting power in one district to reduce their voting power in other districts).

We designed a gerrymandering service such that in the event of a service failure we use our monitoring service to kick start additional voters.  The gerrymandering service could also remove defunct and unavailable voters from the pool.  The gerrymander would keep adding new voters until it achieves a quorum and life returns to normal.

We found that this has the unfortunate side effect that it is really hard to re-integrate the other voters if they start participating again, much like the N. Ireland situation… split brain and divergence lead to inconsistencies that must be tidied up by hand.  If we again use the parallel of N. Ireland politics, this is roughly the equivalent of trying to get Unionist and Nationalist politicians in our Northern Ireland Assembly to reach a consensus.
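To see one way the divergence can arise, here is a toy model of the idea. It is not the code we ran: the `gerrymander` function, the voter names, and the assumption that each side of a split runs its own copy of the gerrymander are all made up for illustration.

```python
def quorum_size(n: int) -> int:
    """Smallest number of voters that is a strict majority of n."""
    return n // 2 + 1


def gerrymander(electorate: set, reachable: set, prefix: str) -> set:
    """Add fresh voters until the reachable members form a majority."""
    electorate = set(electorate)
    reachable = set(reachable)
    added = 0
    while len(electorate & reachable) < quorum_size(len(electorate)):
        added += 1
        new_voter = f"{prefix}-extra{added}"  # hypothetical new voter name
        electorate.add(new_voter)
        reachable.add(new_voter)
    return electorate


original = {"a1", "a2", "b1", "b2"}

# Each half can only see its own two voters, so each rigs its own ballot.
side_a = gerrymander(original, {"a1", "a2"}, "a")
side_b = gerrymander(original, {"b1", "b2"}, "b")

print(sorted(side_a))  # ['a-extra1', 'a1', 'a2', 'b1', 'b2']
print(sorted(side_b))  # ['a1', 'a2', 'b-extra1', 'b1', 'b2']
print(side_a == side_b)  # False
```

Both halves now believe they hold 3 votes out of 5, both will happily elect a leader and accept writes, and they no longer agree on who the electorate is. That disagreement is what has to be untangled by hand once the halves can see each other again.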

A better solution

After some additional head scratching we ditched the Voxbit Agreement Gerrymandering Under Extremis (V.A.G.U.E.) service.  Instead we made sure that some of the voters in any consensus-based service are fed with power from both PDUs, and that the other voters are evenly spread across the 2 PDUs.
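A quick way to sanity check a power layout is to fail each PDU in turn and count the survivors. The sketch below does that; the function name `survives_any_pdu_failure`, the voter names, and the example layouts are illustrative assumptions, not our actual rack plan.

```python
def quorum_size(n: int) -> int:
    """Smallest number of voters that is a strict majority of n."""
    return n // 2 + 1


def survives_any_pdu_failure(voters: dict) -> bool:
    """voters maps a voter name to the set of PDUs feeding it."""
    pdus = set().union(*voters.values())
    for failed in pdus:
        # A voter stays up if at least one of its feeds is still live.
        alive = [name for name, feeds in voters.items() if feeds - {failed}]
        if len(alive) < quorum_size(len(voters)):
            return False
    return True


# Every voter on exactly one PDU: the PDU holding half or more of them is fatal.
even_split = {"v1": {"A"}, "v2": {"A"}, "v3": {"B"}, "v4": {"B"}}
odd_split = {"v1": {"A"}, "v2": {"A"}, "v3": {"B"}}

# One dual-fed voter, the rest spread evenly across the two PDUs.
dual_fed = {"v1": {"A", "B"}, "v2": {"A"}, "v3": {"B"}}

print(survives_any_pdu_failure(even_split))  # False
print(survives_any_pdu_failure(odd_split))   # False
print(survives_any_pdu_failure(dual_fed))    # True
```

With single-fed voters one PDU always carries at least half the electorate, so losing it always loses the quorum. The dual-fed voter acts as the tie-breaker that stays up whichever PDU fails.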

I’m not quite sure yet what the political equivalent of that would be.
