Auto-failover

An HA cluster configured with auto-failover is known as an auto-failover cluster. In an auto-failover cluster, if the Leader becomes unavailable, auto-failover enables the automatic promotion of a Standby to Leader.

This section describes auto-failover capabilities.

To add auto-failover to an existing HA cluster, see Configure auto-failover.

Benefits of auto-failover

The Leader and Standbys in an auto-failover cluster are aware of each other's availability. If the Leader fails, the remaining cluster members first determine if failover is possible and, if so, perform failover in a safe and coordinated manner.

An auto-failover cluster performs the following without any operator intervention:

  • Detects a problem with the Leader

  • Promotes a fully synchronized Standby

  • Rebases the remaining Standbys to the new Leader

  • Prevents the failed Leader from rejoining the cluster

How it works

Auto-failover is an add-on capability for an HA cluster. After configuring an HA cluster, do the following to add auto-failover:

  • Create a policy that describes and configures the auto-failover cluster

  • Use evoke commands to enroll the cluster nodes into an etcd cluster

After you configure auto-failover on a cluster, a cluster service runs on the Leader and each Standby in the cluster. This cluster service implements the auto-failover functionality.

The cluster service incorporates the etcd service and the Raft consensus algorithm, both industry-accepted open source solutions, to monitor cluster nodes and perform auto-failover.

etcd service

  • Maintains shared configuration about the current Leader
  • Understands the state of the cluster
  • Determines when to promote a Standby to Leader
  • Manages the promotion

Raft consensus algorithm

  • Helps determine when a promotion is possible

The resulting distributed configuration store provides cluster coordination and state management.
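To make the idea of the shared configuration store concrete, here is a minimal Python sketch, assuming a toy in-memory store rather than etcd's real API. The class and field names are hypothetical; they only illustrate the kind of state (current Leader, cluster membership, a liveness record) that the cluster service coordinates on.

```python
import time
from dataclasses import dataclass, field

# Hypothetical, simplified model of the shared configuration store.
# A real cluster keeps this state in etcd; this toy version only shows
# the kind of information the cluster service coordinates on.

@dataclass
class ClusterStore:
    leader: str                          # node currently recorded as Leader
    members: set = field(default_factory=set)
    last_heartbeat: float = field(default_factory=time.monotonic)

    def heartbeat(self, node: str) -> None:
        """Called periodically by the Leader to show it is still alive
        (the time-to-live mechanism is described under Failure detection)."""
        if node == self.leader:
            self.last_heartbeat = time.monotonic()


store = ClusterStore(leader="leader-1",
                     members={"leader-1", "standby-1", "standby-2"})
store.heartbeat("leader-1")   # a healthy Leader keeps refreshing its record
print(store.leader, sorted(store.members))
```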

Failure detection and promotion

The following sections describe the failure detection and promotion process.

Failure detection

In a healthy cluster, the Leader continuously resets a time-to-live (TTL) counter. When the Leader is unavailable to other members of the cluster, it cannot reset the TTL. After a configured amount of time, the TTL expires.

An expired TTL counter indicates failure of the Leader.

Reasons for a failed Leader include: 

  • The Leader was terminated

  • The machine running the Leader is experiencing problems

  • Network problems are affecting availability of the Leader to other members of the cluster

This functionality is available from Conjur Enterprise v10.9.

By default, the TTL is set to 5 minutes, and auto-failover waits the full TTL before acting. If the problems causing unavailability are resolved before the TTL expires, cluster operation resumes without failover. You can configure the TTL value in the cluster policy file. For details, see Create and load cluster policy.
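As an illustration of the timing only (the 300-second default and the function name are a sketch, not Conjur's code), a Standby treats the Leader as failed only after the full TTL has elapsed:

```python
# Illustrative only: a Standby considers the Leader failed only after the
# configured TTL has fully elapsed. 300 seconds is the documented default.

DEFAULT_TTL_SECONDS = 300  # 5 minutes


def leader_considered_failed(seconds_since_last_reset: float,
                             ttl_seconds: int = DEFAULT_TTL_SECONDS) -> bool:
    return seconds_since_last_reset > ttl_seconds


# Leader unreachable for 2 minutes, then recovers: no failover is triggered.
print(leader_considered_failed(120))   # False -> cluster resumes as-is

# Leader silent for 6 minutes: the TTL has expired.
print(leader_considered_failed(360))   # True  -> promotion attempt begins
```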

Promotion attempt

An expired TTL triggers promotion attempts within the auto-failover cluster.

When the TTL expires, the Standbys recognize this condition and decide whether a promotion attempt should occur. A promotion attempt can proceed only if all of the following conditions are true: 

  • The cluster has at least two Standbys

  • A majority of the cluster nodes (a quorum) can still communicate with each other - see The quorum rule

  • At least one Standby in the quorum is fully current with the Leader. A promotion requires one healthy Standby running in synchronous mode.

If all of the promotion conditions are satisfied, the promotion attempt continues. With the help of a consensus algorithm, Standbys identify themselves as candidates for promotion to Leader. They race to get a promotion lock. The Standby that wins attempts to promote itself and operate as the Leader. The other Standbys return to a normal Standby state.
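The precondition checks and the promotion-lock race can be sketched as follows. This is a hypothetical illustration, assuming an in-process compare-and-set lock standing in for the promotion lock that the real cluster keeps in its configuration store; all names are illustrative.

```python
import threading

# Hypothetical sketch of a promotion attempt. PromotionLock stands in for
# the promotion lock the real cluster keeps in its configuration store.

def promotion_possible(standby_count: int,
                       reachable_nodes: int,
                       total_nodes: int,
                       has_synchronous_standby: bool) -> bool:
    """All three documented preconditions must hold before any Standby races."""
    has_enough_standbys = standby_count >= 2
    has_quorum = reachable_nodes > total_nodes // 2
    return has_enough_standbys and has_quorum and has_synchronous_standby


class PromotionLock:
    """A toy compare-and-set lock: only the first candidate to ask wins."""
    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self.holder = None

    def try_acquire(self, candidate: str) -> bool:
        with self._mutex:
            if self.holder is None:
                self.holder = candidate
                return True
            return False


lock = PromotionLock()
if promotion_possible(standby_count=2, reachable_nodes=2,
                      total_nodes=3, has_synchronous_standby=True):
    for standby in ("standby-1", "standby-2"):
        if lock.try_acquire(standby):
            print(f"{standby} wins the lock and attempts to promote itself")
        else:
            print(f"{standby} returns to a normal Standby state")
```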

Successful promotion

In a successful promotion, the new Leader updates the cluster configuration store. It records itself as the new Leader and removes the old Leader from the cluster configuration. The remaining Standbys rebase to the new Leader and begin receiving continuous replication from the new Leader.

The old Leader is evicted from the cluster. Even if it is physically healthy but temporarily unavailable due to network problems, it does not reappear in the cluster.
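In the same simplified model, a successful promotion amounts to rewriting the shared cluster record: the winner is recorded as Leader and the old Leader is removed from membership. The dictionary layout below is illustrative only, not Conjur's data format.

```python
# Illustrative only: what changes in the shared cluster record after a
# successful promotion.

cluster = {
    "leader": "leader-1",
    "members": ["leader-1", "standby-1", "standby-2"],
}

def promote(cluster: dict, new_leader: str) -> dict:
    """Record the new Leader and evict the old one from the cluster."""
    old_leader = cluster["leader"]
    members = [m for m in cluster["members"] if m != old_leader]
    return {"leader": new_leader, "members": members}

cluster = promote(cluster, "standby-1")
print(cluster)
# {'leader': 'standby-1', 'members': ['standby-1', 'standby-2']}
# standby-2 now rebases to standby-1 and resumes replication from it;
# leader-1 stays evicted even if it becomes reachable again.
```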

Failed promotion

If promotion is unsuccessful, manual intervention is required to troubleshoot why the Leader is unavailable and why no Standby is promotable.

The quorum rule

The quorum rule is the basis of the Raft consensus algorithm. It determines when a cluster is healthy enough to promote a Standby to Leader. This rule is why the cluster must always contain an odd total number of nodes (and therefore an even number of Standbys): with an odd total, any network partition leaves at most one group holding a majority, so there can never be a tie.

The state of communication between all of the Standbys can affect whether auto-failover is possible when the Leader is unreachable. The Raft consensus algorithm uses majority rule, coordinated through a leader, to determine what happens in a failure situation. In Conjur, the Raft leader is the Conjur Leader.

In a failure situation, some number of cluster nodes are unreachable, and a subset of nodes are still able to communicate with each other. If the number of nodes still able to communicate represents a majority (i.e. a quorum), those nodes make decisions about what to do about the failure.

Auto-failover does not occur in the following situations: 

  • If the majority of cluster nodes (a quorum) cannot communicate with each other, promotion attempts are not started.

  • If the Leader is in the quorum, promotion attempts are not started.

If a quorum exists and the Leader is not part of it, the Leader is assumed unreachable and a new Leader is elected. It is not acceptable to have two leaders, so the protocol evicts the old Leader from the cluster and all communications with it shut down. In Conjur, this means that the old Leader cannot return to the cluster after a Standby is promoted, even if communications are successfully re-established.
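The quorum arithmetic itself is simple: a group of nodes can act only if it holds a strict majority of the total membership. The sketch below, assuming a three-node cluster (one Leader and two Standbys), shows why a two-node partition has quorum while an isolated node does not, and why an even total can leave no side able to act.

```python
# Illustrative quorum check: a partition can act only with a strict
# majority of the cluster's total membership.

def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    return reachable_nodes > total_nodes // 2


TOTAL = 3  # one Leader + two Standbys

# Leader unreachable, both Standbys still see each other: 2 of 3 is a majority.
print(has_quorum(2, TOTAL))   # True  -> a promotion attempt may proceed

# Every node isolated from the others: 1 of 3 is not a majority.
print(has_quorum(1, TOTAL))   # False -> no failover

# With an even total of 4, a 2-2 split leaves neither side with a majority,
# which is why the cluster must keep an odd total number of nodes.
print(has_quorum(2, 4))       # False
```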

The following diagram illustrates the quorum rule, showing results from various network problems. From left to right:

  • In scenario 1, the Leader is available and promotion is not needed.

  • In scenarios 2, 3, and 4, the Leader is unavailable but the quorum rule (majority of nodes still communicating with each other) is not satisfied, so failover does not occur.

  • In scenario 5, the Leader is unavailable and promotion conditions are satisfied. Promotion is triggered. Note that a successful promotion includes removing the Leader from the cluster, even though the Leader in this case could be a perfectly viable running machine.