Auto-failover
An HA cluster with auto-failover configured is known as an auto-failover cluster. In an auto-failover cluster, if the Leader becomes unavailable, auto-failover enables the automatic promotion of a Standby to Leader.
This section describes auto-failover capabilities.
To add auto-failover to an existing HA cluster, see Configure auto-failover.
Benefits of auto-failover
The Leader and Standbys in an auto-failover cluster are aware of each other's availability. If the Leader fails, the remaining cluster members first determine if failover is possible and, if so, perform failover in a safe and coordinated manner.
An auto-failover cluster performs the following without any operator intervention:
- Detects a problem with the Leader
- Promotes a fully synchronized Standby
- Rebases the remaining Standbys to the new Leader
- Prevents the failed Leader from rejoining the cluster
How it works
Auto-failover is an add-on capability for an HA cluster. After you configure an HA cluster, do the following to add auto-failover to the cluster:
- Create a policy that describes and configures the auto-failover cluster
- Use evoke commands to enroll the cluster nodes into an etcd cluster
After you configure auto-failover on a cluster, a cluster service runs on the Leader and each Standby in the cluster. This cluster service implements the auto-failover functionality.
The cluster service incorporates the etcd service and the Raft consensus algorithm, both industry-accepted open source solutions, to monitor cluster nodes and perform auto-failover.
Component | Description |
---|---|
etcd service | A distributed configuration store that provides cluster coordination and state management |
Raft consensus algorithm | Helps determine when promotion is possible |
Failure detection and promotion
The following table describes the failover detection and promotion process.
Event | Description |
---|---|
Failure detection | In a healthy cluster, the Leader continuously resets a time-to-live (TTL) counter. When the Leader is unavailable to the other members of the cluster, it cannot reset the TTL. After a configured amount of time, the TTL expires. An expired TTL counter indicates failure of the Leader, for example because the Leader node itself has failed or because network problems make it unreachable to the rest of the cluster. By default, the TTL is set to 5 minutes, and auto-failover honors this wait time: if the problems causing unavailability are resolved before the timeout expires, cluster operation resumes without failover. You can configure the TTL value in the cluster policy file. For details, see Create and load cluster policy. This functionality is available from Conjur Enterprise v10.9. |
Promotion attempt | An expired TTL triggers promotion attempts within the auto-failover cluster. When the TTL expires, the Standbys recognize this condition and decide whether a promotion attempt should occur. A promotion attempt can proceed only if the remaining nodes form a quorum that does not include the Leader (see The quorum rule) and at least one Standby is fully synchronized and promotable. If these conditions are satisfied, the promotion attempt continues. With the help of the Raft consensus algorithm, Standbys identify themselves as candidates for promotion to Leader and race to acquire a promotion lock. The Standby that wins the lock attempts to promote itself and operate as the Leader; the other Standbys return to a normal Standby state. |
Successful promotion | In a successful promotion, the new Leader updates the cluster configuration store: it records itself as the new Leader and removes the old Leader from the cluster configuration. The remaining Standbys rebase to the new Leader and begin receiving continuous replication from it. The old Leader is evicted from the cluster. Even if it is physically healthy but temporarily unavailable due to network problems, it does not reappear in the cluster. |
Failed promotion | If promotion is unsuccessful, manual intervention is required to troubleshoot why the Leader is unavailable and why no Standbys are promotable. |
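The following sketch illustrates, in simplified Python, the TTL-expiry and promotion-lock ideas described in the table above. It is a conceptual simulation only, not Conjur's implementation: the ClusterStore class is an in-memory stand-in for the distributed configuration store, the node names are illustrative, and the TTL is shortened to seconds so the example runs quickly.

```python
import threading
import time

# Conceptual sketch only: simulates the TTL-expiry and promotion-lock behavior
# described above. Not Conjur's implementation; all names are illustrative.

TTL_SECONDS = 1.0


class ClusterStore:
    """In-memory stand-in for the distributed configuration store (etcd)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._leader_heartbeat = time.monotonic()
        self._promotion_lock_holder = None

    def reset_ttl(self):
        # A healthy Leader calls this continuously to keep the TTL from expiring.
        with self._lock:
            self._leader_heartbeat = time.monotonic()

    def ttl_expired(self):
        with self._lock:
            return time.monotonic() - self._leader_heartbeat > TTL_SECONDS

    def try_acquire_promotion_lock(self, candidate):
        # Only one Standby can win the promotion lock; the others stay Standbys.
        with self._lock:
            if self._promotion_lock_holder is None:
                self._promotion_lock_holder = candidate
                return True
            return False


def standby(name, store, results):
    # Each Standby waits for the TTL to expire, then races for the promotion lock.
    while not store.ttl_expired():
        time.sleep(0.05)
    if store.try_acquire_promotion_lock(name):
        results.append(f"{name}: won the promotion lock, promoting to Leader")
    else:
        results.append(f"{name}: lost the race, returning to Standby state")


if __name__ == "__main__":
    store = ClusterStore()
    results = []
    threads = [
        threading.Thread(target=standby, args=(f"standby-{i}", store, results))
        for i in (1, 2)
    ]
    for t in threads:
        t.start()
    # The Leader "fails" here simply by never calling store.reset_ttl() again,
    # so the TTL expires and the Standbys detect the failure.
    for t in threads:
        t.join()
    print("\n".join(results))
```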
The quorum rule
The quorum rule is the basis of the Raft consensus algorithm. It determines when a cluster is healthy enough to promote a Standby to Leader. This rule is the reason why there must always be an odd number of total nodes (an even number of Standbys) in the cluster.
The state of communication between all of the Standbys can affect whether auto-failover is possible when the Leader is unreachable. The Raft consensus algorithm uses majority rule with a leader to determine what happens in a failure situation; in Conjur, the Raft leader role is held by the cluster's Leader.
In a failure situation, some number of cluster nodes are unreachable, and a subset of nodes can still communicate with each other. If the nodes that can still communicate represent a majority (a quorum), those nodes decide how to respond to the failure.
Auto-failover does not occur in the following situations:
- If a majority of cluster nodes (a quorum) cannot communicate with each other, promotion attempts are not started.
- If the Leader is in the quorum, promotion attempts are not started.
If a quorum exists and the Leader is not part of it, the Leader is assumed unreachable and a new Leader is elected. It is not acceptable to have two leaders, so the protocol evicts the old Leader from the cluster and all communications with it shut down. In Conjur, this means that the old Leader cannot return to the cluster after a Standby is promoted, even if communications are successfully re-established.
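A minimal sketch of the quorum rule in Python, assuming an illustrative three-node cluster with node names such as "leader" and "standby-1". This is not Conjur's implementation; it shows only the majority check and the condition that the Leader must be outside the quorum for failover to start.

```python
# Conceptual sketch of the quorum rule only; not Conjur's implementation.
# Node names and cluster size are illustrative assumptions.

def has_quorum(reachable_nodes, cluster_size):
    """A partition may make failover decisions only if it holds a strict majority."""
    return len(reachable_nodes) > cluster_size // 2


def failover_allowed(reachable_nodes, cluster_size, leader="leader"):
    # Promotion attempts start only when a quorum exists AND the Leader is not
    # part of it (that is, the Leader is unreachable from the quorum).
    return has_quorum(reachable_nodes, cluster_size) and leader not in reachable_nodes


if __name__ == "__main__":
    cluster_size = 3  # one Leader and two Standbys

    # The Leader is reachable: no promotion is needed or allowed.
    print(failover_allowed({"leader", "standby-1", "standby-2"}, cluster_size))  # False

    # The Leader is down and only one Standby is reachable: no quorum, no failover.
    print(failover_allowed({"standby-1"}, cluster_size))  # False

    # The Leader is down but both Standbys can still communicate:
    # a quorum exists without the Leader, so failover can proceed.
    print(failover_allowed({"standby-1", "standby-2"}, cluster_size))  # True
```

This also illustrates why the cluster needs an odd total number of nodes: with an even split, neither side of a partition could hold a strict majority.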
The following diagram illustrates the quorum rule, showing results from various network problems. From left to right:
- In scenario 1, the Leader is available and promotion is not needed.
- In scenarios 2, 3, and 4, the Leader is unavailable, but the quorum rule (a majority of nodes still communicating with each other) is not satisfied, so failover does not occur.
- In scenario 5, the Leader is unavailable and the promotion conditions are satisfied, so promotion is triggered. Note that a successful promotion includes removing the Leader from the cluster, even though the Leader in this case could be a perfectly viable running machine.