Distributed Vaults during Vault Failure

Satellite Vault failure

If a Satellite Vault is unavailable, clients that have been working with this Satellite Vault will reconnect to another Vault, Satellite or Primary, depending on the order in the DNS SRV list or IP address list used by the client.
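
The ordered reconnection described above can be pictured with a minimal sketch. It is not CyberArk client code: the function `pick_vault`, the host names, and the stubbed reachability check are all hypothetical, and the sketch only assumes that the client walks its DNS SRV or IP address list in order and uses the first Vault that answers.

```python
from typing import Callable, Iterable, Optional


def pick_vault(addresses: Iterable[str],
               is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable address in list order, or None if none respond."""
    for address in addresses:
        if is_reachable(address):
            return address
    return None


# Hypothetical address list, in the order the client would try it.
vaults = ["satellite-a.example.com", "satellite-b.example.com", "primary.example.com"]

# Stubbed reachability check: pretend the first Satellite Vault is down.
down = {"satellite-a.example.com"}
print(pick_vault(vaults, lambda address: address not in down))
```

With the first Satellite Vault marked as down, the sketch selects the next entry in the list, which may be a Satellite or the Primary depending on how the list is ordered.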

Primary Vault failure

If a Primary Vault is unavailable, the following happens:

CyberArk clients connected to Satellite Vaults will act as follows:

  • Credential Providers continue working with the Satellite Vault that is configured in the DNS SRV record list or IP address list.
  • PVWA users will have read-only access to their assets, and will be notified when connectivity to the Primary Vault is restored. After connectivity to the Primary Vault is restored, users must log on to their local PVWA to restore write operations.

To maintain business continuity for all other PAS components, one of the Satellite Vaults must be promoted to the role of Primary Vault and all PAS components must be directed to the new Primary Vault.

A Satellite Vault can be promoted to Primary Vault status in one of the following ways:

  • Automatic Failover - A Satellite Vault can be promoted automatically to Primary Vault status. For more information, refer to Automatic failover.
  • Manual promotion - A Satellite Vault can be promoted manually to Primary Vault status. For more information, refer to Promote a Satellite Vault to Primary Vault.

Automatic failover

If a Primary Vault fails, one of the Satellite Vaults can be promoted automatically to Primary Vault status. This increases Vault availability and provides full Vault services for all clients when a Primary Vault fails, without any manual intervention.

Failover scenarios

During failover, the new Primary Candidate Vault tries to promote itself to the Primary Vault role. In addition, all other Satellite Vaults try to synchronize with the new Primary Vault.

  • Successful promotion

    The health of the new Primary Vault and the rest of the Satellite Vaults is "OK". All Satellite Vaults are synchronized with the new Primary Vault.

  • Partially successful promotion

    Not all the Satellite Vaults have successfully synchronized with the new Primary Vault. For example, the following health table will be displayed as a part of the promotion process:

    IP        Port    Role     Health
    1.1.1.1   33061   Master   OK
    1.1.1.2   33062   Slave    OK
    1.1.1.3   33063   Slave    Not OK

    This promotion succeeded with warnings. The new Primary Vault provides full Vault services and some of the Satellite Vaults are synchronized with the new Primary Vault.

    After automatic failover has occurred, the administrator needs to do the following:

    1. Check the logs for details about the problematic Satellite Vault, and fix the problem that's listed.

    2. Make sure that the Vault.ini specifies the address of the new Primary Vault.

    3. In PADR.ini, set the value of the NextBinaryLogNumberToStartAt parameter to -1.

    4. Restart the CyberArk Vault Disaster Recovery service.

    5. After full replication has been completed, restart the CyberArk Vault Disaster Recovery service.

    6. In PADR.log, check that replication was successful by searching for the following message: "PADR0099I Metadata Replication is running successfully".

  • Unsuccessful promotion

    • The new Primary Vault does not provide full Vault services after promotion:

      There is no Primary Vault that provides full Vault services.

      For example, the following health table will be displayed as a part of the promotion process:

      IP        Port    Role     Health
      1.1.1.1   33061   Master   Not OK
      1.1.1.2   33062   Slave    OK
      1.1.1.3   33063   Slave    OK

    The administrator needs to do the following:

    1. Check the logs for details about the problem, and fix it according to the listed messages.

    2. Promote the Satellite Vault to Primary Vault, as described in Promote a Satellite Vault to Primary Vault.

    3. In PADR.ini, set the EnableFailover parameter to no (as this node has been promoted to Primary).

    • The new Primary Vault is healthy after promotion, but all the Satellite Vaults failed to synchronize with the new Primary Vault:

    No Primary Vault provides full Vault services.

    For example, the following health table will be displayed as a part of the promotion process:

    IP        Port    Role     Health
    1.1.1.1   33061   Master   OK
    1.1.1.2   33062   Slave    Not OK
    1.1.1.3   33063   Slave    Not OK

    The administrator needs to do the following:

    1. Check the logs for details about the problem, and fix it according to the listed messages.

    2. Promote the Satellite Vault to Primary Vault, as described in Promote a Satellite Vault to Primary Vault.

    3. In PADR.ini, set the EnableFailover parameter to no (as this node has been promoted to Primary).
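
The three failover scenarios can be summarized as a small decision rule over the health table. The following is only an illustrative sketch: the function `classify_promotion` and the tuple layout are hypothetical, and the rule simply mirrors the scenarios as this section describes them (including treating "all Satellite Vaults failed to synchronize" as an unsuccessful promotion). It is not how the product itself evaluates health.

```python
from typing import List, Tuple

# One health-table row: (ip, port, role, health), as shown in the tables above.
Row = Tuple[str, int, str, str]


def classify_promotion(rows: List[Row]) -> str:
    """Map a health table to one of the three scenarios described in this section."""
    primary_ok = any(role == "Master" and health == "OK"
                     for _, _, role, health in rows)
    satellites_ok = [health == "OK"
                     for _, _, role, health in rows if role == "Slave"]

    if not primary_ok:
        return "unsuccessful"          # new Primary Vault is not healthy
    if satellites_ok and not any(satellites_ok):
        return "unsuccessful"          # no Satellite Vault synchronized
    if all(satellites_ok):
        return "successful"            # everything reports OK
    return "partially successful"      # some Satellite Vaults are not synchronized


partial = [
    ("1.1.1.1", 33061, "Master", "OK"),
    ("1.1.1.2", 33062, "Slave", "OK"),
    ("1.1.1.3", 33063, "Slave", "Not OK"),
]
print(classify_promotion(partial))
```

A result other than "successful" indicates that the administrator should follow the corresponding procedure listed under that scenario.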

Logs and Notifications

All logs created during the promotion process are written in the PADR.log. SNMP traps and ENE notifications can be configured to monitor failover events.

For more information about PADR.log, see Logging.
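
The replication check in the partially successful promotion procedure searches PADR.log for the PADR0099I message. A minimal sketch of that search is shown below; the function name `replication_succeeded` and the sample lines are hypothetical placeholders, not real log output, and the location of PADR.log on a given server is not assumed here.

```python
from typing import Iterable

# Message text quoted in the procedure above.
SUCCESS_MARKER = "PADR0099I Metadata Replication is running successfully"


def replication_succeeded(log_lines: Iterable[str]) -> bool:
    """Return True if any line contains the success message."""
    return any(SUCCESS_MARKER in line for line in log_lines)


# Placeholder lines for illustration only; real PADR.log entries look different.
sample_lines = [
    "placeholder: some earlier log entry",
    "placeholder: " + SUCCESS_MARKER,
]
print(replication_succeeded(sample_lines))
```

To run the check against an actual file, the open file object can be passed directly, for example `with open(path_to_padr_log, errors="replace") as f: replication_succeeded(f)`, where `path_to_padr_log` is whatever path applies in your installation.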

Promote a Satellite Vault to Primary Vault

A Satellite Vault can be promoted to become a Primary Vault and provide write services to all components in your Distributed Vaults environment in the following scenarios:

  • Vault Failover – A Satellite Vault is promoted when the original Primary Vault is down.

  • Vault Switchover – A Satellite Vault can be promoted when the Primary Vault is up. For example, before upgrades or system maintenance.

Failback upon Recovery

The CyberArk clients periodically query the DNS SRV record to retrieve the list of prioritized Vault addresses. If the list includes a Vault of higher priority than the current Vault, the client re-routes to that Vault and sends requests directly to it. This facilitates the following scenarios:

  • A failback after a failover to another Vault in the Distributed Vaults environment:

    For example:

    1. CyberArk clients are connected to Vault A.

    2. Vault A goes down.

    3. The CyberArk clients fail over and connect to Vault B.

    4. Vault A is repaired.

    5. The CyberArk clients fail back to work with Vault A.

  • Reprioritization of the list of Vaults in the SRV record.

The following parameter in the Vault.ini file determines how often the Credential Provider checks the SRV record:

FAILBACKINTERVAL – The number of seconds between the CyberArk client requests to check the SRV record. The default value is 1800 seconds (30 minutes).

For more information about the Vault parameter file, see Vault Parameter File.
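
The failback behavior can be sketched as a selection step that runs once per FAILBACKINTERVAL. This is an illustration under stated assumptions, not client code: `SrvRecord`, `failback_target`, and the reachability callback are hypothetical names, the sketch assumes the client only moves to a higher-priority Vault that is actually reachable, and it relies on the DNS SRV convention that a lower priority number means a more preferred target.

```python
from typing import Callable, List, NamedTuple

FAILBACK_INTERVAL_SECONDS = 1800  # default value of FAILBACKINTERVAL (30 minutes)


class SrvRecord(NamedTuple):
    priority: int  # DNS SRV convention: lower number = more preferred
    target: str


def failback_target(current: str,
                    records: List[SrvRecord],
                    is_reachable: Callable[[str], bool]) -> str:
    """Return the Vault the client should use after one SRV check."""
    for record in sorted(records, key=lambda r: r.priority):
        if record.target == current:
            return current            # nothing more preferred is available
        if is_reachable(record.target):
            return record.target      # fail back to the more preferred Vault
    return current


records = [SrvRecord(2, "vault-b.example.com"), SrvRecord(1, "vault-a.example.com")]

# Vault A has been repaired, so a client on Vault B fails back to Vault A.
print(failback_target("vault-b.example.com", records, lambda target: True))
```

In the Vault A and Vault B example above, the client stays on Vault B while Vault A is down and returns to Vault A at the first check after it is repaired.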