Fault tolerance in Anaconda Enterprise

Anaconda Enterprise uses automatic service restarts and health monitoring to remain operational if a process halts or a worker node becomes unavailable. Additional fault tolerance, such as service migration, is available when the deployment has at least three nodes. However, the master node cannot currently be configured for automatic failover, so it remains a single point of failure.

When Anaconda Enterprise is deployed to a cluster with three or more nodes, the core services are automatically configured in fault-tolerant mode. This applies whether the cluster is initially installed with three or more nodes or reaches that size later: the service fault tolerance features take effect as soon as three or more nodes are available.

This means that in the event of any service failure:

  • Anaconda Enterprise core services will automatically be restarted or, if possible, migrated.

  • User-initiated project deployments will automatically be restarted or, if possible, migrated.
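As a rough way to check whether a cluster meets the three-node threshold described above, the sketch below counts the nodes that report `Ready` in the JSON produced by `kubectl get nodes -o json`. It assumes the cluster exposes the standard Kubernetes node API and that `kubectl` is available to the administrator. The function names are hypothetical and are not part of Anaconda Enterprise.

```python
import json

# Threshold taken from the text above: fault-tolerant mode needs three or more nodes.
MIN_NODES_FOR_FAULT_TOLERANCE = 3


def count_ready_nodes(nodes_json: str) -> int:
    """Count nodes whose "Ready" condition is "True" in `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json).get("items", [])
    ready = 0
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            ready += 1
    return ready


def meets_fault_tolerance_threshold(nodes_json: str) -> bool:
    """Hypothetical helper: True when at least three nodes are Ready."""
    return count_ready_nodes(nodes_json) >= MIN_NODES_FOR_FAULT_TOLERANCE
```

The input string could be captured by running `kubectl get nodes -o json` on a machine with cluster access. Counting only `Ready` nodes, rather than all registered nodes, is a design choice of this sketch.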

If a worker node becomes unresponsive or unavailable, it is flagged while the core services and backend continue to run without interruption. If additional worker nodes are available, the services that had been running on the failed worker node are migrated or restarted on the remaining live worker nodes. This migration may take a few minutes.
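The behavior described above can be pictured with a small model. The sketch below is only an illustration: the least-loaded placement rule, the node names, and the function name are assumptions, not the scheduling logic Anaconda Enterprise actually uses.

```python
from typing import Dict, List


def reassign_services(
    placement: Dict[str, List[str]], failed_node: str
) -> Dict[str, List[str]]:
    """Return a new placement with the failed node's services moved to live nodes.

    Illustrative only: the least-loaded rule below is an assumption, not the
    scheduler that Anaconda Enterprise uses.
    """
    # Copy the placement of every node except the failed one.
    live = {
        node: list(services)
        for node, services in placement.items()
        if node != failed_node
    }
    orphaned = placement.get(failed_node, [])
    if not live:
        # No other worker is available, so nothing can be migrated.
        raise RuntimeError("no live worker nodes available for migration")
    # Place each orphaned service on the currently least-loaded live node,
    # breaking ties by node name so the result is deterministic.
    for service in orphaned:
        target = min(live, key=lambda node: (len(live[node]), node))
        live[target].append(service)
    return live
```

For example, if `worker-1` hosts two services and fails while `worker-2` and `worker-3` are live, both services end up on the remaining workers. With no live workers left, the function raises an error, mirroring the condition that migration requires additional worker nodes.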

The process for adding new worker nodes to the Anaconda Enterprise cluster is described in Adding and removing nodes (Gravity).

Storage and persistency layer

Anaconda Enterprise does not automatically configure fault tolerance for the storage and persistency layer when using the default storage and persistency services. This layer includes the database, Git server, and object storage. If you have configured Anaconda Enterprise to use external storage and persistency services, you will need to configure fault tolerance for those services yourself.

Recovering after node failure

Other than storage-related services (database, Git server, and object storage), all core Anaconda Enterprise services are resilient to master node failure.

To maintain operation of Anaconda Enterprise in the event of a master node failure, /opt/anaconda/ on the master node should be located on a redundant disk array or backed up frequently to avoid data loss. See Backing up and restoring Anaconda Enterprise for more information.
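For the "backed up frequently" option, a generic file-level archive of a directory might look like the sketch below. It uses only the Python standard library. The function name and the backup destination are hypothetical, and it is not the procedure documented in Backing up and restoring Anaconda Enterprise, which remains the reference for a restorable backup.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source_dir: str, backup_dir: str) -> Path:
    """Write a timestamped .tar.gz of source_dir into backup_dir and return its path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # A timestamp in the file name keeps successive backups from overwriting each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = target_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the absolute path.
        tar.add(str(source), arcname=source.name)
    return archive_path
```

A call such as `archive_directory("/opt/anaconda", "/mnt/backups")` would be scheduled on the master node, where `/mnt/backups` is an assumed location on separate storage. Archiving files while services are writing to them may produce an inconsistent copy, which is one reason to prefer the documented backup process.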

To restore Anaconda Enterprise operations in the event of a master node failure:

  1. Create a new master node. Follow the installation process for adding a new cluster node, described in command-line installations.

Note

To create the new master node, specify --role=ae-master instead of --role=ae-worker.

  2. Restore data from a backup. After the installation of the new master node is complete, follow the instructions in Backing up and restoring Anaconda Enterprise.