Fault tolerance in Workbench

Data Science & AI Workbench employs automatic service restarts and health monitoring to remain operational if a process halts or a worker node becomes unavailable. Additional levels of fault tolerance, such as service migration, are provided when the deployment has at least three nodes. However, the master node cannot currently be configured for automatic failover and remains a single point of failure.

When Workbench is deployed to a cluster with three or more nodes, the core services are automatically configured into a fault-tolerant mode, whether Workbench was initially deployed this way or scaled up later. As soon as three or more nodes are available, the service fault tolerance features take effect.

This means that in the event of any service failure:

  • Workbench core services will automatically be restarted or, if possible, migrated.

  • User-initiated project deployments will automatically be restarted or, if possible, migrated.

If a worker node becomes unresponsive or unavailable, it is flagged while the core services and backend continue to run without interruption. If additional worker nodes are available, the services that were running on the failed worker node are migrated or restarted on the remaining healthy worker nodes. This migration may take a few minutes.

The process for adding new worker nodes to the Workbench cluster is described in Adding and removing nodes (Gravity).

Storage and persistency layer

Workbench does not automatically configure fault tolerance for the storage and persistency layer when using the default storage and persistency services: the database, the Git server, and object storage. If you have configured Workbench to use external storage and persistency services, you must configure fault tolerance for those services yourself.

Recovering after node failure

Other than storage-related services (database, Git server, and object storage), all core Workbench services are resilient to master node failure.

To maintain operation of Workbench in the event of a master node failure, /opt/anaconda/ on the master node should be located on a redundant disk array or backed up frequently to avoid data loss. See Backing up and restoring Workbench for more information.
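If a redundant disk array is not available, frequent backups of /opt/anaconda/ are the fallback. A minimal sketch of such a backup, assuming a tar-based archive and a destination directory of your choosing (the function name, destination path, and rotation policy are illustrative, not part of Workbench; follow Backing up and restoring Workbench for the supported procedure):

```shell
#!/bin/sh
# Sketch: archive the master node's /opt/anaconda tree to a backup
# location. In a real deployment, point the destination at redundant
# or off-node storage and run this from cron.

backup_workbench() {
    src="${1:-/opt/anaconda}"            # tree to archive
    dest="${2:-/backups/workbench}"      # assumed backup location
    stamp="$(date +%Y%m%d-%H%M%S)"

    mkdir -p "$dest" || return 1
    # -C keeps archive paths relative so a restore can target any prefix.
    tar -czf "$dest/anaconda-$stamp.tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
}
```

A nightly cron entry invoking this function (or an equivalent rsync to a second disk) keeps the window of potential data loss to one day at most.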

To restore Workbench operations in the event of a master node failure:

  1. Create a new master node. Follow the installation process for adding a new cluster node, described in command-line installations. To create the new master node, specify --role=ae-master instead of --role=ae-worker.

  2. Restore data from a backup. After the installation of the new master node is complete, follow the instructions in Backing up and restoring Workbench.
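Step 2 can be sketched as restoring /opt/anaconda from the most recent archive, assuming a tar-based backup like the one described above (the function name, backup location, and archive naming are assumptions, not Workbench conventions; Backing up and restoring Workbench is the authoritative procedure):

```shell
#!/bin/sh
# Sketch: unpack the newest anaconda-*.tar.gz archive into the new
# master node's /opt/anaconda tree.

restore_workbench() {
    dest="${1:-/opt/anaconda}"           # restore target
    backups="${2:-/backups/workbench}"   # assumed backup location

    latest="$(ls -1t "$backups"/anaconda-*.tar.gz 2>/dev/null | head -n 1)"
    if [ -z "$latest" ]; then
        echo "no backup archive found in $backups" >&2
        return 1
    fi

    mkdir -p "$dest"
    # --strip-components=1 drops the archive's top-level directory so
    # the files land directly under the restore target.
    tar -xzf "$latest" -C "$dest" --strip-components=1
}
```

Run the restore before starting Workbench services on the new master so that the database, Git server, and object storage come up against the recovered data.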