Preparing a Gravity environment for Workbench#

Determining the resource requirements for a Kubernetes cluster depends on a number of factors, including the types of applications you plan to run, the number of users active at once, and the workloads you will manage within the cluster. Data Science & AI Workbench’s performance is tightly coupled with the health of your Kubernetes stack, so it is important to allocate enough resources to handle your users’ workloads. Generally speaking, your system should contain at least 1 CPU, 1GB of RAM, and 5GB of disk space for each project session or deployment.

To install Workbench successfully, your systems must meet or exceed the requirements listed below. Anaconda has created a pre-installation checklist to help you prepare for installation. The checklist verifies that your cluster has the necessary resources reserved and is ready for the Workbench installation. Anaconda’s Implementation team will review the checklist with you prior to your installation.

You can initially install Workbench on up to five nodes. Once initial installation is complete, you can add or remove nodes as needed. Anaconda recommends having one master and one worker node per cluster. For more information, see Adding and removing nodes.

For historical information and details regarding Anaconda’s policies related to Gravity, see our Gravity update policy.

Hardware requirements#

Anaconda’s hardware recommendations ensure a reliable and performant Kubernetes cluster.

The following are minimum specifications for the master and worker nodes, as well as the entire cluster.

Master node minimum requirements:

  • CPU: 16 cores
  • RAM: 64GB
  • Disk space in /opt/anaconda: 500GB
  • Disk space in /var/lib/gravity: 300GB
  • Disk space in /tmp or $TMPDIR: 50GB

Note

  • Disk space reserved for /var/lib/gravity is utilized as additional space to accommodate upgrades. Anaconda recommends having this available during installation.

  • The /var/lib/gravity volume must be mounted on local storage. Core components of Kubernetes run from this directory, some of which are extremely intolerant of disk latency. Therefore, Network-Attached Storage (NAS) and Storage Area Network (SAN) solutions are not supported for this volume.

  • Disk space reserved for /opt/anaconda is utilized for project and package storage (including mirrored packages).

  • Anaconda recommends that you set up the /opt/anaconda and /var/lib/gravity partitions using Logical Volume Management (LVM) to provide the flexibility needed to accommodate future expansion (see the example after this list).

  • Currently, /opt and /opt/anaconda must be ext4 or xfs filesystems and cannot be NFS mountpoints. Subdirectories of /opt/anaconda may be mounted through NFS. For more information, see Mounting an external file share.
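The following is a minimal sketch of an LVM layout for /opt/anaconda, assuming a dedicated, empty data disk at /dev/sdb; the volume group and logical volume names are examples only. Adapt the device names and sizes to your environment, and repeat the pattern for /var/lib/gravity:

# Minimal LVM sketch (assumes an empty disk at /dev/sdb; names are examples)
sudo pvcreate /dev/sdb                                  # initialize the disk for LVM
sudo vgcreate vg_anaconda /dev/sdb                      # create a volume group on it
sudo lvcreate -n lv_opt_anaconda -L 500G vg_anaconda    # carve out a 500GB logical volume
sudo mkfs.ext4 /dev/vg_anaconda/lv_opt_anaconda         # format it (ext4 or xfs)
sudo mkdir -p /opt/anaconda
sudo mount /dev/vg_anaconda/lv_opt_anaconda /opt/anaconda
# Add a matching /etc/fstab entry so the mount persists across reboots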

Warning

Installations of Workbench that utilize an xfs filesystem must support d_type file labeling to work properly. To support d_type file labeling, set ftype=1 by running the following command prior to installing Workbench.

This command will erase all data on the specified device! Make sure you are targeting the correct device and that you have backed up any important data from it before proceeding.

mkfs.xfs -n ftype=1 <PATH/TO/YOUR/DEVICE>
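If you already have an xfs filesystem mounted (for example, at /opt/anaconda) and want to confirm its ftype setting without reformatting, you can inspect it non-destructively; ftype=1 in the output indicates d_type support:

xfs_info /opt/anaconda | grep ftype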

Worker node minimum requirements:

  • CPU: 16 cores
  • RAM: 64GB
  • Disk space in /var/lib/gravity: 300GB
  • Disk space in /tmp or $TMPDIR: 50GB

Note

When installing Workbench on a system with multiple nodes, verify that the clock of each node is in sync with the others prior to installation. Anaconda recommends using the Network Time Protocol (NTP) to synchronize computer system clocks automatically over a network. For step-by-step instructions, see How to Synchronize Time with Chrony NTP in Linux.
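As a quick sanity check on each node (assuming chrony is your NTP client), you can confirm that the clock is synchronized and inspect the current offset:

timedatectl status    # "System clock synchronized: yes" indicates NTP sync
chronyc tracking      # shows the current offset from the time source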

Disk IOPS requirements#

Master and worker nodes require a minimum of 3000 concurrent input/output operations per second (IOPS).

Note

Hard disk manufacturers report sequential IOPS, which differ from concurrent IOPS. On-premises installations require servers with disks that support a minimum of 50 sequential IOPS. Anaconda recommends using solid state drives (SSDs) or better.
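One way to spot-check disk performance before installation is a short random-write benchmark with fio (assuming fio is installed; the target directory, file size, and runtime here are examples only):

# Writes a temporary 1GB test file on the disk that will back /var/lib/gravity and reports random-write IOPS
sudo fio --name=iops-check --directory=/var/lib/gravity --size=1G \
  --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
  --runtime=30 --time_based --group_reporting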

Cloud performance requirements#

Requirements for running Workbench in the cloud relate to compute power and disk performance. Make sure your chosen cloud platform meets these minimum specifications:

  • Amazon Web Services (AWS): Anaconda recommends an instance type no smaller than m4.4xlarge for both master and worker nodes. You must have a minimum of 3000 IOPS.

  • Microsoft Azure: Anaconda recommends a VM size of Standard D16s v3 (16 vCPUs, 64GB memory).

  • Google Cloud Platform: There are no unique requirements for installing Workbench on Google Cloud Platform.

Operating system requirements#

Workbench currently supports the following Linux versions:

  • RHEL/CentOS 7.x, 8.x

  • Ubuntu 16.04

  • SUSE 12 SP2, 12 SP3, 12 SP5 (requires setting DefaultTasksMax=infinity in /etc/systemd/system.conf; see the sketch after this list)
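For the SUSE requirement, the following is a minimal sketch that sets DefaultTasksMax=infinity, assuming the setting already appears (commented out or not) in /etc/systemd/system.conf; if the line is missing entirely, append it instead:

sudo sed -i 's/^#\?DefaultTasksMax=.*/DefaultTasksMax=infinity/' /etc/systemd/system.conf
sudo systemctl daemon-reexec    # re-execute systemd so the manager picks up the change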

Caution

Some versions of the RHEL 8.4 AMI on AWS ship with a bad ip rule in combination with the NetworkManager service. Remove the bad rule and disable the NetworkManager service prior to installation.

Security requirements#

  • If your Linux system utilizes an antivirus scanner, make sure the scanner excludes the /var/lib/gravity volume from its security scans.

  • Installation requires that you have sudo access.

  • Nodes running CentOS or RHEL must have Security-Enhanced Linux (SELinux) set to either disabled or permissive mode in the /etc/selinux/config file.

    Tip

    Check the status of SELinux by running the following command:

    getenforce
    
    Configuring SELinux
    1. Open the /etc/selinux/config file using your preferred file editor.

    2. Find the line that starts with SELINUX= and set it to either disabled or permissive.

    3. Save and close the file.

    4. Reboot your system for changes to take effect.
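    Alternatively, a minimal sketch that makes the same change from the command line (assuming an SELINUX= line already exists in the file, which it does by default) and switches the running system to permissive mode without waiting for a reboot:

    sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config
    sudo setenforce 0    # applies permissive mode immediately; the config change persists across reboots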

Kernel module requirements#

Kubernetes relies on certain functionalities provided by the Linux kernel. The Workbench installer verifies that the following kernel modules (required for Kubernetes to function properly) are present, and notifies you if any are not loaded.

Linux distribution and required modules:

  • CentOS 7.2: bridge, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay
  • CentOS 7.3-7.7, 8.0: br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay
  • RHEL 7.2: bridge, ebtable_filter, ebtables, iptable_filter, iptable_nat
  • RHEL 7.3-7.7, 8.0: br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay
  • Ubuntu 16.04: br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay
  • SUSE 12 SP2, 12 SP3, 12 SP5: br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay

Module purposes:

  • bridge: Enables the Kubernetes iptables-based proxy to operate
  • br_netfilter: Enables the Kubernetes iptables-based proxy to operate
  • overlay: Enables the use of the overlay or overlay2 Docker storage driver
  • ebtable_filter: Allows a service to communicate back to itself via internal load balancing when necessary
  • ebtables: Allows a service to communicate back to itself via internal load balancing when necessary
  • iptable_filter: Ensures the firewall rules set up by Kubernetes function properly
  • iptable_nat: Ensures the firewall rules set up by Kubernetes function properly

Note

Verify a module is loaded by running the following command:

# Replace <MODULE_NAME> with a module name
lsmod | grep <MODULE_NAME>

If the command produces output, the module is loaded.

If necessary, run the following command to load a module:

# Replace <MODULE_NAME> with a module name
sudo modprobe <MODULE_NAME>

Caution

If your system does not load modules at boot, run the following command for each module to ensure it is loaded on every reboot:

# Replace <MODULE_NAME> with a module name
echo '<MODULE_NAME>' | sudo tee /etc/modules-load.d/<MODULE_NAME>.conf
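For convenience, the following sketch loads and persists the full module set required for CentOS/RHEL 7.3+, Ubuntu, and SUSE in one pass; adjust the list to match your distribution per the table above:

for mod in br_netfilter ebtable_filter ebtables iptable_filter iptable_nat overlay; do
  sudo modprobe "$mod"                                                # load the module now
  echo "$mod" | sudo tee /etc/modules-load.d/"$mod".conf > /dev/null  # load it on every boot
done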

System control settings#

Workbench requires the following Linux sysctl settings to function properly:

  • net.bridge.bridge-nf-call-iptables: Works with the bridge kernel module to ensure the Kubernetes iptables-based proxy operates
  • net.bridge.bridge-nf-call-ip6tables: Works with the bridge kernel module to ensure the Kubernetes iptables-based proxy operates
  • fs.may_detach_mounts: Allows an unmount operation to complete even if there are active references to the filesystem remaining
  • net.ipv4.ip_forward: Required for internal load balancing between servers to work properly
  • fs.inotify.max_user_watches: Set to 1048576 to improve cluster longevity

Note

If necessary, run the following command to enable a system control setting:

# Replace <SYSCTL_SETTING> with a system control setting
# (set fs.inotify.max_user_watches to 1048576 rather than 1)
sudo sysctl -w <SYSCTL_SETTING>=1

To persist system control settings across reboots, run the following for each setting:

# Replace <SYSCTL_SETTING> with a system control setting
echo "<SYSCTL_SETTING> = 1" | sudo tee /etc/sysctl.d/10-<SYSCTL_SETTING>.conf
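Alternatively, the following sketch writes all of the required settings to a single sysctl.d file (the file name 10-workbench.conf is an example) and applies them in one step; note that the net.bridge settings require the br_netfilter module to be loaded first:

sudo tee /etc/sysctl.d/10-workbench.conf > /dev/null <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
fs.may_detach_mounts = 1
net.ipv4.ip_forward = 1
fs.inotify.max_user_watches = 1048576
EOF
sudo sysctl --system    # reload settings from all sysctl configuration files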

GPU requirements#

Workbench requires that you install a supported version of the NVIDIA Compute Unified Device Architecture (CUDA) driver on the host OS of any GPU worker node.

Currently, Workbench supports the following CUDA driver versions:

  • CUDA 10.2

  • CUDA 11.2

  • CUDA 11.4

  • CUDA 11.6

Note

Notify your Anaconda Implementation team member which CUDA version you intend to use, so they can provide the correct installer.

You can obtain the driver you need in a few different ways:

  • Use your package manager or the NVIDIA runfile to download the driver directly.

  • For SLES, CentOS, and RHEL, you can get a supported driver using rpm (local) or rpm (network).

  • For Ubuntu, you can get a driver using deb (local) or deb (network).
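After installing the driver on a GPU worker node, you can confirm that it is loaded and check its version with nvidia-smi, which ships with the driver:

nvidia-smi --query-gpu=name,driver_version --format=csv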

GPU deployments should use one of the following models:

  • Tesla V100 (recommended)

  • Tesla P100 (adequate)

In theory, Workbench works with any GPU card compatible with the supported CUDA drivers, as long as the drivers are properly installed. Other cards supported by CUDA 11.6 include:

  • A-Series: NVIDIA A100, NVIDIA A40, NVIDIA A30, NVIDIA A10

  • RTX-Series: RTX 8000, RTX 6000, NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, NVIDIA T1000, NVIDIA T600, NVIDIA T400

  • HGX-Series: HGX A100, HGX-2

  • T-Series: Tesla T4

  • P-Series: Tesla P40, Tesla P6, Tesla P4

  • K-Series: Tesla K80, Tesla K520, Tesla K40c, Tesla K40m, Tesla K40s, Tesla K40st, Tesla K40t, Tesla K20Xm, Tesla K20m, Tesla K20s, Tesla K20c, Tesla K10, Tesla K8

  • M-Class: M60, M40 24GB, M40, M6, M4

Support for GPUs in Kubernetes is still a work in progress, and each cloud vendor provides different recommendations. For more information about GPUs, see Understanding GPUs.

Network requirements#

Workbench requires the following network ports to be externally accessible:

External ports:

  • 80 (TCP): Workbench UI (plaintext)
  • 443 (TCP): Workbench UI (encrypted)
  • 32009 (TCP): Operations Center Admin UI

These ports need to be externally accessible during installation only, and can be closed after completing the install process:

Install ports:

  • 4242 (TCP): Bandwidth checker utility
  • 61009 (TCP): Install wizard UI access, required during cluster installation
  • 61008, 61010, 61022-61024 (TCP): Installer agent ports

The following ports are used for cluster operation, and must be open internally, between cluster nodes:

Cluster communication ports:

  • 53 (TCP and UDP): Internal cluster DNS
  • 2379, 2380, 4001, 7001 (TCP): Etcd server communication
  • 3008-3012 (TCP): Internal Workbench service
  • 3022-3025 (TCP): Teleport internal SSH control panel
  • 3080 (TCP): Teleport Web UI
  • 5000 (TCP): Docker registry
  • 6443 (TCP): Kubernetes API server
  • 6990 (TCP): Internal Workbench service
  • 7496, 7373 (TCP): Peer-to-peer health check
  • 7575 (TCP): Cluster status gRPC API
  • 8081, 8086-8091, 8095 (TCP): Internal Workbench service
  • 8472 (UDP): Overlay network
  • 9080, 9090, 9091 (TCP): Internal Workbench service
  • 10248-10250, 10255 (TCP): Kubernetes components
  • 30000-32767 (TCP): Kubernetes internal services range

Make sure the firewall is permanently configured to keep the required ports open and to persist these settings across reboots. Then restart the firewall to load your changes.

Tip

There are various tools you can use to configure firewalls and open the required ports, including iptables, firewall-cmd, SuSEfirewall2, and others.
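For example, a minimal sketch using firewalld to open the externally accessible ports from the table above (adjust for the firewall tool your distribution uses):

sudo firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp --add-port=32009/tcp
sudo firewall-cmd --reload    # apply the permanent rules to the running firewall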

You’ll also need to update your firewall settings to ensure that the 10.244.0.0/16 pod subnet and 10.100.0.0/16 service subnet are accessible to every node in the cluster, and grant all nodes the ability to communicate via their primary interface.

For example, if you’re using iptables:

# Replace <NODE_IP> with the internal IP address(es) used by all nodes in the cluster to connect to the master node
iptables -A INPUT -s 10.244.0.0/16 -j ACCEPT
iptables -A INPUT -s 10.100.0.0/16 -j ACCEPT
iptables -A INPUT -s <NODE_IP> -j ACCEPT

If you plan to use online package mirroring, allowlist the following domains in your network’s firewall settings:

  • repo.anaconda.com

  • anaconda.org

  • conda.anaconda.org

  • binstar-cio-packages-prod.s3.amazonaws.com

To use Workbench in conjunction with Anaconda Navigator in online mode, allowlist the following sites in your network’s firewall settings as well:

TLS/SSL certificate requirements#

Workbench uses certificates to provide transport layer security for the cluster. Self-signed certificates are generated during the initial installation. Once installation is complete, you can configure the platform to use your organizational TLS/SSL certificates.

You can purchase certificates commercially, or generate them using your organization’s internal public key infrastructure (PKI) system. When using an internal PKI-signed setup, the CA certificate is inserted into the Kubernetes secret.

In either case, the configuration will include the following:

  • A certificate for the root certificate authority (CA)

  • An intermediate certificate chain

  • A server certificate

  • A certificate private key

For more information about TLS/SSL certificates, see Updating TLS/SSL certificates.

DNS requirements#

Workbench assigns unique URL addresses to deployments by combining a dynamically generated universally unique identifier (UUID) with your organization’s domain name, like this: https://uuid001.anaconda.yourdomain.com.

This requires the use of wildcard DNS entries that apply to a set of domain names such as *.anaconda.yourdomain.com.

For example, if you are using the domain name anaconda.yourdomain.com with a master node IP address of 12.34.56.78, the DNS entries would be as follows:

anaconda.yourdomain.com IN A 12.34.56.78
*.anaconda.yourdomain.com IN A 12.34.56.78

Note

The wildcard subdomain’s DNS entry points to the Workbench master node.

The master node’s hostname and the wildcard domains must be resolvable with DNS from the master node, worker nodes, and end-user machines. To ensure the master node can resolve its own hostname, distribute any /etc/hosts entries to the Gravity environment.
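A quick way to confirm that the records resolve correctly (assuming the dig utility is available, and using the example domain and address above):

dig +short anaconda.yourdomain.com            # should return 12.34.56.78
dig +short uuid001.anaconda.yourdomain.com    # any wildcard subdomain should return the same address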

Caution

If dnsmasq is installed on the master node or any worker nodes, you’ll need to remove it from all nodes prior to installing Workbench.

Verify dnsmasq is disabled by running the following command:

sudo systemctl status dnsmasq

If necessary, stop and disable dnsmasq by running the following commands:

sudo systemctl stop dnsmasq
sudo systemctl disable dnsmasq

Browser requirements#

Workbench supports the following web browsers:

  • Chrome 39+

  • Firefox 49+

  • Safari 10+

The minimum browser screen size for using the platform is 800 pixels wide and 600 pixels high.

Verifying system requirements#

The installer performs pre-installation checks and only allows installation to continue on nodes that are configured correctly and include the required kernel modules. If you want to perform the system check yourself prior to installation, you can run the following commands from the installer directory, ~/anaconda-enterprise-<VERSION>, on your intended master and worker nodes:

To perform system checks on the master node, run the following command as sudo or root user:

sudo ./gravity check --profile ae-master

To perform system checks on a worker node, run the following command as sudo or root user:

sudo ./gravity check --profile ae-worker

If all of the system checks pass and all requirements are met, the output from the above commands will be empty. If the system checks fail and some requirements are not met, the output will indicate which system checks failed.

Pre-installation checklist#

Anaconda has created this pre-installation checklist to help you verify that you have properly prepared your environment prior to installation. You can run the system verification checks to automatically verify many of the requirements for you.

Caution

System verification checks are not comprehensive, so make sure you manually verify the remaining requirements.

Gravity pre-installation checklist

All nodes in the cluster meet the minimum or recommended specifications for CPU, RAM, and disk space.

All nodes in the cluster meet the minimum IOPS required for reliable performance.

All cluster nodes are running the same version of the OS, and that version is supported by Workbench.

NTP is being used to synchronize computer system clocks, and all nodes are in sync.

The user account performing the installation has sudo access on all nodes and is not a root user.

All required kernel modules are loaded.

The sysctl settings are configured correctly.

Any GPUs to be used with Workbench have a supported NVIDIA CUDA driver installed.

The system meets all network port requirements, whether the specified ports need to be open internally, externally, or during installation only.

The firewall is configured correctly, and any rules designed to limit traffic have been temporarily disabled until Workbench is installed and verified.

If necessary, the domains required for online package mirroring have been allowlisted.

The final TLS/SSL certificates to be installed with Workbench have been obtained, including the private keys.

The Workbench A or CNAME domain record is fully operational, and points to the IP address of the master node.

The wildcard DNS entry for Workbench is also fully operational, and points to the IP address of the master node. For more information, see DNS requirements.

The /etc/resolv.conf file on all the nodes does not include the rotate option.

Any existing installations of Docker (and dockerd), dnsmasq, and lxd have been removed from all nodes, as they will conflict with Workbench.

All web browsers to be used to access Workbench are supported by the platform.