Preparing a Gravity environment for Workbench#
Determining the resource requirements for a Kubernetes cluster depends on a number of factors, including the types of applications you plan to run, the number of users active at once, and the workloads you will manage within the cluster. Data Science & AI Workbench’s performance is tightly coupled with the health of your Kubernetes stack, so it is important to allocate enough resources to handle your users’ workloads. Generally speaking, your system should contain at least 1 CPU, 1GB of RAM, and 5GB of disk space for each project session or deployment.
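For example, a cluster expected to support 20 concurrent project sessions or deployments should have at least 20 CPUs, 20GB of RAM, and 100GB of disk space available for user workloads, in addition to the baseline node requirements listed below.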
To install Workbench successfully, your systems must meet or exceed the requirements listed below. Anaconda has created a pre-installation checklist to help you prepare for installation. The checklist verifies that your cluster has the necessary resources reserved and is ready for a Workbench installation. Anaconda’s Implementation team will review the checklist with you prior to your installation.
You can initially install Workbench on up to five nodes. Once initial installation is complete, you can add or remove nodes as needed. Anaconda recommends having one master and one worker node per cluster. For more information, see Adding and removing nodes.
For historical information and details regarding Anaconda’s policies related to Gravity, see our Gravity update policy.
Hardware requirements#
Anaconda’s hardware recommendations ensure a reliable and performant Kubernetes cluster.
The following are minimum specifications for the master and worker nodes, as well as the entire cluster.
| Master node | Minimum |
|---|---|
| CPU | 16 cores |
| RAM | 64GB |
| Disk space in /opt/anaconda | 500GB |
| Disk space in /var/lib/gravity | 300GB |
| Disk space in /tmp or $TMPDIR | 50GB |
Note
- Disk space reserved for /var/lib/gravity is utilized as additional space to accommodate upgrades. Anaconda recommends having this available during installation.
- The /var/lib/gravity volume must be mounted on local storage. Core components of Kubernetes run from this directory, some of which are extremely intolerant of disk latency. Therefore, Network-Attached Storage (NAS) and Storage Area Network (SAN) solutions are not supported for this volume.
- Disk space reserved for /opt/anaconda is utilized for project and package storage (including mirrored packages).
- Anaconda recommends that you set up the /opt/anaconda and /var/lib/gravity partitions using Logical Volume Management (LVM) to provide the flexibility needed to accommodate easier future expansion.
- Currently, /opt and /opt/anaconda must be an ext4 or xfs filesystem, and cannot be an NFS mountpoint. Subdirectories of /opt/anaconda may be mounted through NFS. For more information, see Mounting an external file share.
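If you follow the LVM recommendation above, the general approach is to create a volume group on a dedicated device and carve logical volumes out of it for /opt/anaconda and /var/lib/gravity. The device and volume names below are hypothetical examples; adjust the names and sizes for your environment:
# Hypothetical device (/dev/sdb) and volume group/logical volume names
sudo pvcreate /dev/sdb
sudo vgcreate anaconda-vg /dev/sdb
sudo lvcreate -L 500G -n opt-anaconda anaconda-vg
sudo lvcreate -L 300G -n var-lib-gravity anaconda-vg
sudo mkfs.xfs -n ftype=1 /dev/anaconda-vg/opt-anaconda
sudo mkfs.xfs -n ftype=1 /dev/anaconda-vg/var-lib-gravity
sudo mkdir -p /opt/anaconda /var/lib/gravity
sudo mount /dev/anaconda-vg/opt-anaconda /opt/anaconda
sudo mount /dev/anaconda-vg/var-lib-gravity /var/lib/gravity
# Add matching entries to /etc/fstab so the mounts persist across reboots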
Warning
Installations of Workbench that utilize an xfs filesystem must support d_type file labeling to work properly. To enable d_type file labeling, set ftype=1 by running the following command prior to installing Workbench.
This command will erase all data on the specified device! Make sure you are targeting the correct device and that you have backed up any important data from it before proceeding.
mkfs.xfs -n ftype=1 <PATH/TO/YOUR/DEVICE>
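If the target filesystem already exists and you do not want to reformat it, you can check whether it was created with ftype=1 using xfs_info. The mount point below is an example:
xfs_info /opt/anaconda | grep ftype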
| Worker node | Minimum |
|---|---|
| CPU | 16 cores |
| RAM | 64GB |
| Disk space in /var/lib/gravity | 300GB |
| Disk space in /tmp or $TMPDIR | 50GB |
Note
When installing Workbench on a system with multiple nodes, verify that the clock of each node is in sync with the others prior to installation. Anaconda recommends using the Network Time Protocol (NTP) to synchronize computer system clocks automatically over a network. For step-by-step instructions, see How to Synchronize Time with Chrony NTP in Linux.
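As one possible approach on RHEL/CentOS systems, you can install and enable chrony and then confirm that the node’s clock is synchronized. Package and service names may differ on other distributions:
sudo yum install -y chrony
sudo systemctl enable --now chronyd
# Reports the current time source and offset; "Leap status: Normal" indicates a healthy sync
chronyc tracking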
Disk IOPS requirements#
Master and worker nodes require a minimum of 3000 concurrent Input/Output Operations Per Second (IOPS).
Note
Hard disk manufacturers report sequential IOPS, which differ from concurrent IOPS. On-premises installations require servers with disks that support a minimum of 50 sequential IOPS. Anaconda recommends using Solid State Drives (SSDs) or better.
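One way to confirm a disk can sustain the required IOPS is to benchmark it with a tool such as fio (not part of the Workbench tooling; shown here only as an illustrative example). The test file path, size, and runtime below are assumptions you can adjust:
# Mixed random read/write test against the volume that will back /var/lib/gravity
fio --name=iops-test --filename=/var/lib/gravity/fio-test --size=1G \
    --rw=randrw --bs=4k --ioengine=libaio --iodepth=64 --direct=1 \
    --runtime=60 --time_based --group_reporting
# Remove the test file when finished
rm /var/lib/gravity/fio-test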
Cloud performance requirements#
Requirements for running Workbench in the cloud relate to compute power and disk performance. Make sure your chosen cloud platform meets these minimum specifications:
- Amazon Web Services (AWS): Anaconda recommends an instance type no smaller than m4.4xlarge for both master and worker nodes. You must have a minimum of 3000 IOPS.
- Microsoft Azure: Anaconda recommends a VM size of Standard D16s v3 (16 vCPUs, 64GB memory).
- Google Cloud Platform (GCP): There are no unique requirements for installing Workbench on the Google Cloud Platform.
Operating system requirements#
Workbench currently supports the following Linux versions:
- RHEL/CentOS 7.x, 8.x
- Ubuntu 16.04
- SUSE 12 SP2, 12 SP3, 12 SP5 (requires setting DefaultTasksMax=infinity in /etc/systemd/system.conf)
Caution
Some versions of the RHEL 8.4 AMI on AWS ship with a bad ip rule that, in combination with the NetworkManager service, causes problems. Remove the bad rule and disable the NetworkManager service prior to installation.
Security requirements#
- If your Linux system utilizes an antivirus scanner, make sure the scanner excludes the /var/lib/gravity volume from its security scans.
- Installation requires that you have sudo access.
- Nodes running CentOS or RHEL must have Security-Enhanced Linux (SELinux) set to either disabled or permissive mode in the /etc/selinux/config file.
Tip
Check the status of SELinux by running the following command:
getenforce
Configuring SELinux
1. Open the /etc/selinux/config file using your preferred file editor.
2. Find the line that starts with SELINUX= and set it to either disabled or permissive.
3. Save and close the file.
4. Reboot your system for changes to take effect.
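If you prefer to make the change from the command line instead of a text editor, a one-line equivalent of steps 1 and 2 (using permissive as the target mode) could look like this:
sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config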
Kernel module requirements#
Kubernetes relies on certain functionalities provided by the Linux kernel. The Workbench installer verifies that the following kernel modules (required for Kubernetes to function properly) are present, and notifies you if any are not loaded.
| Linux Distribution | Version | Required Modules |
|---|---|---|
| CentOS | 7.2 | bridge, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay |
| CentOS | 7.3-7.7, 8.0 | br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay |
| RedHat Linux | 7.2 | bridge, ebtable_filter, ebtables, iptable_filter, iptable_nat |
| RedHat Linux | 7.3-7.7, 8.0 | br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay |
| Ubuntu | 16.04 | br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay |
| SUSE | 12 SP2, 12 SP3, 12 SP5 | br_netfilter, ebtable_filter, ebtables, iptable_filter, iptable_nat, overlay |
| Module Name | Purpose |
|---|---|
| bridge | Enables the Kubernetes iptables-based proxy to operate |
| br_netfilter | Enables the Kubernetes iptables-based proxy to operate |
| overlay | Enables the use of the overlay or overlay2 Docker storage driver |
| ebtable_filter | Allows a service to communicate back to itself via internal load balancing when necessary |
| ebtables | Allows a service to communicate back to itself via internal load balancing when necessary |
| iptable_filter | Ensures the firewall rules set up by Kubernetes function properly |
| iptable_nat | Ensures the firewall rules set up by Kubernetes function properly |
Note
Verify a module is loaded by running the following command:
# Replace <MODULE_NAME> with a module name
lsmod | grep <MODULE_NAME>
If the command produces output, the module is loaded.
If necessary, run the following command to load a module:
# Replace <MODULE_NAME> with a module name
sudo modprobe <MODULE_NAME>
Caution
If your system does not load modules at boot, run the following command for each module to ensure it is loaded on every reboot:
# Replace <MODULE_NAME> with a module name
echo '<MODULE_NAME>' | sudo tee /etc/modules-load.d/<MODULE_NAME>.conf
System control settings#
Workbench requires the following Linux sysctl settings to function properly:
| sysctl setting | Purpose |
|---|---|
| net.bridge.bridge-nf-call-iptables | Communicates with the bridge kernel module to ensure the Kubernetes iptables-based proxy operates |
| net.bridge.bridge-nf-call-ip6tables | Communicates with the bridge kernel module to ensure the Kubernetes iptables-based proxy operates |
| fs.may_detach_mounts | Allows the unmount operation to complete even if there are active references to the filesystem remaining |
| net.ipv4.ip_forward | Required for internal load balancing between servers to work properly |
| fs.inotify.max_user_watches | Set to 1048576 to improve cluster longevity |
Note
If necessary, run the following command to enable a system control setting:
# Replace <SYSCTL_SETTING> with a system control setting
sudo sysctl -w <SYSCTL_SETTING>=1
To persist system settings on boot, run the following command for each setting:
# Replace <SYSCTL_SETTING> with a system control setting
echo "<SYSCTL_SETTING> = 1" | sudo tee /etc/sysctl.d/10-<SYSCTL_SETTING>.conf
GPU requirements#
Workbench requires that you install a supported version of the NVIDIA Compute Unified Device Architecture (CUDA) driver on the host OS of any GPU worker node.
Currently, Workbench supports the following CUDA driver versions:
- CUDA 10.2
- CUDA 11.2
- CUDA 11.4
- CUDA 11.6
Note
Notify your Anaconda Implementation team member which CUDA version you intend to use, so they can provide the correct installer.
You can obtain the driver you need in a few different ways:
- Use the package manager or the NVIDIA runfile to download the driver directly.
- For SLES, CentOS, and RHEL, you can get a supported driver using rpm (local) or rpm (network).
- For Ubuntu, you can get a driver using deb (local) or deb (network).
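Once a driver is installed on a GPU worker node, you can confirm that it loaded correctly and check which CUDA version it supports. The reported CUDA version should be one of the supported versions listed above:
# The driver version and supported CUDA version appear in the header of the output
nvidia-smi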
GPU deployments should use one of the following models:
Tesla V100 (recommended)
Tesla P100 (adequate)
Theoretically, Workbench will work with any GPU card compatible with the CUDA drivers, as long as the drivers are properly installed. Other cards supported by CUDA 11.6:
A-Series: NVIDIA A100, NVIDIA A40, NVIDIA A30, NVIDIA A10
RTX-Series: RTX 8000, RTX 6000, NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, NVIDIA T1000, NVIDIA T600, NVIDIA T400
HGX-Series: HGX A100, HGX-2
T-Series: Tesla T4
P-Series: Tesla P40, Tesla P6, Tesla P4
K-Series: Tesla K80, Tesla K520, Tesla K40c, Tesla K40m, Tesla K40s, Tesla K40st, Tesla K40t, Tesla K20Xm, Tesla K20m, Tesla K20s, Tesla K20c, Tesla K10, Tesla K8
M-Class: M60, M40 24GB, M40, M6, M4
Support for GPUs in Kubernetes is still a work in progress, and each cloud vendor provides different recommendations. For more information about GPUs, see Understanding GPUs.
Network requirements#
Workbench requires the following network ports to be externally accessible:
External ports

| Port | Protocol | Description |
|---|---|---|
| 80 | TCP | Workbench UI (plaintext) |
| 443 | TCP | Workbench UI (encrypted) |
| 32009 | TCP | Operations Center Admin UI |
These ports need to be externally accessible during installation only, and can be closed after completing the install process:
Install ports

| Port | Protocol | Description |
|---|---|---|
| 4242 | TCP | Bandwidth checker utility |
| 61009 | TCP | Install wizard UI access required during cluster installation |
| 61008, 61010, 61022-61024 | TCP | Installer agent ports |
The following ports are used for cluster operation, and must be open internally, between cluster nodes:
Cluster communication ports

| Port | Protocol | Description |
|---|---|---|
| 53 | TCP and UDP | Internal cluster DNS |
| 2379, 2380, 4001, 7001 | TCP | Etcd server communication |
| 3008-3012 | TCP | Internal Workbench service |
| 3022-3025 | TCP | Teleport internal SSH control panel |
| 3080 | TCP | Teleport Web UI |
| 5000 | TCP | Docker registry |
| 6443 | TCP | Kubernetes API server |
| 6990 | TCP | Internal Workbench service |
| 7496, 7373 | TCP | Peer-to-peer health check |
| 7575 | TCP | Cluster status gRPC API |
| 8081, 8086-8091, 8095 | TCP | Internal Workbench service |
| 8472 | UDP | Overlay network |
| 9080, 9090, 9091 | TCP | Internal Workbench service |
| 10248-10250, 10255 | TCP | Kubernetes components |
| 30000-32767 | TCP | Kubernetes internal services range |
Make sure the firewall is configured to keep the required ports open and to persist these settings across reboots. Then restart the firewall to load your changed settings.
Tip
There are various tools you can use to configure firewalls and open the required ports, including iptables, firewall-cmd, SuSEfirewall2, and others.
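As one illustration, on systems that use firewalld you could open the externally required ports permanently and then reload the firewall. The same pattern applies to the install-time and internal cluster ports your topology requires:
sudo firewall-cmd --permanent --add-port=80/tcp
sudo firewall-cmd --permanent --add-port=443/tcp
sudo firewall-cmd --permanent --add-port=32009/tcp
sudo firewall-cmd --reload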
You’ll also need to update your firewall settings to ensure that the 10.244.0.0/16 pod subnet and 10.100.0.0/16 service subnet are accessible to every node in the cluster, and grant all nodes the ability to communicate via their primary interface.
For example, if you’re using iptables:
# Replace <NODE_IP> with the internal IP address(es) used by all nodes in the cluster to connect to the master node
iptables -A INPUT -s 10.244.0.0/16 -j ACCEPT
iptables -A INPUT -s 10.100.0.0/16 -j ACCEPT
iptables -A INPUT -s <NODE_IP> -j ACCEPT
If you plan to use online package mirroring, allowlist the following domains in your network’s firewall settings:
repo.anaconda.com
anaconda.org
conda.anaconda.org
binstar-cio-packages-prod.s3.amazonaws.com
To use Workbench in conjunction with Anaconda Navigator in online mode, allowlist the following sites in your network’s firewall settings as well:
https://repo.anaconda.com (or for older versions of Navigator and conda)
https://conda.anaconda.org if any users will use conda-forge and other channels on Anaconda.org
google-public-dns-a.google.com (8.8.8.8:53) to check internet connectivity with Google Public DNS
TLS/SSL certificate requirements#
Workbench uses certificates to provide transport layer security for the cluster. Self-signed certificates are generated during the initial installation. Once installation is complete, you can configure the platform to use your organizational TLS/SSL certificates.
You can purchase certificates commercially, or generate them using your organization’s internal public key infrastructure (PKI) system. When using an internal PKI-signed setup, the CA certificate is inserted into the Kubernetes secret.
In either case, the configuration will include the following:
A certificate for the root certificate authority (CA)
An intermediate certificate chain
A server certificate
A certificate private key
For more information about TLS/SSL certificates, see Updating TLS/SSL certificates.
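Before configuring Workbench with your certificates, it can be useful to confirm that the server certificate chains to your CA and matches its private key. The file names below are example placeholders:
# Verify the server certificate against the root CA and intermediate chain
openssl verify -CAfile rootca.pem -untrusted intermediate.pem server.crt
# Confirm the certificate and (RSA) private key share the same modulus
openssl x509 -noout -modulus -in server.crt | openssl md5
openssl rsa -noout -modulus -in server.key | openssl md5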
DNS requirements#
Workbench assigns unique URL addresses to deployments by combining a dynamically generated universally unique identifier (UUID) with your organization’s domain name, like this: https://uuid001.anaconda.yourdomain.com.
This requires the use of wildcard DNS entries that apply to a set of domain names, such as *.anaconda.yourdomain.com.
For example, if you are using the domain name anaconda.yourdomain.com with a master node IP address of 12.34.56.78, the DNS entries would be as follows:
anaconda.yourdomain.com IN A 12.34.56.78
*.anaconda.yourdomain.com IN A 12.34.56.78
Note
The wildcard subdomain’s DNS entry points to the Workbench master node.
The master node’s hostname and the wildcard domains must be resolvable with DNS from the master nodes, worker nodes, and the end user machines. To ensure the master node can resolve its own hostname, distribute any /etc/hosts entries to the Gravity environment.
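You can spot-check name resolution from any node or end-user machine with a DNS lookup tool such as dig. The domain and address follow the example entries above; both queries should return the master node IP address:
dig +short anaconda.yourdomain.com
dig +short uuid001.anaconda.yourdomain.com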
Caution
If dnsmasq is installed on the master node or any worker nodes, you’ll need to remove it from all nodes prior to installing Workbench.
Verify that dnsmasq is disabled by running the following command:
sudo systemctl status dnsmasq
If necessary, stop and disable dnsmasq by running the following commands:
sudo systemctl stop dnsmasq
sudo systemctl disable dnsmasq
Browser requirements#
Workbench supports the following web browsers:
Chrome 39+
Firefox 49+
Safari 10+
The minimum browser screen size for using the platform is 800 pixels wide and 600 pixels high.
Verifying system requirements#
The installer performs pre-installation checks, and only allows installation to continue on nodes that are configured correctly and include the required kernel modules. If you want to perform the system check yourself prior to installation, you can run the following commands from the installer directory, ~/anaconda-enterprise-<VERSION>, on your intended master and worker nodes:
To perform system checks on the master node, run the following command as a sudo or root user:
sudo ./gravity check --profile ae-master
To perform system checks on a worker node, run the following command as a sudo or root user:
sudo ./gravity check --profile ae-worker
If all of the system checks pass and all requirements are met, the output from the above commands will be empty. If the system checks fail and some requirements are not met, the output will indicate which system checks failed.
Pre-installation checklist#
Anaconda has created this pre-installation checklist to help you verify that you have properly prepared your environment prior to installation. You can run the system verification checks to automatically verify many of the requirements for you.
Caution
System verification checks are not comprehensive, so make sure you manually verify the remaining requirements.
Gravity pre-installation checklist
- All nodes in the cluster meet the minimum or recommended specifications for CPU, RAM, and disk space.
- All nodes in the cluster meet the minimum IOPS required for reliable performance.
- All cluster nodes are running the same version of the OS, and that OS version is supported by Workbench.
- NTP is being used to synchronize computer system clocks, and all nodes are in sync.
- The user account performing the installation has sudo access on all nodes and is not the root user.
- All required kernel modules are loaded.
- The sysctl settings are configured correctly.
- Any GPUs to be used with Workbench have a supported NVIDIA CUDA driver installed.
- The system meets all network port requirements, whether the specified ports need to be open internally, externally, or during installation only.
- The firewall is configured correctly, and any rules designed to limit traffic have been temporarily disabled until Workbench is installed and verified.
- If necessary, the domains required for online package mirroring have been allowlisted.
- The final TLS/SSL certificates to be installed with Workbench have been obtained, including the private keys.
- The Workbench A or CNAME domain record is fully operational and points to the IP address of the master node.
- The wildcard DNS entry for Workbench is also fully operational and points to the IP address of the master node. For more information, see the DNS requirements section above.
- The /etc/resolv.conf file on all nodes does not include the rotate option.
- Any existing installations of Docker (and dockerd), dnsmasq, and lxd have been removed from all nodes, as they will conflict with Workbench.
- All web browsers to be used to access Workbench are supported by the platform.