Understanding Workbench system requirements#

Data Science & AI Workbench is a DS/AI/ML development and deployment platform built on top of the industry standard Kubernetes container orchestration system. Workbench leverages Kubernetes to create and manage sessions, deployments, and jobs. During normal operation, users and administrators are insulated from this underlying complexity. Workbench is truly at its best when running on a stable, performant Kubernetes cluster.

When issues arise that compromise the operation of Kubernetes, on the other hand, the operation of Workbench itself suffers. We have found that it is helpful to share with our customers more detail about the system requirements that Kubernetes demands. By doing so, we hope to clarify and motivate our documented requirements, and to help customers appreciate why the implementation process requires precision, persistence, and patience.

For our Gravity-based customers, the installation process hides much of that complexity. The main step in the installation process—the execution of the sudo ./gravity install command—is actually performing the following steps:

Perform a variety of pre-flight checks to verify the satisfaction of important system requirements
Install and configure Docker
Install and configure Planet, a containerized implementation of Kubernetes bundled with a set of custom cluster management tools
Install Helm, an industry standard tool for installing Kubernetes applications
Load the Workbench container images into the internal Docker registry
Use a standard Helm process to install the Workbench application
Run final Anaconda-specific application configuration tasks

Enumerating these steps helps to illustrate just how “thin” the steps specific to Workbench truly are. The bulk of the Gravity implementation effort involves the construction of a stable, performant Kubernetes cluster.

Hardware considerations#

CPU and Memory: node considerations#

In our Gravity implementation, the primary node—where the central Kubernetes services and Workbench system containers run—does not host user workloads. Our standard recommendation of 16 cores and 64GB RAM provides ample headroom to ensure the correct operation of these functions.

For worker nodes, where user workloads (sessions, deployments, and jobs) are scheduled, the most important quantities are the total number of cores and total amount of RAM across all worker nodes. That said, nodes with more cores and RAM are better than smaller nodes for two reasons: first, because they allow the aggregate user workload to be accommodated with less total hardware; and second, because they allow the system to accommodate truly large-memory workloads when necessary.

CPU and Memory: user workloads#

Do not compromise the compute resources offered to your users.

A decent data science laptop today ships with 6 cores and 16GB of RAM. While some of these resources are consumed by the operating system and other processes, the user's data science workloads are free to consume the vast majority.

Not all users are likely to be active at any given time, so it is not necessary to mirror this allocation on a 1:1 basis on your Workbench cluster. Kubernetes supports the notion of oversubscription, enabling CPU and memory allocations to exceed 100%. If we adopt a relatively standard oversubscription ratio of 4:1, we still need 75 cores and 200GB of RAM to support 50 users. Rounding that down to 64 cores and 192GB of RAM seems reasonable, at least to start. Economizing further will come at the cost of productivity—and additional resources tend to be significantly less expensive than the data scientists who will use them!
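The sizing arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function name `cluster_capacity` and its parameters are hypothetical, and the defaults simply restate the laptop figures (6 cores, 16GB) and the 4:1 ratio used in this section.

```python
import math


def cluster_capacity(users, cores_per_user=6, ram_gb_per_user=16, oversubscription=4):
    """Estimate the aggregate worker-node capacity needed for `users` users.

    Each user is assumed to deserve a laptop-equivalent allocation, which is
    then divided by the oversubscription ratio because not all users are
    active at once.
    """
    cores = math.ceil(users * cores_per_user / oversubscription)
    ram_gb = math.ceil(users * ram_gb_per_user / oversubscription)
    return cores, ram_gb


print(cluster_capacity(50))  # (75, 200)
```

Changing `users` or `oversubscription` shows how quickly the required capacity moves; the result is a starting point for discussion, not a substitute for the documented requirements.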

The laptop comparison is imperfect in a very important respect. On a laptop, swap space can be employed to temporarily allow memory consumption in excess of the physical limit. Not so with Kubernetes: a process will be terminated if it exceeds its memory limit, likely resulting in a loss of work. This further emphasizes the need to ensure that users are given a generous memory limit, determined not by their average usage, but rather their peak.

Some installations operate with just a single node—serving both control plane and user workload functions. For installations with a small number of simultaneous users, this is a feasible approach, as long as the node is sized aggressively—say, 64 cores and 256GB RAM.

VM QoS / Oversubscription / Overcommitment#

Many on-premise data centers employ virtualization technology such as VMware to better manage compute resources. A common practice in such scenarios is oversubscription: the ability to schedule a greater number of virtual CPUs (vCPUs) than the number of physical CPUs (pCPUs) present on the system. Oversubscription is an essential component of cost-effective virtual machine management, since machines rarely see constant 100% usage levels.

Unfortunately, this approach is not necessarily compatible with Kubernetes. Kubernetes employs its own resource management strategy, including a notion of oversubscription. Our recommended practice for Workbench is to employ a ratio of 4:1 for user workloads. If this were compounded with, say, a 4:1 ratio at the virtual machine level, the true overcommitment level would be closer to 16:1. With no control over the other workloads sharing the same physical cores, there is a real risk of sporadic performance loss that impacts overall cluster health.
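The compounding effect is simple multiplication, as the following minimal sketch shows. The helper name `effective_overcommit` is hypothetical and exists only to illustrate the reasoning.

```python
def effective_overcommit(*layer_ratios):
    """Multiply per-layer oversubscription ratios into one end-to-end ratio.

    Each argument is the ratio applied at one layer, for example the
    hypervisor (vCPU:pCPU) and Kubernetes (limits:requests).
    """
    result = 1.0
    for ratio in layer_ratios:
        if ratio < 1:
            raise ValueError("an oversubscription ratio must be at least 1")
        result *= ratio
    return result


print(effective_overcommit(4, 4))  # 16.0  (VM layer 4:1, Kubernetes 4:1)
print(effective_overcommit(1, 4))  # 4.0   (guaranteed VM class, Kubernetes 4:1)
```

The second call corresponds to letting Kubernetes manage oversubscription by itself, with no oversubscription at the VM layer.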

For this reason, we strongly recommend that any virtual machine intended to serve as a Workbench node be assigned to a guaranteed service class that ensures that its CPU and memory reservations are fully honored, with no oversubscription at the VM level. Allow the Kubernetes layer to manage oversubscription exclusively.

GPUs#

One of the more challenging aspects of implementation is the enablement of GPU computation within Workbench. It is our view that NVidia is still in the process of maturing their “story” around the use of GPUs within Docker containers in general and Kubernetes in particular. As of February 2022, the official Kubernetes documentation about GPU scheduling marks it as an “experimental” feature.

In our experience, customers can be successful deploying GPUs in Workbench. Workbench ships a standard CUDA 11 library in user-facing containers, and the underlying Planet implementation is built with NVidia support components. That said, our experience leads us to offer these cautions proactively.

GPUs cannot be shared between sessions, deployments, and jobs. That means that if a user launches a session with a GPU resource profile, that GPU is reserved for their container, even if it is idle.
Not all versions of the NVidia driver set are compatible with the GPU container runtime.
For some versions of the NVidia drivers, some manual rearrangement of the installed driver files is required in order for the Gravity/Planet container to “find” them.

In short, the enablement of GPUs within Workbench is a challenge, but one that many customers have nevertheless found worthwhile.

Operating system#

In this section, we highlight a number of the important considerations for our Gravity-based offering. For BYOK8s customers, these types of concerns are likely “baked-in” to the general objective of standing up a performant cluster. However, customers who build their own on-premise Kubernetes clusters will likely encounter similar concerns.

Kernel modules and settings#

The system requirements provide sufficient detail on the kernel modules and other OS settings required to ensure effective operation of the Kubernetes layer. A common mistake is the failure to ensure that these settings are preserved upon reboot—so the cluster operates without incident until a system modification forces a reboot. System management software (see below) can often prevent these settings from persisting properly.
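One way to catch settings that did not survive a reboot is to compare the loaded kernel modules against the required list after each restart. The sketch below parses the contents of `/proc/modules`; the module names in `REQUIRED` are placeholders chosen for illustration, not the list from the Workbench system requirements, and `missing_modules` is a hypothetical helper.

```python
def missing_modules(proc_modules_text, required):
    """Return the required module names that are absent from /proc/modules.

    The first whitespace-separated field of each /proc/modules line is the
    module name. Modules compiled into the kernel do not appear in that file,
    so a name reported here deserves a manual check before raising an alarm.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(set(required) - loaded)


# Placeholder list: take the real one from the Workbench system requirements.
REQUIRED = ["br_netfilter", "overlay"]

# Sample content; on a real node, read it with open("/proc/modules").read().
sample = (
    "overlay 151552 0 - Live 0x0000000000000000\n"
    "nf_conntrack 172032 1 - Live 0x0000000000000000\n"
)
print(missing_modules(sample, REQUIRED))  # ['br_netfilter']
```

Running a check like this from a post-reboot hook or a monitoring job makes a silently dropped module visible before the cluster degrades.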

Firewall settings#

Kubernetes itself actively manages the firewall settings on the master and worker nodes to ensure proper communication management between nodes and pods. Introducing additional firewall settings runs the risk of interrupting Workbench functionality. Please make sure that additional firewall configurations are disabled or confirmed to be compatible with Workbench. This is another common configuration that can be corrupted by automated system management tools.

The Linux audit daemon (auditd)#

The Linux audit system provides a flexible method to detect and log a variety of system issues, and is a genuinely useful tool that is commonly enabled on servers hosting the Kubernetes stack. However, auditd monitoring can place a significant burden on Kubernetes operation and on disk performance. For this reason, we have the following guidelines for exceptions and exclusions:

/var/lib/gravity must be excluded from auditd monitoring.
/opt/anaconda should be excluded as well. That said, we do not have strong evidence that system instability can be tied to monitoring of that directory.
If managed persistence is hosted on the master node, then we encourage the exclusion of that directory as well. Conda environment management performs a significant number of disk operations, and slowing these operations can significantly diminish the user experience.

Antivirus / antimalware#

Our customers utilize a variety of Linux antivirus and antimalware scanning tools, some of which include an on-demand scanning component. As with auditd, this scanning introduces a significant burden on proper Kubernetes operation. For this reason, our guidance for on-demand scanning mirrors that of auditd. In particular, /var/lib/gravity must be excluded from on-demand scanning.

System management software#

One frequent culprit in the sudden loss of Workbench functionality is system management software such as Chef or Puppet. Tools such as these are designed to automate and simplify the management of large numbers of servers. Where they run afoul of Workbench is when the application requires exceptions to configurations enforced by these tools. It is essential that those exceptions are properly enabled. Otherwise, these tools can make fatal modifications to the underlying operating system unannounced: removing necessary kernel modules, reinstalling firewall rules, removing auditd exceptions, and others. If your organization uses tools such as these, please review the Workbench system requirements with the team that manages them and confirm that the necessary exceptions are permanently engaged, with clear documentation as to why. Otherwise, we find that customers will eventually encounter administrators who remove these important configuration details and thereby disrupt the operation of Workbench.

Backup solutions#

Many organizations employ backup solutions on any server running critical applications or production environments. It is important to exclude Gravity from any scheduled backup, as backing it up will cause severe disk pressure. Workbench has its own scripts that can be used to make a backup of the application on a regular basis.

Disks#

Disk space#

The disk space requirements specified for Gravity installations for /var/lib/gravity, /opt/anaconda, and /tmp must be respected. The installer includes disk space checks in its pre-flight checks.
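A check of this kind can also be run by hand before installation. The following is a minimal sketch using Python's standard `shutil.disk_usage`; it is not the installer's actual pre-flight logic, and the thresholds in `EXAMPLE_REQUIREMENTS_GB` are placeholders that must be replaced with the values from the documented system requirements.

```python
import shutil


def check_free_space(requirements_gb, disk_usage=shutil.disk_usage):
    """Return {path: (free_gb, required_gb)} for every path that falls short.

    A path that cannot be inspected (for example, because it does not exist
    yet) is reported with free_gb set to None.
    """
    failures = {}
    for path, required_gb in requirements_gb.items():
        try:
            free_gb = disk_usage(path).free / 1024**3
        except OSError:
            failures[path] = (None, required_gb)
            continue
        if free_gb < required_gb:
            failures[path] = (round(free_gb, 1), required_gb)
    return failures


# Placeholder thresholds, NOT Anaconda's documented values.
EXAMPLE_REQUIREMENTS_GB = {"/var/lib/gravity": 300, "/opt/anaconda": 100, "/tmp": 30}

for path, (free, needed) in check_free_space(EXAMPLE_REQUIREMENTS_GB).items():
    print(f"{path}: {free} GB free, {needed} GB required")
```

The `disk_usage` parameter exists only so the function can be exercised without touching a real filesystem; in normal use the default is sufficient.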

With managed persistence, generous disk space allocations are even more important. This disk holds a copy of every project (and one copy for each collaborator), and every custom conda environment created by users. A single conda environment can consume multiple gigabytes. For this reason, we recommend that this disk start at 1TB and, preferably, support live resizing.

I/O performance#

Low disk latency and high throughput in the /var/lib/gravity directory are essential for the stability of the platform. In particular, the master node hosts the Kubernetes etcd key-value store there.

In practice, we have found that the use of platter disks for /var/lib/gravity is a primary cause of system instability. Use of an SSD for this directory is effectively required. Direct-attached storage is preferred whenever possible, but we do believe that a sufficiently performant network-attached storage volume for /opt/anaconda is acceptable. Indeed, our positive experience with shared storage for BYOK8s installations validates this belief.

Auditing and antivirus software#

As mentioned above, auditd daemons and antivirus software can significantly impact effective disk performance. For this reason, we mention here as well that the guidelines listed above for these tools must be honored.

Managed persistence#

The new Managed Persistence functionality of Workbench requires the use of a shared volume that is accessible from all nodes, master and worker. So far, our customers have found that a performant enterprise NAS offers sufficient performance for their needs.

It is possible to export a directory from the master node via NFS, and we have multiple successful implementations using this approach. However, if an independent file sharing option is available, Anaconda recommends using that instead, so that the master node can focus on Kubernetes-related duties.

As our real-world experience with this feature is more limited, we will update these recommendations as more information comes in.

Cloud-specific concerns#

Ensuring sufficient disk I/O performance is essential for a successful cloud-based implementation of Workbench. Fortunately, the common cloud providers make this a relatively straightforward thing to achieve. If possible, select VMs with attached SSDs large enough to hold /var/lib/gravity. When it is necessary to use additional attached block volumes, respect the IOPS recommendations in our system requirements. Each cloud provider offers different mechanisms for ensuring disk performance.

In practice, the larger the disk, the higher the base IOPS performance. If you are generous with disk space, you are less likely to have issues.
With some providers (e.g., Azure), the only mechanism for increasing performance is to increase the disk size.
Providers like AWS offer provisioned IOPS, allowing you to specify size and IOPS separately. This is a reasonable approach, and may enable lower costs, but Anaconda recommends at least studying the cost of a larger disk instead of simply boosting IOPS.
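The relationship between disk size and baseline IOPS described above can be modeled as a linear rate with a floor and a ceiling. The sketch below assumes such a model; the default rate, floor, and ceiling are illustrative placeholders rather than figures for any particular provider or volume type, so check your provider's documentation for the real values.

```python
import math


def baseline_iops(size_gib, iops_per_gib=3, floor=100, ceiling=16000):
    """Baseline IOPS for a volume whose performance scales with its size."""
    return max(floor, min(ceiling, size_gib * iops_per_gib))


def size_for_iops(target_iops, iops_per_gib=3):
    """Smallest volume size (GiB) whose linear baseline reaches target_iops."""
    return math.ceil(target_iops / iops_per_gib)


print(baseline_iops(100))   # 300
print(baseline_iops(1000))  # 3000
print(size_for_iops(3000))  # 1000
```

Comparing the price of the size returned by `size_for_iops` against the price of a smaller volume with separately purchased IOPS is the cost study suggested above.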

Network#

It is vitally important that the nodes of the cluster have unfettered access to each other. Whenever network performance is impacted by hardware or operating system issues, the Kubernetes cluster will be unstable, and thus so will Workbench itself.

Private networking#

For very understandable reasons, customers usually need to place Workbench behind a firewall or VPN. It is important that this firewall does not interrupt communication between nodes, however. If possible, use private networking to connect the nodes to each other so that they may communicate over more direct connections even as the public-facing access to the cluster is restricted.

Load balancing#

Workbench does not currently support being placed behind an SSL termination load balancer. Our experience is that it will function properly behind an SSL passthrough load balancer, however.

Proxies#

Proxies may be required to access external data stores, repositories, and so forth. However, they must not be required for the nodes to speak with each other, and proxies must not be enabled at the OS/system level.

WAN accelerators (IDS, packet caching, etc.)#

Network acceleration technology should be disabled. Kubernetes needs to manage its own traffic shaping.

Shared volume (NFS) access#

As is commonly understood, losing access to an external NFS share can cause disk waits and other significant issues on Unix machines. This is true for Kubernetes clusters as well. The platform can be expected to behave unreliably until access to any attached NFS volumes is restored. Interruptions to access for the managed persistence volume in particular will be severely disruptive.

Cloud vs. On-premise#

Most of our customers know in advance whether they will be deploying onto on-premise hardware or a major cloud provider (AWS, Google, Azure). Others are free to choose either, and look to us for advice on which to prefer.

In our experience, cloud installations are smoother and more reliable for a number of reasons:

It is easier to ensure that the hardware requirements are met. For each of the major cloud providers, we can recommend specific instance types that are known to provide good performance for Workbench.
There tends to be less additional software installed on cloud hardware, reducing the likelihood of unexpected behavior caused by interactions with the Gravity stack.
The provisioning process is faster, as is the process of adding additional nodes or disk when required.
We have found it significantly easier to ensure a compatible GPU configuration in the cloud. On-premise GPU nodes often require BIOS modifications or other configuration changes to successfully deploy.

That is not to say that cloud installations are always perfectly smooth. Indeed, all of the guidance in this document applies to both cloud and on-premise installations, and we have included cloud-specific amplifications above.

Bring Your Own Kubernetes#

At a high level, many of the recommendations offered above have been developed with the assumption of an Anaconda-supplied, Gravity/Planet-based Kubernetes stack. In contrast, our BYOK8s customers will be able to leverage existing Kubernetes resources: either an on-premise Kubernetes cluster already configured to support multiple tenants, or a managed Kubernetes offering such as EKS (AWS), AKS (Azure), or GKE (Google). In these scenarios, many of the above concerns are not relevant:

Concerns about disk performance for /var/lib/gravity are tied to the need to ensure a performant Kubernetes stack.
Operating system requirements will likely be settled either by the Kubernetes administrator or the managed Kubernetes provider.
Workbench will likely not have access to the Kubernetes control plane; instead, its own application containers will be running on worker nodes alongside user workloads.

In short, most of the system requirements we have historically offered for Workbench center around ensuring a reliable and performant Kubernetes cluster. Most of our requirements, therefore, are superseded by the requirements imposed by your cluster.

Assuming the existence of a stable Kubernetes cluster, therefore, here is a list of some of the “requirements” that remain. These include some special caveats that we have accumulated from experience with customers who may be new to the use of Kubernetes to deploy resource-intensive data science workloads.

Docker image sizes#

Our Docker images are larger than many Kubernetes administrators are accustomed to. In particular, the Docker image on which users run their sessions, deployments, and jobs is nearly 20GB uncompressed. This is probably the most difficult requirement for some Kubernetes administrators to swallow. Here are a couple of points to emphasize when discussing this with your administrators.

First: this does not imply that every session, deployment, and job will consume 20GB of disk space. Docker images are shared across all containers that utilize them. Therefore, the disk space consumed by the image is amortized across all of its uses on a given node.

Second: the primary reason for this disk consumption is the set of pre-baked, global data science environments contained in this image. Future versions of Workbench will have the option to remove those environments or move them to shared storage; however, the image size is unlikely ever to drop below 5GB.
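The amortization argument in the first point can be made concrete with a small calculation. The helper names below are hypothetical, and the sketch deliberately ignores each container's own writable layer, which adds a per-container amount on top of the shared image.

```python
def naive_image_disk_gb(image_gb, containers_on_node):
    """Disk use if every container kept its own private copy of the image."""
    return image_gb * containers_on_node


def shared_image_disk_gb(image_gb, containers_on_node):
    """Disk use when one stored image is shared by all containers on the node."""
    return image_gb if containers_on_node > 0 else 0


# Ten user containers on one node, using the roughly 20GB image discussed above.
print(naive_image_disk_gb(20, 10))   # 200
print(shared_image_disk_gb(20, 10))  # 20
```

The gap between the two numbers is the point to emphasize with administrators: the image cost is paid once per node, not once per session.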

In our experience, the response to our image sizes among Kubernetes administrators is somewhat bimodal: some react strongly negatively to it, while others have already seen images of comparable size.

Resource profiles#

In our experience, Kubernetes administrators who are not accustomed to serving data science workloads will be surprised by our requirements. For many microservice workloads, CPU limits of less than a single core, and memory limits of less than 1GB, will be very common. Data science workloads require several times this much per session.

On the other hand, our standard oversubscription recommendation of 4:1—that is, the ratio between our memory/CPU limits and requests values—is a somewhat standard choice. Higher levels of oversubscription will result in sporadic performance issues for your users.
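To make the limits-to-requests ratio concrete, the sketch below derives request values from a profile's limits at 4:1 and formats them using standard Kubernetes quantity notation (`m` for millicores, `Mi` and `Gi` for memory). The function `resource_profile` and its output layout are assumptions that mirror the generic Kubernetes container `resources` stanza; they are not Workbench's own resource profile configuration format.

```python
def resource_profile(cpu_limit, memory_limit_gb, oversubscription=4):
    """Build a Kubernetes-style resources mapping with requests = limits / ratio."""
    cpu_request_millicores = int(round(cpu_limit * 1000 / oversubscription))
    memory_request_mib = int(round(memory_limit_gb * 1024 / oversubscription))
    return {
        "limits": {"cpu": str(cpu_limit), "memory": f"{memory_limit_gb}Gi"},
        "requests": {
            "cpu": f"{cpu_request_millicores}m",
            "memory": f"{memory_request_mib}Mi",
        },
    }


print(resource_profile(2, 8))
# {'limits': {'cpu': '2', 'memory': '8Gi'}, 'requests': {'cpu': '500m', 'memory': '2048Mi'}}
```

Kubernetes schedules pods against the requests and enforces the limits, which is why the ratio between the two determines how heavily a node can be oversubscribed.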

We reiterate here what we emphasized in the CPU and Memory section above: do not compromise the CPU and memory allocations for your users.

Storage#

The /opt/anaconda/storage volume does not have the same strict performance requirements that /var/lib/gravity has on a Gravity installation. However, we definitely encourage the use of a “premium” performance tier for this volume if possible.

A high-performance storage tier should be chosen for the managed persistence volume as well. Remember, users will be interacting with that volume to create Python environments and run data science workloads. Performance limitations on this volume will directly impact the user experience.

Security#

In OpenShift (OCP), containers by default will not run as root, and will use the Restricted Security Context Constraint (SCC). However, to use certain features such as authenticated NFS, we may need to allow pods to use the “anyuid” SCC.

Replacing the Ops Center#

If you have administered a Gravity-based Workbench installation, you are accustomed to using the Ops Center for cluster configuration / management / monitoring. This was a feature unique to Gravitational installs, as it was provided by the Gravity site pod. In a BYOK8s environment, you will need to use the built-in management / configuration / monitoring tools provided by your k8s platform.

Autoscaling#

We do not yet support autoscaling, but we are investigating its feasibility. It is important to note, however, that scaling down Workbench (that is, reducing the number of nodes consumed) is not likely to be feasible in an automatic fashion. This is because downscaling requires moving workload from the nodes being decommissioned onto the remaining nodes. For user sessions, that is not something you should do without warning or planning, as doing so can interrupt active work.