Preparing a BYOK8s environment for Workbench

Determining the resource requirements for a Kubernetes cluster depends on a number of factors, including the types of applications you plan to run, the number of concurrently active users, and the workloads you will manage within the cluster. Data Science & AI Workbench’s performance is tightly coupled with the health of your Kubernetes stack, so it is important to allocate enough resources to manage your users’ workloads.

Anaconda’s hardware recommendations ensure a reliable and performant Kubernetes cluster. However, most of these requirements are likely to be superseded by those imposed by your existing Kubernetes cluster, whether that is an on-premises cluster configured to support multiple tenants or a cloud offering.

To install Workbench successfully, your systems must meet or exceed the requirements listed below. Anaconda has created a pre-installation checklist to help prepare you for installation. The checklist helps you verify that your cluster is ready to install Workbench, and that the necessary resources are reserved. Anaconda’s Implementation team will review the checklist with you prior to your installation.

Supported Kubernetes versions

Workbench is compatible with Kubernetes API versions 1.15 through 1.28. If your Kubernetes distribution uses an API version within this range, you can install Workbench.

Workbench has been successfully installed on the following Kubernetes variants:

  • Vanilla Kubernetes

  • VMware Tanzu

  • Red Hat OpenShift

  • Google Anthos

  • Amazon Elastic Kubernetes Service (EKS)

  • Microsoft Azure Kubernetes Service (AKS)

  • Google Kubernetes Engine (GKE)

  • Red Hat OpenShift Service on AWS (ROSA)

Aside from the basic requirements listed on this page, Anaconda also offers environment-specific recommendations for you to consider.

Administration server

Installation requires a machine with direct access to the target Kubernetes cluster and Docker registry. Anaconda refers to this machine as the Administration Server. Anaconda recommends choosing a machine that will remain available for ongoing management of the application after installation. Ideally, this server should also be able to mount the storage volumes.

The following software must be installed on the Administration Server:

  • Helm version 3.2 or later

  • The Kubernetes CLI tool, kubectl

  • (OpenShift only) The OpenShift CLI tool, oc

  • (Optional) The watch command-line tool

  • (Optional) The jq command-line tool

Administration server setup

You can obtain all of the tools you need for your Administration Server by installing the ae5-conda environment. This environment already contains helm, kubectl, oc, jq, and a number of other useful Workbench management utilities. To install the environment:

  1. Download the environment.

  2. If necessary, move the environment to the Administration Server.

  3. Open a terminal shell and install the environment by running the following command:

    bash ae5-conda-latest-Linux-x86_64.sh
    
  4. Follow the prompts, then restart your terminal.

  5. (Optional) Add the environment to your PATH.

CPU, memory, and nodes

  • Minimum node size: 8 CPU cores, 32GB RAM

  • Recommended node size: 16 CPU cores, 64GB RAM (or more)

  • Recommended oversubscription (limits/requests ratio): 4:1

  • Minimum number of worker nodes: 3

Minimally sized nodes should be reserved for test environments with low user counts and small workloads. Any development or production environment should meet or exceed the recommended requirements.
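The 4:1 oversubscription ratio refers to the ratio between a container's resource limits and its resource requests. The following generic sketch shows what that ratio looks like on a plain Kubernetes pod; the pod name and image are hypothetical, and it does not show how Workbench resource profiles express these values internally.

```yaml
# Hypothetical pod illustrating a 4:1 limits/requests ratio
apiVersion: v1
kind: Pod
metadata:
  name: oversubscription-example   # hypothetical name
spec:
  containers:
    - name: example
      image: busybox:1.36          # hypothetical image
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"                 # the scheduler reserves 1 core on the node
          memory: 2Gi
        limits:
          cpu: "4"                 # the container may burst up to 4 cores (4:1)
          memory: 8Gi              # 8Gi / 2Gi = 4:1
```

Because the scheduler places pods according to their requests, a 4:1 ratio lets more sessions share a node than the sum of their limits would allow, at the cost of CPU throttling or memory pressure if many workloads burst at the same time.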

Workbench uses node labels, taints, and tolerations to ensure that workloads run on the appropriate nodes. Anaconda recommends identifying any necessary affinity settings (rules that determine which nodes a workload can run on, based on node labels) or toleration settings prior to installation.
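As a sketch, assume you have dedicated a set of nodes to user workloads with a hypothetical label and taint, workbench/role=user. The tolerations and affinity sections of the values.byok.*.yaml templates (see Helm charts) could then steer user sessions onto those nodes. This assumes the user entries are passed through to the pod spec as standard Kubernetes tolerations and affinity objects; confirm the expected format with the Anaconda Implementation team.

```yaml
# Hypothetical node preparation (run once per user node):
#   kubectl label nodes <NODE> workbench/role=user
#   kubectl taint nodes <NODE> workbench/role=user:NoSchedule
tolerations:
  # Allow user pods to schedule onto the tainted nodes
  user:
    - key: "workbench/role"
      operator: "Equal"
      value: "user"
      effect: "NoSchedule"

affinity:
  # Require user pods to run only on nodes carrying the label
  user:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "workbench/role"
                operator: In
                values: ["user"]
```

The taint keeps other pods off the dedicated nodes, while the affinity keeps user pods on them; neither setting achieves both on its own.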

Resource profiles

Resource profiles available to platform users for their sessions and deployments are created by the cluster administrator. Each resource profile can be customized for the amount of CPU, memory, and (optionally) GPU resources available. Anaconda recommends determining what resource profiles you will require prior to installation. For more information, see Configuring workload resource profiles.

Namespace, service account, RBAC

Install Workbench in a namespace that is not occupied by any other applications, including other instances of Workbench. Create a service account for the namespace with sufficient permissions to complete the Helm installation and to enable the dynamic resource provisioning that Workbench performs during normal operation.
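A minimal sketch of the namespace and service account, using the same <NAMESPACE> and <SERVICE_ACCOUNT> placeholders as the RBAC templates in this section. The values.byok.*.yaml templates can also create the service account (serviceAccount.create), so a manual manifest like this is only needed when that option is disabled in your configuration.

```yaml
# Dedicated namespace for Workbench; do not share it with other applications
apiVersion: v1
kind: Namespace
metadata:
  name: <NAMESPACE>
---
# Service account for the Helm installation and Workbench's runtime provisioning;
# its name should match global.serviceAccount.name in your values file
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <SERVICE_ACCOUNT>
  namespace: <NAMESPACE>
```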

Workbench requires more permissions than would normally be given to an application that only requires read-only access to the Kubernetes API. However, with the exception of the ingress controller, all necessary permission grants are limited to the application namespace. Please speak with the Anaconda Implementation team about any questions you may have regarding these permissions.

RBAC template

The following Role and RoleBinding pair grants sufficient permissions to the service account.

# Replace <SERVICE_ACCOUNT> with the name of your service account
# Replace <NAMESPACE> with the namespace reserved for Workbench
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: <SERVICE_ACCOUNT>
  namespace: <NAMESPACE>
rules:
  - verbs: [ "get", "list" ]
    apiGroups: [ "" ]
    resources: [ "namespaces", "pods/log", "events" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "" ]
    resources: [ "configmaps", "secrets", "pods", "persistentvolumeclaims", "endpoints", "services" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "apps" ]
    resources: [ "deployments", "replicasets", "statefulsets" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "batch" ]
    resources: [ "jobs", "cronjobs" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "extensions" ]
    resources: [ "deployments", "replicasets" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "networking.k8s.io" ]
    resources: [ "ingresses" ]
  - verbs: [ "create", "delete", "get", "list", "patch", "update", "watch" ]
    apiGroups: [ "route.openshift.io" ]
    resources: [ "routes", "routes/custom-host" ]
  - verbs: [ "get", "list" ]
    apiGroups: [ "" ]
    resources: [ "serviceaccounts" ]
  # Roles belong to the rbac.authorization.k8s.io API group, not the core ("") group
  - verbs: [ "get", "list" ]
    apiGroups: [ "rbac.authorization.k8s.io" ]
    resources: [ "roles" ]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <SERVICE_ACCOUNT>
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: <SERVICE_ACCOUNT>
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>

Note

Recent versions of OpenShift no longer allow granting direct access to the anyuid Security Context Constraint (SCC), or to any other default SCC. Instead, grant access by adding a rule for the SCC to the role.

Example anyuid SCC configuration
# Add this rule to the rules list of the Role
- verbs:
    - use
  apiGroups:
    - security.openshift.io
  resources:
    - securitycontextconstraints
  resourceNames:
    - anyuid

If you want to use the Anaconda-supplied ingress, it is also necessary to grant a small number of additional, cluster-wide permissions. This is because the ingress controller expects to be able to monitor ingress-related resources across all namespaces.

Ingress controller permissions

The following is a minimal ClusterRole and ClusterRoleBinding pair that grants the ingress controller sufficient permissions to run without warnings:

# Ingress Controller Permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <SERVICE_ACCOUNT>-ingress
  annotations:
    anaconda-rbac: "true"
rules:
  - verbs: [ "create", "delete", "get", "list"]
    apiGroups: [ "*" ]
    resources: [ "ingressclasses" ]
  - verbs: [ "patch", "create" ]
    apiGroups: [ "*" ]
    resources: ["events"]
  - verbs: ["list", "watch"]
    apiGroups: ["*"]
    resources: [ "secrets", "endpoints", "ingresses", "endpointslices", "services", "pods" ]
---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: <SERVICE_ACCOUNT>-ingress
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: <SERVICE_ACCOUNT>-ingress
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>

If you want to use the Kubernetes Dashboard and resource monitoring services included with Workbench, you must include additional permissions for each service.

Note

You must establish permissions for both Prometheus and kube-state-metrics to use Workbench’s resource monitoring features.

Kubernetes dashboard permissions
# Dashboard - Application
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-dashboard
  namespace: <NAMESPACE>
rules:
  # Allow Dashboard to get, update and delete Dashboard exclusive secrets.
  - apiGroups: [ "" ]
    resources: [ "secrets" ]
    resourceNames: [ "kubernetes-dashboard-key-holder", "kubernetes-dashboard-certs", "kubernetes-dashboard-csrf" ]
    verbs: [ "get", "update", "delete" ]
  # Allow Dashboard to get and update 'kubernetes-dashboard-settings' config map.
  - apiGroups: [ "" ]
    resources: [ "configmaps" ]
    resourceNames: [ "kubernetes-dashboard-settings" ]
    verbs: [ "get", "update" ]
  # Allow Dashboard to get metrics.
  - apiGroups: [ "" ]
    resources: [ "services" ]
    resourceNames: [ "heapster", "dashboard-metrics-scraper" ]
    verbs: [ "proxy" ]
  - apiGroups: [ "" ]
    resources: [ "services/proxy" ]
    resourceNames: [ "heapster", "http:heapster:", "https:heapster:", "dashboard-metrics-scraper", "http:dashboard-metrics-scraper" ]
    verbs: [ "get" ]
---

# Dashboard - Application
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <RELEASE_NAME>-dashboard
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: <RELEASE_NAME>-dashboard
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>
---

# Dashboard - Application (Bound with RoleBinding for namespace scoping)
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-dashboard-metrics
rules:
  # Allow Metrics Scraper to get metrics from the Metrics server
  - apiGroups: [ "metrics.k8s.io" ]
    resources: [ "pods", "nodes" ]
    verbs: [ "get", "list", "watch" ]
---

# Dashboard - Namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-dashboard-namespace
  namespace: <NAMESPACE>
rules:
  # Allow Dashboard to manage resources within the namespace.
  - verbs: [ "get", "list", "watch", "patch", "update", "delete", "create" ]
    apiGroups: [ "*" ]
    resources:
      # Workloads
      - "cronjobs"
      - "daemonsets"
      - "deployments"
      - "jobs"
      - "pods"
      - "pods/log"
      - "pods/exec"
      - "replicasets"
      - "replicationcontrollers"
      - "statefulsets"
      # Services
      - "ingresses"
      - "ingressclasses"
      - "services"
      # Config and Storage
      - "configmaps"
      - "persistentvolumeclaims"
      - "secrets"
      - "storageclasses"
      # Cluster
      - "clusterrolebindings"
      - "clusterroles"
      - "events"
      - "namespaces"
      - "networkpolicies"
      - "nodes"
      - "persistentvolumes"
      - "rolebindings"
      - "roles"
      - "serviceaccounts"
---

# Dashboard - Namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <RELEASE_NAME>-dashboard-namespace
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: <RELEASE_NAME>-dashboard-namespace
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>
---

# Dashboard - Application Metrics
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <RELEASE_NAME>-dashboard-metrics-namespace
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: <RELEASE_NAME>-dashboard-metrics
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>

Prometheus permissions
# Prometheus Common Permissions
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-prometheus
rules:
  - verbs: [ "get", "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods", "ingresses", "configmaps" ]
  - verbs: [ "get", "list", "watch"]
    apiGroups: [ "extensions",  "networking.k8s.io"]
    resources: [ "ingresses/status", "ingresses"]
  - verbs: [ "get" ]
    nonResourceURLs: [ "/metrics" ]
---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <RELEASE_NAME>-prometheus
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: <RELEASE_NAME>-prometheus
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>

kube-state-metrics permissions
# Kubestatemetrics Exporter Namespace Binding
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-kubestatemetrics
  namespace: <NAMESPACE>
rules:
  - verbs: [ "list", "watch" ]
    apiGroups: [ "certificates.k8s.io" ]
    resources: ["certificatesigningrequests"]
  - verbs: [ "list", "watch" ]
    apiGroups: [""]
    resources: ["configmaps"]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "batch" ]
    resources: [ "cronjobs" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "extensions", "apps" ]
    resources: [ "daemonsets" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "extensions", "apps"]
    resources: [ "deployments" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [""]
    resources: [ "endpoints" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "autoscaling" ]
    resources: [ "horizontalpodautoscalers" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "extensions", "networking.k8s.io" ]
    resources: [ "ingresses" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "batch" ]
    resources: ["jobs"]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "coordination.k8s.io" ]
    resources: [ "leases" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "limitranges" ]
  - verbs: ["list", "watch"]
    apiGroups: [ "admissionregistration.k8s.io" ]
    resources: [ "mutatingwebhookconfigurations" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "namespaces" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "networking.k8s.io" ]
    resources: [ "networkpolicies" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "nodes" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "persistentvolumeclaims" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "persistentvolumes" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "policy" ]
    resources: [ "poddisruptionbudgets" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "pods" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "extensions", "apps" ]
    resources: [ "replicasets" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "replicationcontrollers" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "resourcequotas" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "secrets" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "" ]
    resources: [ "services" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "apps" ]
    resources: [ "statefulsets" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "storage.k8s.io" ]
    resources: [ "storageclasses" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "admissionregistration.k8s.io" ]
    resources: [ "validatingwebhookconfigurations" ]
  - verbs: [ "list", "watch" ]
    apiGroups: [ "storage.k8s.io" ]
    resources: [ "volumeattachments" ]
---

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: <RELEASE_NAME>-kubestatemetrics
  namespace: <NAMESPACE>
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: <RELEASE_NAME>-kubestatemetrics

Caution

Please review these RBAC configurations with your Kubernetes administrator. While it is possible to further reduce these scopes, doing so is likely to prevent normal operation of Workbench.

Security

Preparing your Kubernetes cluster to install Workbench involves configuring your environment in a way that both supports the functionality of Workbench and adheres to security best practices.

Workbench containers can be run using any fixed, non-zero UID, making the application compatible with an OpenShift Container Platform (OCP) restricted SCC, or an equivalent non-permissive Kubernetes security context. This reduces the risk to your systems if the container is compromised.
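For reference, a non-permissive Kubernetes security context of the kind described above might look like the following generic sketch. It is not the exact context the Workbench chart renders; the UID 1000 mirrors the runAsUser default in the values templates, and the pod name and image are hypothetical.

```yaml
# Generic example of a non-permissive security context (hypothetical pod)
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-example
spec:
  securityContext:
    runAsNonRoot: true        # refuse to start a container that would run as UID 0
    runAsUser: 1000           # a fixed, non-zero UID
  containers:
    - name: example
      image: busybox:1.36     # hypothetical image
      command: ["sleep", "3600"]
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]       # remove all Linux capabilities
```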

However, to enable the Authenticated Network File System (NFS) capability, which allows user containers to access external, authenticated file shares (storage servers), user pods must be permitted to run as root (UID 0).

Note

This configuration runs containers in a privileged state only to determine and assign the authenticated group memberships of the user running the container. Once authentication is complete, the container drops to a non-privileged state for all further execution.

Please speak with your Anaconda Implementation team for more information, and to see if it is possible for your application to avoid this requirement.

Storage

A standard installation of Workbench requires two Persistent Volume Claims (PVCs) to be statically provisioned and bound prior to installation. If a premium performance tier is available for provisioning your PVCs, Anaconda strongly recommends using it. The necessary volumes are described below:

anaconda-storage

This volume contains:

  • Anaconda’s internal Postgres control database

  • Anaconda’s internal Git storage mechanism

  • Anaconda’s internal conda package repository

Note

If you are hosting conda packages outside of Workbench, your anaconda-storage volume must be at least 100GiB.

However, if you intend to mirror conda packages into the Workbench repository, the anaconda-storage volume will need to be much larger to accommodate those packages. Anaconda recommends at least 500GiB of storage.

The anaconda-storage volume must support either the ReadWriteOnce or ReadWriteMany access mode.

Caution

The ReadWriteOnce configuration requires that the postgres, git-storage, and object-storage pods run on the same node.
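Because the volumes must be statically provisioned and bound before installation, a claim for anaconda-storage could look like the following sketch. The PersistentVolume name is hypothetical, and the claim name must match the storage.pvc value in your values.byok.*.yaml file.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: anaconda-storage
  namespace: <NAMESPACE>
spec:
  accessModes:
    - ReadWriteMany          # ReadWriteOnce also works, subject to the same-node constraint
  # An empty storageClassName disables dynamic provisioning, so the claim
  # can only bind to a pre-created PersistentVolume
  storageClassName: ""
  volumeName: anaconda-storage-pv   # hypothetical PersistentVolume name
  resources:
    requests:
      storage: 500Gi         # 100Gi minimum if conda packages are hosted outside Workbench
```

After applying the claim, kubectl get pvc -n <NAMESPACE> should report its status as Bound before you begin the installation.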

anaconda-persistence

This volume hosts Anaconda’s managed persistence storage. It contains:

  • Custom sample projects

  • Custom conda environments

  • User code and data

The anaconda-persistence volume requires ReadWriteMany access.

Caution

The demands on the anaconda-persistence volume will continuously grow with usage. Anaconda recommends you provision at least 1TiB of storage to start.

You can combine both volumes into a single PersistentVolumeClaim, as long as that volume simultaneously meets the requirements of both: ReadWriteMany access, enough capacity for both sets of data, and the performance demands of both workloads.

The root directories of these storage volumes must be writable by Workbench containers. This can be accomplished by configuring the volumes to be group writable by a single numeric GroupID (GID). Anaconda strongly recommends that this GID be 0. This is the default GID assigned to Kubernetes containers. If this is not possible, supply the GID within the Persistent Volume specification as a pv.beta.kubernetes.io/gid annotation.

Caution

To ensure that the data on these volumes is not lost if Workbench is uninstalled, do not change the ReclaimPolicy from its default value of Retain.
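As a sketch, assuming an NFS-backed volume (the server address, export path, volume name, and GID below are all hypothetical), a statically provisioned PersistentVolume for anaconda-persistence that applies both the GID annotation and the Retain reclaim policy might look like this:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: anaconda-persistence-pv          # hypothetical name
  annotations:
    # Only needed if the volume root cannot be group-writable by GID 0
    pv.beta.kubernetes.io/gid: "5000"    # hypothetical GID
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany                      # required for anaconda-persistence
  # Retain is already the default for manually created volumes; stated explicitly here
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  nfs:
    server: nfs.example.com              # hypothetical NFS server
    path: /exports/anaconda-persistence  # hypothetical export path
```

A matching PersistentVolumeClaim should request the same access mode, set storageClassName to an empty string, and set volumeName to this volume's name so that it binds only to this volume.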

Ingress and firewall

Workbench is compatible with most ingress controllers that are commonly used with Kubernetes clusters. Because ingress controllers are a cluster-wide resource, Anaconda recommends that the controller be installed and configured prior to installing Workbench. For example, if your Kubernetes version falls within 1.19-1.26, any ingress controller with full support for the networking.k8s.io/v1 ingress API enables Workbench to build endpoints for user sessions and deployments.
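If you use an existing controller, the ingress section of your values.byok.*.yaml file needs to reference its IngressClass. A minimal sketch, assuming a hypothetical class named nginx (running kubectl get ingressclasses lists the classes actually present in your cluster):

```yaml
ingress:
  # Use the controller that is already running in the cluster
  install: false
  # Must match the IngressClass of the existing controller (hypothetical name)
  className: "nginx"
```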

Note

If your cluster is fully dedicated to Workbench, you can configure the Helm chart to install a version of the NGINX Ingress controller, which is compatible with multiple variants of Kubernetes, including OpenShift. Anaconda’s only modification to the stock NGINX container enables it to run without root privileges.
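Conversely, for a dedicated cluster where the chart installs the NGINX Ingress controller, the relevant values might look like the following sketch. The class name is hypothetical; according to the template comments, className cannot be empty when install is true, and installClass controls whether the chart also creates the IngressClass resource.

```yaml
ingress:
  # Install the Anaconda-supplied NGINX Ingress controller in this namespace
  install: true
  # Also create a matching IngressClass resource
  installClass: true
  # Required when install is true (hypothetical class name)
  className: "anaconda-nginx"
```

Installing the controller this way also requires the cluster-wide ingress permissions described under Namespace, service account, RBAC.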

Your cluster configuration and firewall settings must allow all TCP traffic between nodes, particularly HTTP (port 80), HTTPS (port 443), and the standard Postgres port (5432).

Caution

A cluster can appear healthy while still blocking inter-node communication, which disrupts the pods that Workbench requires to provision user workloads.

All external traffic to Workbench is funneled through the ingress controller on the standard HTTPS port, 443.

DNS/SSL

Workbench requires the following:

  • A valid, fully qualified domain name (FQDN) reserved for Workbench.

  • A DNS record for the FQDN, as well as a wildcard DNS record for its subdomains.

    Note

    Both records must point to the IP address allocated by the ingress controller. If you are using an existing ingress controller, you may be able to obtain this address prior to installation. Otherwise, you must populate the DNS records with the address after the initial installation is complete.

  • A valid wildcard SSL certificate covering the cluster FQDN and its subdomains. Installation requires both the public certificate and the private key.

    Note

    If the certificate chain includes an intermediate certificate, the public certificate for the intermediate is also required. The wildcard only needs to cover one level of subdomains under the FQDN. For example, if your FQDN is anaconda.company.com, the SSL certificate and wildcard DNS record must cover *.anaconda.company.com.

  • The public root certificate, if the above certificates were created with a private Certificate Authority (CA).

Warning

Wildcard DNS records and SSL certificates are required for correct operation of Workbench. If you have any objections to this requirement, speak with your Anaconda Implementation team.

Docker images

Anaconda strongly recommends that you copy the Workbench Docker images from our authenticated source repository on Docker Hub into your internal Docker registry. This ensures their availability even if connectivity to Docker Hub is interrupted. This registry must be accessible from the Kubernetes cluster where Workbench is installed.

Caution

In an air-gapped setting, this is required.
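Once the images are mirrored, you can point the chart at your registry through the global.image settings in your values.byok.*.yaml file. A minimal sketch, assuming a hypothetical registry path and pull secret name:

```yaml
global:
  image:
    # Hypothetical internal registry path; the trailing slash is required
    server: "registry.example.com/anaconda/"
    # Hypothetical pull secret, created beforehand in the Workbench namespace, e.g.:
    #   kubectl create secret docker-registry internal-registry-creds \
    #     --docker-server=registry.example.com \
    #     --docker-username=<USER> --docker-password=<PASSWORD> -n <NAMESPACE>
    pullSecrets:
      - internal-registry-creds
```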

Docker images from the aedev/ Docker Hub channel

Anaconda provides a more precise manifest, including version numbers and the credentials required to pull these images from our authenticated repository, prior to installation.

ae-app-proxy
ae-auth
ae-auth-api
ae-auth-escrow
ae-deploy
ae-docs
ae-editor
ae-git-proxy
ae-git-storage
ae-object-storage
ae-operation-controller
ae-operation-create-project
ae-repository
ae-storage
ae-sync
ae-ui
ae-workspace
ae-wagonwheel
ae-auth-keycloak
ae-nginx-ingress-v1
ae-nginx-ingress-v1beta1
postgres:16.3

Note

Docker images used by Workbench are larger than many Kubernetes administrators are accustomed to. For more background, see Docker image sizes.

GPU Information

This release of Workbench supports Compute Unified Device Architecture (CUDA) Data Center drivers up to version 11.6.

Anaconda has directly tested the application with the following GPU cards:

  • Tesla V100 (recommended)

  • Tesla P100 (adequate)

In principle, Workbench should work with any GPU card supported by the CUDA drivers, as long as the drivers are properly installed. Other cards supported by CUDA 11.6 include:

  • A-Series: NVIDIA A100, NVIDIA A40, NVIDIA A30, NVIDIA A10

  • RTX-Series: RTX 8000, RTX 6000, NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, NVIDIA T1000, NVIDIA T600, NVIDIA T400

  • HGX-Series: HGX A100, HGX-2

  • T-Series: Tesla T4

  • P-Series: Tesla P40, Tesla P6, Tesla P4

  • K-Series: Tesla K80, Tesla K520, Tesla K40c, Tesla K40m, Tesla K40s, Tesla K40st, Tesla K40t, Tesla K20Xm, Tesla K20m, Tesla K20s, Tesla K20c, Tesla K10, Tesla K8

  • M-Class: M60, M40 24GB, M40, M6, M4

Support for GPUs in Kubernetes is still a work in progress, and each cloud vendor provides different recommendations. For more information about GPUs, see Understanding GPUs.
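Kubernetes exposes GPUs as extended resources through a vendor device plugin. As a generic sketch, assuming the NVIDIA device plugin is installed on your GPU nodes (the pod name and image tag are hypothetical), a container requests one GPU like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                               # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04  # hypothetical image tag
      command: ["nvidia-smi"]                     # print the GPUs visible to the container
      resources:
        limits:
          # GPUs are requested through limits and are not shared between containers
          nvidia.com/gpu: 1
```

Unlike CPU and memory, GPUs cannot be oversubscribed: each GPU requested this way is allocated to a single container.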

Helm charts

Workbench uses Helm to streamline the creation, packaging, configuration, and deployment of the application. Helm combines the application’s Kubernetes resource definitions into a single reusable package called a Helm chart, which contains all the resources needed to deploy the application within your cluster, including .yaml configuration templates, services, secrets, and config maps.

Helm values.byok.*.yaml templates

For customer-supplied Kubernetes environments, Workbench includes templated values.byok.*.yaml files that override the default values in the top-level Helm chart. These template files are heavily commented to guide you through configuring the parameters that most commonly require modifications.

If Workbench is the only application present within your cluster, use the single-tenant configuration. If Workbench shares the cluster with other applications, use the multi-tenant configuration. Choose the template that applies to your setup, then update the file now with your current cluster configuration.

Note

Single-tenant clusters use the values.byok.cluster.yaml override file template.

# This values.yaml template is intended to be customized
# for each installation. Its values *augment and override*
# the default values found in Anaconda-Enterprise/values.yaml.

global:
  # global.hostname -- The fully qualified domain name (FQDN) of the cluster.
  # @section -- Global Common Parameters
  hostname: "anaconda.example.com"

  # global.version -- (string) The application version; defaults to `Chart.AppVersion`.
  # @section -- Global Common Parameters
  version:

  # Uncomment for OpenShift only
  # dnsServer: dns-default.openshift-dns.svc.cluster.local

  # The UID under which to run the containers (required)
  runAsUser: 1000

  # Docker registry information
  image:
    # Repository for Workbench images.
    # Trailing slash required if not empty
    server: "aedev/"
    # A single pull secret name, or a list of names, as required
    pullSecrets:

  # Global Service Account Settings
  serviceAccount:
    # global.serviceAccount.name -- Service account name
    # @section -- Global RBAC Parameters
    name: "anaconda-enterprise"

# If the DNS record for the hostname above resolves to an
# address inaccessible from the cluster, supply a valid
# IP address for the ingress or load balancer here.
privateIP: ""

# rbac
serviceAccount:
  # serviceAccount.create -- Controls the creation of the service account
  # @section -- RBAC Parameters
  create: true
rbac:
  # rbac.create -- Controls the creation and binding of rbac resources. This excludes ingress.
  # See `.Values.ingress.install` for additional details on managing rbac for that resource
  # type.
  # @section -- RBAC Parameters
  create: true

# generateCerts -- Generate Self-Signed Certificates.
# `generate`: generate self-signed certificates.
# `load`: use the certificates in Anaconda-Enterprise/certs.
# `skip`: do nothing; assume the secrets already exist.
# Existing secrets are always preserved during upgrades.
# @section -- TLS / SSL Secret Management
generateCerts: "generate"

# Keycloak LDAPS Settings
# truststore: path to your truststore file containing custom CA cert
# truststore_password: password of the truststore
# truststore_secret: name of secret used for the truststore such as anaconda-enterprise-truststore
keycloak:
  # keycloak.truststore -- Java Truststore File
  # @section -- Keycloak Parameters
  truststore: ""

  # keycloak.truststore_password -- Java Truststore Password
  # @section -- Keycloak Parameters
  truststore_password: ""

  # keycloak.truststore_secret -- Java Truststore Secret
  # @section -- Keycloak Parameters
  truststore_secret: ""

  # keycloak.tempUsername --
  # Important note: these have an effect only during
  # initial installation. If an administrative user
  # already exists, these values are ignored.
  # @section -- Keycloak Parameters
  tempUsername: "admin"

  # keycloak.tempPassword --
  # Important note: these have an effect only during
  # initial installation. If an administrative user
  # already exists, these values are ignored.
  # @section -- Keycloak Parameters
  tempPassword: "admin"

ingress:
  # ingress.className -- (string) If an existing ingress controller is being used, this
  # must match the ingress.className of that controller.
  # Cannot be empty if ingress.install is true.
  # @section -- Ingress Parameters
  className:

  # ingress.install -- Ingress Install Control.
  # `false`: an existing ingress controller will be used.
  # `true`: install an ingress controller in this namespace.
  # @section -- Ingress Parameters
  install: false

  # ingress.installClass -- IngressClass Install Control.
  # `false`: an existing IngressClass resource will be used.
  # `true`: create a new IngressClass resource (IngressClasses are cluster-scoped).
  # Ignored if ingress.install is `false`.
  # @section -- Ingress Parameters
  installClass: false

  # ingress.labels -- `.metadata.labels` for the ingress.
  # If your ingress controller requires custom labels to be
  # added to ingress entries, list them here as a dictionary
  # of key/value pairs.
  # @section -- Ingress Parameters
  labels: {}

  # If your ingress requires custom annotations to be added
  # to ingress entries, they can be included here. These
  # will be added to any existing annotations in the chart.
  # For all ingress entries
  global: {}
  # For the master ingress only
  system: {}
  # For sessions and deployments only
  user: {}

# To configure an external Git repository, uncomment this section and fill
# in the relevant values. For more details, consult this page:
# https://enterprise-docs.anaconda.com/en/latest/admin/advanced/config-repo.html
#
# git:
#   type: github-v3-api
#   name: Github.com Repo
#   url: https://api.github.com/
#   credential-url: https://api.github.com/anaconda-test-org
#   organization: anaconda-test-org
#   repository: {owner}-{id}
#   username: somegituser
#   auth-token: 98bcf2261707794b4a56f24e23fd6ed771d6c742
#   http-timeout: 60
#   disable-tls-verification: false
#   create-args: {}

# As discussed in the documentation, you may use the same
# persistent volume for both storage resources. If so, make
# sure to use the same pvc: value in both locations.
storage:
  create: false
  pvc: "anaconda-storage"
persistence:
  pvc: "anaconda-storage"

# TOLERATIONS / AFFINITY
# Please work with the Anaconda team for assistance
# to configure these settings if you need them.

tolerations:
  # For all pods
  global: []
  # For system pods, except the ingress
  system: []
  # For the ingress daemonset alone
  ingress: []
  # For user pods
  user: []

affinity:
  # For all pods
  global: {}
  # For system pods, except the ingress
  system: {}
  # For the ingress daemonset alone
  ingress: {}
  # For user pods
  user: {}

# By default, all ops services are enabled for single-tenant BYOK installations.
# Consult the documentation for details on how to configure each service.

opsDashboard:
  enabled: true
opsMetrics:
  enabled: true
opsGrafana:
  enabled: true

Note

Multi-tenant clusters use the values.byok.namespace.yaml override file template.

# This values.yaml template is intended to be customized
# for each installation. Its values *augment and override*
# the default values found in Anaconda-Enterprise/values.yaml.

global:
  # global.hostname -- The fully qualified domain name (FQDN) of the cluster.
  # @section -- Global Common Parameters
  hostname: "anaconda.example.com"

  # global.version -- (string) The application version; defaults to `Chart.AppVersion`.
  # @section -- Global Common Parameters
  version:

  # Uncomment for OpenShift only
  # dnsServer: dns-default.openshift-dns.svc.cluster.local

  # The UID under which to run the containers (required)
  runAsUser: 1000

  # Docker registry information
  image:
    # Repository for Workbench images.
    # Trailing slash required if not empty
    server: "aedev/"
    # A single pull secret name, or a list of names, as required
    pullSecrets:

  # Global Service Account Settings
  serviceAccount:
    # global.serviceAccount.name -- Service account name
    # @section -- Global RBAC Parameters
    name: "anaconda-enterprise"

  opsMetrics:
    # global.opsMetrics.ownNamespace -- Controls whether scraping rules target release namespace or all namespaces.
    ownNamespace: true

# If the DNS record for the hostname above resolves to an
# address inaccessible from the cluster, supply a valid
# IP address for the ingress or load balancer here.
privateIP: ""

# rbac
serviceAccount:
  # serviceAccount.create -- Controls the creation of the service account
  # @section -- RBAC Parameters
  create: false
rbac:
  # rbac.create -- Controls the creation and binding of rbac resources. This excludes ingress.
  # See `.Values.ingress.install` for additional details on managing rbac for that resource
  # type.
  # @section -- RBAC Parameters
  create: false

# generateCerts -- Generate Self-Signed Certificates.
# `generate`: create new self-signed certificates.
# `load`: use the certificates in Anaconda-Enterprise/certs.
# `skip`: do nothing; assume the secrets already exist.
# Existing secrets are always preserved during upgrades.
# @section -- TLS / SSL Secret Management
generateCerts: "generate"

# Keycloak LDAPS Settings
# truststore: path to your truststore file containing custom CA cert
# truststore_password: password of the truststore
# truststore_secret: name of secret used for the truststore such as anaconda-enterprise-truststore
keycloak:
  # keycloak.truststore -- Java Truststore File
  # @section -- Keycloak Parameters
  truststore: ""

  # keycloak.truststore_password -- Java Truststore Password
  # @section -- Keycloak Parameters
  truststore_password: ""

  # keycloak.truststore_secret -- Java Truststore Secret
  # @section -- Keycloak Parameters
  truststore_secret: ""

  # keycloak.tempUsername --
  # Important note: these have an effect only during
  # initial installation. If an administrative user
  # already exists, these values are ignored.
  # @section -- Keycloak Parameters
  tempUsername: "admin"

  # keycloak.tempPassword --
  # Important note: these have an effect only during
  # initial installation. If an administrative user
  # already exists, these values are ignored.
  # @section -- Keycloak Parameters
  tempPassword: "admin"

ingress:
  # ingress.className -- (string) If an existing ingress controller is being used, this
  # must match the ingress.className of that controller.
  # Cannot be empty if ingress.install is true.
  # @section -- Ingress Parameters
  className:

  # ingress.install -- Ingress Install Control.
  # `false`: an existing ingress controller will be used.
  # `true`: install an ingress controller in this namespace.
  # @section -- Ingress Parameters
  install: false

  # ingress.installClass -- IngressClass Install Control.
  # `false`: an existing IngressClass resource will be used.
  # `true`: create a new IngressClass in the global namespace.
  # Ignored if ingress.install is `false`.
  # @section -- Ingress Parameters
  installClass: false

  # ingress.labels -- `.metadata.labels` for the ingress.
  # If your ingress controller requires custom labels to be
  # added to ingress entries, list them here as a dictionary
  # of key/value pairs.
  # @section -- Ingress Parameters
  labels: {}

  # If your ingress requires custom annotations to be added
  # to ingress entries, they can be included here. These
  # will be added to any existing annotations in the chart.
  # For all ingress entries
  global: {}
  # For the master ingress only
  system: {}
  # For sessions and deployments only
  user: {}

# To configure an external Git repository, uncomment this section and fill
# in the relevant values. For more details, consult this page:
# https://enterprise-docs.anaconda.com/en/latest/admin/advanced/config-repo.html
#
# git:
#   type: github-v3-api
#   name: Github.com Repo
#   url: https://api.github.com/
#   credential-url: https://api.github.com/anaconda-test-org
#   organization: anaconda-test-org
#   repository: {owner}-{id}
#   username: somegituser
#   auth-token: 98bcf2261707794b4a56f24e23fd6ed771d6c742
#   http-timeout: 60
#   disable-tls-verification: false
#   create-args: {}

# As discussed in the documentation, you may use the same
# persistent volume for both storage resources. If so, make
# sure to use the same pvc: value in both locations.
storage:
  create: false
  pvc: "anaconda-storage"
persistence:
  pvc: "anaconda-storage"

# TOLERATIONS / AFFINITY
# Please work with the Anaconda team for assistance
# to configure these settings if you need them.

tolerations:
  # For all pods
  global: []
  # For system pods, except the ingress
  system: []
  # For the ingress daemonset alone
  ingress: []
  # For user pods
  user: []

affinity:
  # For all pods
  global: {}
  # For system pods, except the ingress
  system: {}
  # For the ingress daemonset alone
  ingress: {}
  # For user pods
  user: {}

# By default, all ops services are disabled for multi-tenant BYOK installations.
# Consult the documentation for details on how to enable and configure each service.

opsDashboard:
  enabled: false
opsMetrics:
  enabled: false
opsGrafana:
  enabled: false
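Before installing, it can help to sanity-check a filled-in override file against the rules stated in the template comments above. The sketch below is a hypothetical helper (`check_values` is not part of any Anaconda tooling). It assumes the YAML has already been parsed into a Python dictionary, and it checks only three rules quoted from the comments: the trailing slash on `global.image.server`, the required `global.runAsUser`, and a non-empty `ingress.className` when `ingress.install` is true.

```python
from typing import Any


def check_values(values: dict[str, Any]) -> list[str]:
    """Return problems found in a parsed values.yaml override file.

    Only rules stated in the template comments are checked; this is an
    illustrative sketch, not an official validator for the Workbench chart.
    """
    problems: list[str] = []
    glb = values.get("global") or {}

    # global.image.server: "Trailing slash required if not empty"
    server = (glb.get("image") or {}).get("server") or ""
    if server and not server.endswith("/"):
        problems.append("global.image.server must end with '/' when it is set")

    # global.runAsUser: "The UID under which to run the containers (required)"
    if glb.get("runAsUser") is None:
        problems.append("global.runAsUser is required")

    # ingress.className: "Cannot be empty if ingress.install is true"
    ingress = values.get("ingress") or {}
    if ingress.get("install") and not ingress.get("className"):
        problems.append("ingress.className must be set when ingress.install is true")

    return problems
```

To run it against a real file, you could first parse the YAML, for example with PyYAML's `yaml.safe_load` (a third-party package, assumed to be installed), and then pass the resulting dictionary to `check_values`. An empty list means none of these particular rules were violated. It does not mean the file is otherwise valid.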

Pre-installation checklist

Anaconda has created this pre-installation checklist to help you verify that you have properly prepared your environment prior to installation.

Within this checklist, Anaconda provides some commands or command templates for you to run in order to verify a given requirement, along with a typical output from the command to give you an idea of the kind of information you should see. Run each of these commands (modified as appropriate for your environment) and copy the outputs into a document. Send this document to your Anaconda implementation team so that they can verify your environment is ready before you begin the installation process.

BYOK8s pre-installation checklist

Verify that your administration server has been provisioned with appropriate versions of kubectl, helm, and other tools needed to perform installation and administration tasks by running the following command:

helm version

Here is an example response from the command:

version.BuildInfo{Version:"v3.7.1", GitCommit:"1d11fcb5d3f3bf00dbe6fe31b8412839a96b3dc4", GitTreeState:"clean", GoVersion:"go1.16.9"}

Verify that the API version of the Kubernetes cluster is between 1.15 and 1.28 by running the following command:

kubectl version

Here is an example response from the command:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:17:59Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
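If you want to check the server version programmatically rather than by eye, the following sketch parses the text printed by `kubectl version` and tests it against the 1.15 to 1.28 range stated on this page. It assumes the output is captured as a string. It handles both the older struct-style output shown above and the shorter `Server Version: v1.27.4` form printed by newer kubectl releases. The function names are illustrative only.

```python
import re

# Supported Kubernetes API range stated on this page (inclusive).
MIN_VERSION = (1, 15)
MAX_VERSION = (1, 28)


def server_version(kubectl_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'Server Version' line of `kubectl version`."""
    for line in kubectl_output.splitlines():
        if line.startswith("Server Version:"):
            # Matches GitVersion:"v1.15.12" as well as "v1.28.3-eks-4f4795d"
            match = re.search(r"v(\d+)\.(\d+)", line)
            if match:
                return int(match.group(1)), int(match.group(2))
    raise ValueError("no 'Server Version' line found in kubectl output")


def is_supported(kubectl_output: str) -> bool:
    """True if the server's major.minor version lies in the supported range."""
    return MIN_VERSION <= server_version(kubectl_output) <= MAX_VERSION
```

The check uses the server version rather than the client version, because the requirement applies to the cluster's API. Vendor suffixes such as `-eks-...` or `-gke...` are ignored.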

Verify that all nodes on which Workbench will be installed have sufficient CPU and memory allocations by running the following command:

kubectl get nodes -o=jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.status.capacity.cpu}{'\t'}{.status.capacity.memory}{'\n'}{end}"

Here is an example response from the command:

10.234.2.18 16  65806876Ki
10.234.2.19 16  65806876Ki
10.234.2.20 16  65806876Ki
10.234.2.21 16  65806876Ki
10.234.2.6  16  65974812Ki
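The memory column is reported in Kubernetes binary quantities (for example, `65806876Ki` is roughly 62.8 GiB). The sketch below converts each line of the node-capacity output into cores and GiB, then lists nodes that fall below a pair of minimums. The minimums are hypothetical parameters, not values from this page, so substitute figures based on your own sizing. The parser assumes that CPU is a plain core count or millicores (`m`) and that memory uses a binary suffix (`Ki`, `Mi`, `Gi`, `Ti`) or plain bytes.

```python
# Binary suffixes Kubernetes uses for memory quantities.
BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '65806876Ki' to bytes (binary suffixes or plain bytes only)."""
    suffix = quantity[-2:]
    if suffix in BINARY_UNITS:
        return int(quantity[:-2]) * BINARY_UNITS[suffix]
    return int(quantity)


def to_cpus(quantity: str) -> float:
    """Convert a CPU quantity such as '16' or '15500m' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_node_capacity(output: str) -> list[tuple[str, float, float]]:
    """Parse 'name cpu memory' lines into (name, cpus, memory_gib) tuples."""
    nodes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, cpu, memory = line.split()
        nodes.append((name, to_cpus(cpu), to_bytes(memory) / 2**30))
    return nodes


def below_minimum(nodes, min_cpus: float, min_memory_gib: float) -> list[str]:
    """Names of nodes with less CPU or memory than the given (hypothetical) minimums."""
    return [name for name, cpus, mem in nodes if cpus < min_cpus or mem < min_memory_gib]
```

Note that this reports capacity, not allocatable resources. The system daemons on each node consume part of the capacity, so leave some headroom when you compare these numbers with your workload estimates.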

Verify that the namespace where Workbench will be installed has been created by running the following command:

# Replace <NAMESPACE> with the namespace you've reserved for Workbench
kubectl describe namespace <NAMESPACE>

Here is an example response from the command:

Name:         default
Labels:       <none>
Annotations:  <none>
Status:       Active
No resource quota.
No resource limits.

Verify the service account that will be used by Workbench during installation and operation has been created by running the following command:

# Replace <SERVICE_ACCOUNT> with the name of the service account you've created for Workbench
# Replace <NAMESPACE> with the namespace you've reserved for Workbench
kubectl describe sa <SERVICE_ACCOUNT> -n <NAMESPACE>

Here is an example response from the command:

Name:                anaconda-enterprise
Namespace:           default
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   anaconda-enterprise-token-cdmnf
Tokens:              anaconda-enterprise-token-cdmnf
Events:              <none>

(OpenShift only) Verify that the Security Context Constraint (SCC) associated with the service account contains all of the necessary permissions by running the following command:

# Replace <SCC_NAME> with the name of the SCC assigned to the Workbench service account
oc describe scc <SCC_NAME>

Here is an example response from the command:

Name:                       anyuid
Priority:                   10
Access:
  Users:                    <none>
  Groups:                   system:cluster-admins

Note

This example uses the anyuid SCC. However, the restricted SCC can also be used, as long as the UID range is known.

Verify the ClusterRole resource associated with the service account has the necessary permissions to facilitate installation and operation by running the following command:

# Replace <SERVICE_ACCOUNT> with the name of the service account you've created for Workbench
kubectl describe clusterrole <SERVICE_ACCOUNT>-ingress

Here is an example response from the command:

Name:         anaconda-enterprise-ingress
Labels:       app.kubernetes.io/managed-by=Helm
              skaffold.dev/run-id=8d38b94a-ab82-49d7-a6fd-0bc0fb549d1c
Annotations:  meta.helm.sh/release-name: anaconda-enterprise
              meta.helm.sh/release-namespace: default
PolicyRule:
  Resources  Non-Resource URLs  Resource Names  Verbs
  ---------  -----------------  --------------  -----
  *.*        []                 []              [*]
             [*]                []              [*]

Note

The above example is fully permissive. See the RBAC template for more realistic configurations.

A numeric UID has been reserved for running Workbench containers.

Note

Include the UID in your checklist results.

Verify that GID 0 is permitted by the security context.

Verify that any tolerations and/or node labels required to permit Workbench to run on its assigned nodes have been identified by running the following commands:

# List the taint keys on each node (pods scheduled there need matching tolerations)
kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
# List the labels on each node
kubectl get nodes --show-labels
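On larger clusters, it can be easier to reduce the taint-listing output to only the nodes that actually carry taints. The following sketch assumes the output of the JSONPath command above is captured as a string. Each line is a node name, a tab, and that node's taint keys separated by spaces (kubectl prints multiple JSONPath matches separated by spaces). The helper names are illustrative only.

```python
def parse_taints(output: str) -> dict[str, list[str]]:
    """Map each node name to the taint keys listed after its tab separator."""
    taints: dict[str, list[str]] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, keys = line.partition("\t")
        # Multiple JSONPath matches are printed separated by spaces
        taints[name] = keys.split()
    return taints


def tainted_nodes(output: str) -> dict[str, list[str]]:
    """Only the nodes that carry at least one taint."""
    return {name: keys for name, keys in parse_taints(output).items() if keys}
```

Each taint key reported here on a node intended for Workbench probably needs a matching entry under `tolerations` in your values.yaml file. Confirm the exact toleration settings with the Anaconda team, as the template comments suggest.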

Verify that a Persistent Volume Claim (PVC) has been created within the application namespace, referencing a statically provisioned Persistent Volume that meets the storage requirements for the anaconda-storage volume, by running the following command:

kubectl describe pvc anaconda-storage

Here is an example response from the command:

Name:          anaconda-storage
Namespace:     default
StorageClass:  anaconda-storage
Status:        Bound
Volume:        anaconda-storage
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      500Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    anaconda-enterprise-ap-git-storage-6658575d6f-vxj4s
               anaconda-enterprise-ap-object-storage-76bcfc4d44-ctlhp
               anaconda-enterprise-postgres-c76869799-cbqzq
Events:        <none>
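The most important fields in this output are `Status` (which should be `Bound`) and `Volume` (the Persistent Volume the claim is bound to). The sketch below assumes that the `kubectl describe pvc` output is captured as a string. It collects the top-level `Key: value` pairs, skips indented continuation lines, and reports anything unexpected. The function names are illustrative, and the parser is deliberately simple: nested sections are not interpreted.

```python
def describe_fields(output: str) -> dict[str, str]:
    """Collect the top-level 'Key: value' pairs from kubectl describe output."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        if not line or line[0].isspace():
            continue  # continuation lines (extra annotations, pods) are skipped
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_pvc(output: str, expected_name: str) -> list[str]:
    """Return problems found in `kubectl describe pvc <name>` output."""
    fields = describe_fields(output)
    problems = []
    if fields.get("Name") != expected_name:
        problems.append(f"name is {fields.get('Name')!r}, expected {expected_name!r}")
    status = fields.get("Status")
    if status != "Bound":
        problems.append(f"status is {status}, expected Bound")
    if not fields.get("Volume"):
        problems.append("claim is not bound to a Persistent Volume")
    return problems
```

A claim that is `Pending` usually means that no Persistent Volume matched its storage class, size, or access mode. That is worth resolving before you send the checklist.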

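Verify that a PVC has been created within the application namespace, referencing a statically provisioned Persistent Volume that meets the storage requirements for the anaconda-persistence volume, by running the following commands:

kubectl describe pvc anaconda-persistence
kubectl describe pv anaconda-persistence

The example response below comes from the second command. It describes the Persistent Volume (named anaconda-persistence in this example) that the claim is bound to: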

Name:            anaconda-persistence
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           default/anaconda-persistence
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        500Gi
Node Affinity:   <none>
Message:
Source:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    10.234.2.7
    Path:      /data/persistence
    ReadOnly:  false
Events:        <none>

The cluster is sized appropriately (CPU / Memory) for user workload, including consideration for “burst” workloads. For more information, see Understanding Workbench system requirements.

Resource profiles have been determined and defined in the values.yaml file.

A domain name for the Workbench application has been identified.

Note

Please include this domain name in your checklist output.

If you are supplying your own ingress controller, verify that it has already been installed and that its master IP address and ingressClassName value have been identified.

Note

Please include both the IP address and the ingress class name in your checklist output.

Verify the DNS records for both anaconda.example.com and *.anaconda.example.com have been created and point to the IP address of the ingress controller by running the following command:

ping test.anaconda.example.com

Here is an example response from the command:

PING test.anaconda.example.com (167.172.143.144): 56 data bytes

Note

If the ingress controller is to be installed with Workbench, this may not be possible. In such cases, it is sufficient to confirm that the networking team is prepared to establish these records immediately following installation.
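`ping` confirms only that a name resolves. The sketch below also checks that both the apex domain and a sample wildcard name (`test.<domain>`, mirroring the ping example above) resolve to the ingress controller's IP address. It uses the standard library's `socket.gethostbyname`, which returns a single IPv4 address, so it assumes a single A record per name. The resolver is injectable so the logic can be exercised without network access. `check_dns` is a hypothetical helper name.

```python
import socket
from typing import Callable


def check_dns(domain: str, ingress_ip: str,
              resolve: Callable[[str], str] = socket.gethostbyname) -> dict[str, bool]:
    """Report whether the apex domain and a sample wildcard name resolve to ingress_ip."""
    results: dict[str, bool] = {}
    for host in (domain, f"test.{domain}"):
        try:
            results[host] = resolve(host) == ingress_ip
        except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
            results[host] = False
    return results
```

For example, `check_dns("anaconda.example.com", "167.172.143.144")` performs real lookups from the machine that runs it. Run it from the administration server so that it sees the same DNS view as the rest of the installation.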

A wildcard SSL secret for anaconda.example.com and *.anaconda.example.com has been created. The public and private keys for the main certificate, as well as the full public certificate chain, are accessible from the administration server.

Note

Please share the public certificate chain in your checklist output.
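A common mistake is a certificate that covers only `*.anaconda.example.com`, because a wildcard does not match the bare apex domain. You can list the certificate's DNS names with `openssl x509 -in <CERT_FILE> -noout -text` (see the Subject Alternative Name section). The sketch below then checks those names using the usual wildcard rule: `*` stands in for exactly one left-most label. The helper name and the sample names are illustrative only.

```python
def name_covered(hostname: str, san_names: list[str]) -> bool:
    """Return True if hostname matches any Subject Alternative Name DNS entry.

    A wildcard covers exactly one left-most label: "*.example.com" matches
    "a.example.com" but not "example.com" or "a.b.example.com".
    """
    host = hostname.lower().rstrip(".")
    for name in san_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".anaconda.example.com"
            prefix = host[: -len(suffix)] if host.endswith(suffix) else ""
            if prefix and "." not in prefix:
                return True
    return False
```

A certificate suitable for this checklist should satisfy both `name_covered("anaconda.example.com", sans)` and `name_covered("test.anaconda.example.com", sans)`. In practice, this means listing both the apex name and the wildcard name as SAN entries.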

If the SSL secret was created using a private CA, verify the public root certificate has been obtained.

If you are using a private Docker registry, verify the full set of Docker images have been transferred to this registry.

If a pull secret is required to access the Docker images (whether from the standard Workbench Docker channel or the private registry), verify that the secret has been created in the application namespace by running the following command:

# Replace <NAMESPACE> with the namespace you've reserved for Workbench
# Replace <PULL_SECRET_NAME> with the name for your pull secret
kubectl get secret -n <NAMESPACE> <PULL_SECRET_NAME>
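The existence of a secret does not confirm that it holds credentials for the right registry. Assuming the secret has the type `kubernetes.io/dockerconfigjson` (for example, one created with `kubectl create secret docker-registry`), you can print its encoded payload with `kubectl get secret -n <NAMESPACE> <PULL_SECRET_NAME> -o jsonpath='{.data.\.dockerconfigjson}'`. The sketch below decodes that value and lists the registries it contains. `covers_image_server` is a hypothetical helper. It assumes a private registry whose host is the first path component of `global.image.server`. It does not handle Docker Hub-style values such as the template default `aedev/`.

```python
import base64
import json


def registries_in_pull_secret(encoded: str) -> list[str]:
    """Registry hosts listed in a base64-encoded .dockerconfigjson value."""
    config = json.loads(base64.b64decode(encoded))
    return sorted(config.get("auths", {}))


def covers_image_server(registries: list[str], image_server: str) -> bool:
    """True if the registry host of global.image.server appears in the secret.

    Assumes image_server looks like 'registry.example.com/aedev/'; the host is
    the part before the first '/'.
    """
    host = image_server.split("/", 1)[0]
    return host in registries
```

The decoded payload also contains credentials, so avoid pasting it into your checklist document. The registry host names are enough.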