System metrics

Data Science & AI Workbench uses Prometheus to monitor the operation and performance of the application. Prometheus exposes system metrics such as CPU usage, memory consumption, and network traffic to help you understand and maintain the overall health of your infrastructure. Reviewing these metrics regularly helps you establish a baseline for normal operation, identify potential issues early, and determine the root cause of active problems.

Prometheus comes with a built-in alert manager that you can configure to notify you when certain conditions are met or when a specified threshold is exceeded. Both Prometheus and the alert manager are installed during the initial installation of Workbench, using the default settings in the Helm values.yaml file. You can update these settings at any time.

To modify the default configuration and enable additional alerts, follow the steps in Setting platform configurations using the Helm chart.

Accessing system metrics

You can access the system metrics from the administrative console using the following steps.

  1. Open the My Account dropdown menu and select Admin Console.

  2. Open your Resource Monitoring consoles by clicking on the resources you want to view.

Configuring alerts

To configure additional custom alerts, define each alerting rule with a name (alert) and a PromQL expression (expr), plus an optional duration (for), labels, and annotations. Then place the rule under the opsMetrics.prometheus.server.alertingRules key in the Helm chart values.
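The dotted path opsMetrics.prometheus.server.alertingRules corresponds to nested keys in the Helm values.yaml file. The following is a minimal sketch of that nesting only. It assumes that rules are listed directly under alertingRules, which this page does not confirm, so check the default values.yaml for your Workbench version before copying the structure. The alert name is a hypothetical placeholder.

```yaml
# Sketch of the key nesting in the Helm values.yaml file.
opsMetrics:
  prometheus:
    server:
      alertingRules:
        # Assumption: custom rules are listed directly under this key.
        # Your chart may expect a different structure (for example, named rule groups).
        - alert: ExampleTargetDown   # hypothetical placeholder name
          expr: up == 0              # matches scrape targets that Prometheus cannot reach
          for: 5m
```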

Here is an example alert that you might implement in your system:

- alert: PodsBlockedInTerminatingState
  # Matches pods that have a deletion timestamp set and whose status reason is not NodeLost
  expr: count(kube_pod_deletion_timestamp) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod) > 0
  # The condition must hold for 5 minutes before the alert fires
  for: 5m
  labels:
    severity: critical
  annotations:
    # Use outer single quotes '' in annotations if they contain Prometheus labels
    # Wrap Prometheus labels in {{"{{}}"}} so that Helm passes the template through to Prometheus unchanged
    summary: 'Pod {{"{{$labels.namespace}}"}}/{{"{{$labels.pod}}"}} blocked in Terminating state.'
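A threshold-based alert follows the same format. The sketch below is illustrative rather than a shipped default: the alert name NodeMemoryLow, the 10% threshold, and the 10m duration are hypothetical values. It also assumes that your Prometheus instance scrapes node_exporter, which provides the node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes metrics.

```yaml
- alert: NodeMemoryLow   # hypothetical alert name
  # Percentage of memory still available on the node; matches when it drops below 10%
  # Assumption: node_exporter metrics are being scraped
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  # The condition must hold for 10 minutes before the alert fires
  for: 10m
  labels:
    severity: warning
  annotations:
    # Same quoting and Helm escaping conventions as the example above
    summary: 'Node {{"{{$labels.instance}}"}} has less than 10% memory available.'
```

Choose the threshold and duration from the baseline you observe in your own system metrics, so that the alert fires on real problems rather than on normal load.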

For detailed information about defining alerting rules, see the official Prometheus documentation.