Connect to a Spark Cluster

Anaconda Enterprise contains numerous example projects, including a Spark/Hadoop project. This project includes Sparkmagic, so that you can connect to a Spark cluster with a running Livy server.

You can use Spark with Anaconda Enterprise in two ways:

  • Starting a notebook with one of the Spark kernels, in which case all code will be executed on the cluster and not locally. Note that a connection and all cluster resources will be assigned as soon as you execute any ordinary code cell, that is, any cell not marked as %%local.
  • Starting a normal notebook with a Python kernel, and running %load_ext sparkmagic.magics. That command enables a set of magics for running code on the cluster; see the sketch after this list and the Sparkmagic project's examples for details.
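
As a minimal sketch of the second approach, the notebook might contain the three separate cells below. The cell boundaries and the sc.parallelize call are illustrative; each magic shown is a standard Sparkmagic command, and the Livy session is created with the %manage_spark widget described later in this document.

# Cell 1 -- enable the Sparkmagic magics in a notebook running a local Python kernel
%load_ext sparkmagic.magics

# Cell 2 -- open the session-management widget and create a Livy session
%manage_spark

# Cell 3 -- %%spark must be the first line of its own cell; this code runs on the cluster
%%spark
sc.parallelize(range(1000)).count()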

To display graphical output directly from the cluster, you must use SQL commands. This is also the only way to pass results back to your local Python kernel, so that you can manipulate them further with pandas or other packages.
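
For example, in a notebook running one of the Spark kernels, a pair of cells along these lines pulls a query result back as a pandas DataFrame; the table name people, the column names, and the plotting step are illustrative assumptions:

%%sql -o people_by_age
SELECT age, COUNT(*) AS n FROM people GROUP BY age

%%local
# people_by_age is now a pandas DataFrame in the local Python kernel
people_by_age.plot(x="age", y="n", kind="bar")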

In the common case, the configuration provided for you in the Session will be correct and will not require modification. However, you may need to use sandbox or ad hoc environments that require the modifications described below.

Supported versions

The following combinations of tools are supported:

  • Python 2 and Python 3, Apache Livy 0.5, Apache Spark 2.1, Oracle Java 1.8
  • Python 2, Apache Livy 0.5, Apache Spark 1.6, Oracle Java 1.8

Overriding session settings

Certain jobs may require more cores or memory, or custom environment variables such as Python worker settings. The configuration passed to Livy is generally defined in the file ~/.sparkmagic/conf.json. You may inspect this file, particularly the section "session_configs", or you may refer to the example file in the spark directory, sparkmagic_conf.example.json. Note that the example file has not been tailored to your specific cluster.
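
For reference, the "session_configs" object is passed to Livy when a session is created. A minimal sketch might look like the following, where the memory and core values are arbitrary placeholders rather than recommendations for your cluster:

{
  "session_configs": {
    "driverMemory": "2G",
    "executorMemory": "4G",
    "executorCores": 4
  }
}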

In a Sparkmagic kernel such as PySpark or SparkR, you can change the configuration with the %%configure magic. The body of the cell is pure JSON, and the values are passed directly to the driver application.

Example:

%%configure -f
{"executorMemory": "4G", "executorCores":4}

If you are using a Python kernel and have done %load_ext sparkmagic.magics, you can use the %manage_spark command to set configuration options. The session options are in the “Create Session” pane under “Properties”.

Overriding basic settings

In more experimental situations, such as when first configuring the platform for a cluster, you may want to change the Kerberos or Livy connection settings. This is typically done by an administrator with detailed knowledge of the cluster's security model.

In these cases, we recommend creating a krb5.conf file and a sparkmagic_conf.json file in the project directory so they will be saved along with the project itself. An example Sparkmagic configuration is included, sparkmagic_conf.example.json, listing the fields that are typically set. Of particular importance are the "url" and "auth" keys in each of the kernel sections.
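
As a rough sketch, the kernel sections of a project-level sparkmagic_conf.json might look like the following; the Livy URL is a placeholder, and the auth value depends on your cluster's security model:

{
  "kernel_python_credentials": {
    "url": "http://livy.example.com:8998",
    "auth": "Kerberos"
  },
  "kernel_scala_credentials": {
    "url": "http://livy.example.com:8998",
    "auth": "Kerberos"
  },
  "kernel_r_credentials": {
    "url": "http://livy.example.com:8998",
    "auth": "Kerberos"
  }
}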

The krb5.conf file is normally copied from the Hadoop cluster, rather than written manually, and may refer to additional configuration or certificate files. These files must all be uploaded using the interface.
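
For orientation only, a pared-down krb5.conf typically has the following shape; the realm and KDC hostnames here are placeholders, and the file copied from your cluster is authoritative:

[libdefaults]
  default_realm = EXAMPLE.COM
  dns_lookup_kdc = false

[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
    admin_server = kdc.example.com
  }

[domain_realm]
  .example.com = EXAMPLE.COM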

To use these alternate configuration files, set the default of the KRB5_CONFIG variable to the full path of krb5.conf, and set the values of SPARKMAGIC_CONF_DIR and SPARKMAGIC_CONF_FILE to point to the Sparkmagic configuration file. You can set these either in the Project pane on the left of the interface, or by directly editing the anaconda-project.yml file.

For example, the variables section of the finished anaconda-project.yml file may look like this:

variables:
  KRB5_CONFIG:
    description: Location of config file for Kerberos authentication
    default: /opt/continuum/project/krb5.conf
  SPARKMAGIC_CONF_DIR:
    description: Location of sparkmagic configuration file
    default: /opt/continuum/project
  SPARKMAGIC_CONF_FILE:
    description: Name of sparkmagic configuration file
    default: sparkmagic_conf.json

NOTE: You must perform these actions before running kinit or starting any notebook/kernel.

Kerberos

To authenticate and connect to a Kerberized Spark cluster, you need the appropriate configuration in place so that you can execute a kinit command in a terminal within the project session.
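
For example, where jdoe@EXAMPLE.COM is a placeholder principal:

# Obtain a Kerberos ticket-granting ticket for your principal
kinit jdoe@EXAMPLE.COM

# Verify that a ticket was issued
klist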

NOTE: For more information, see the authentication section in Kerberos Configuration.

Managing Kerberos and Sparkmagic Configuration

Both Kerberos and Sparkmagic configuration files are managed with Kubernetes secrets and can easily be created with the anaconda-enterprise-cli tool. This allows system administrators to populate Kerberos and Sparkmagic configurations for all projects.

To create a Kubernetes secret containing both the Kerberos and Sparkmagic configuration, execute the following:

anaconda-enterprise-cli spark-config --config /etc/krb5.conf krb5.conf \
                  --config /opt/continuum/.sparkmagic/config.json config.json

This creates a YAML file, anaconda-config-files-secret.yaml, with the data converted for Kubernetes and AE5. Next, upload the file:

sudo kubectl replace -f anaconda-config-files-secret.yaml

When new projects are created, /etc/krb5.conf and ~/.sparkmagic/conf.json are populated with the appropriate data.

The Sparkmagic configuration file, commonly config.json, must set the auth field in the kernel_python_credentials section:

{
  "kernel_python_credentials" : {
    "url": "http://<LIVY_SERVER>:8998",
    "auth": "Kerberos"
  }
}

Finally, in the same config.json file, set the home_path in the logging_config handlers section to ~/.sparkmagic-logs.

Example:

"logging_config": {
  "handlers": {
    "magicsHandler": {
      "class": "hdijupyterutils.filehandler.MagicsFileHandler",
      "formatter": "magicsFormatter",
      "home_path": "~/.sparkmagic-logs"
    }
  }
}

Using Custom Anaconda Parcels and Management Packs

Anaconda Enterprise can generate custom Anaconda parcels for Cloudera CDH and custom Anaconda management packs for Hortonworks HDP. This allows administrators to distribute customized versions of Anaconda across a Hadoop/Spark cluster using Cloudera Manager for CDH or Apache Ambari for HDP.

Data scientists can then select a specific version of Anaconda and Python on a per-project basis by including the following configuration in the first cell of a Sparkmagic-based Jupyter notebook:

Example:

%%configure -f
{"conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/opt/anaconda/bin/python",
          "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": "/opt/anaconda/bin/python",
          "spark.yarn.executorEnv.PYSPARK_PYTHON": "/opt/anaconda/bin/python",
          "spark.pyspark.python": "/opt/anaconda/bin/python",
          "spark.pyspark.driver.python": "/opt/anaconda/bin/python"
         }
}

NOTE: Replace /opt/anaconda/ with the install prefix and location of the particular parcel or management pack installed on your cluster.
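
For example, the default Anaconda parcel for CDH typically installs under /opt/cloudera/parcels/Anaconda, in which case the first cell would look like the following; verify the actual prefix on your cluster before using it:

%%configure -f
{"conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
          "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
          "spark.yarn.executorEnv.PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
          "spark.pyspark.python": "/opt/cloudera/parcels/Anaconda/bin/python",
          "spark.pyspark.driver.python": "/opt/cloudera/parcels/Anaconda/bin/python"
         }
}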
