Configuring Livy server for Hadoop Spark access

After installing Livy server, there are main 3 aspects you need to configure on Apache Livy server for Anaconda Enterprise users to be able to access Hadoop Spark within Anaconda Enterprise:

If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to allow Livy to access the services. Additionally, you can configure Livy as a secure endpoint. For more information, see Configuring Livy to use HTTPS below.


Configuring Livy impersonation

To enable users to run Spark sessions within Anaconda Enterprise, they need to be able to log in to each machine in the Spark cluster. The easiest way to accomplish this is to configure Livy impersonation as follows:

  1. Add Hadoop.proxyuser.livy to your authenticated hosts, users, or groups.

  2. Check the option to Allow Livy to impersonate users and set the value to all (*), or a list of specific users or groups.

If impersonation is not enabled, the user executing the livy-server (livy) must exist on every machine. You can add this user to each machine by running the following command on each node:

sudo useradd -m livy

Note

If you have any problems configuring Livy, try setting the log level to DEBUG in the conf/log4j.properties file.


Configuring cluster access

Livy server enables users to submit jobs from any remote machine or analytics cluster—even where a Spark client is not available—without requiring you to install Jupyter and Anaconda directly on an edge node in the Spark cluster.

To configure Livy server, put the following environment variables into a user’s .bashrc file, or the conf/livy-env.sh file that’s used to configure the Livy server.

These values are accurate for a Cloudera install of Spark with Java version 1.8:

export JAVA_HOME=/usr/java/jdk1.8.0_121-cloudera/jre/
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HADOOP_HOME=/etc/hadoop/
export HADOOP_CONF_DIR=/etc/hadoop/conf

Note that the port parameter that’s defined as livy.server.port in conf/livy-env.sh is the same port that will generally appear in the Sparkmagic user configuration.

The minimum required parameter is livy.spark.master. Other possible values include the following:

  • local[*]—for testing purposes

  • yarn-cluster—for using with the YARN resource allocation system

  • a full spark URI like spark://masterhost:7077—if the spark scheduler is on a different host.

Example with YARN:

livy.spark.master = yarn-cluster

The YARN deployment mode is set to cluster for Livy. The livy.conf file, typically located in $LIVY_HOME/conf/livy.conf, may include settings similar to the following:

livy.server.port = 8998
# What spark master Livy sessions should use: yarn or yarn-cluster
livy.spark.master = yarn
# What spark deploy mode Livy sessions should use: client or cluster
livy.spark.deployMode = cluster

# Kerberos settings

livy.server.auth.type = kerberos
livy.impersonation.enabled = true

# livy.server.launch.kerberos.principal = livy/$HOSTNAME@ANACONDA.COM
# livy.server.launch.kerberos.keytab = /etc/security/livy.keytab
# livy.server.auth.kerberos.principal = HTTP/$HOSTNAME@ANACONDA.COM
# livy.server.auth.kerberos.keytab = /etc/security/httplivy.keytab

# livy.server.access_control.enabled = true
# livy.server.access_control.users = livy,hdfs,zeppelin
# livy.superusers = livy,hdfs,zeppelin

After configuring Livy server, you’ll need to restart it:

./bin/anaconda-livy-server stop
./bin/anaconda-livy-server start

Consider using a process control mechanism to restart Livy server, to ensure that it’s reliably restarted in the event of a failure.


Using Livy with Kerberos authentication

If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to do the following to allow Livy to access the services:

  1. Generate 2 keytabs for Apache Livy using kadmin.local.

IMPORTANT: The keytab principals for Livy must match the hostname that the Livy server is deployed on, or you’ll see the following exception: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentials).

These are hostname and domain dependent, so edit the following example according to your Kerberos settings:

$ sudo kadmin.local

kadmin.local:  addprinc livy/ip-172-31-3-131.ec2.internal
WARNING: no policy specified for livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM; defaulting to no policy
Enter password for principal "livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
Re-enter password for principal "livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
kadmin.local:  xst -k livy-ip-172-31-3-131.ec2.internal.keytab livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM
...

kadmin.local:  addprinc HTTP/ip-172-31-3-131.ec2.internal
WARNING: no policy specified for HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM; defaulting to no policy
Enter password for principal "HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
Re-enter password for principal "HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
kadmin.local:  xst -k HTTP-ip-172-31-3-131.ec2.internal.keytab HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM
...

This will generate two files: livy-ip-172-31-3-131.ec2.internal.keytab and HTTP-ip-172-31-3-131.ec2.internal.keytab.

  1. Change the permissions of these two files so they can be read by livy-server.

  2. Enable Kerberos authentication and reference these two keytab files in the conf/livy.conf configuration file, as shown:

    livy.server.auth.type = kerberos
    livy.impersonation.enabled = false  # see notes below
    
    # principals and keytabs to exactly match those generated before
    livy.server.launch.kerberos.principal = livy/ip-172-31-3-131@ANACONDA.COM
    livy.server.launch.kerberos.keytab = /home/centos/conf/livy-ip-172-31-3-131.keytab
    livy.server.auth.kerberos.principal = HTTP/ip-172-31-3-131@ANACONDA.COM
    livy.server.auth.kerberos.keytab = /home/centos/conf/HTTP-ip-172-31-3-131.keytab
    
    # this may not be required when delegating auth to kerberos
    livy.server.access-control.enabled = true
    livy.server.access-control.allowed-users = livy,zeppelin,testuser
    livy.superusers = livy,zeppelin,testuser
    

NOTES:

  • The hostname and domain are not the same—verify that they match your Kerberos configuration.

  • livy.server.access-control.enabled = true is only required if you’re going to also whitelist the allowed users with the livy.server.access-control.allowed-users <user> key.


Configuring project access

After you’ve installed Livy and configured cluster access, some additional configuration is required before Anaconda Enterprise users will be able to connect to a remote Hadoop Spark cluster from within their projects. For more information, see Connecting to the Hadoop Spark ecosystem.

  • If the Hadoop installation used Kerberos authentication, add the krb5.conf to the global configuration using the following command:

    anaconda-enterprise-cli spark-config --config /etc/krb5.conf krb5.conf
    
  • To use Sparkmagic, pass two flags to the previous command to configure a Sparkmagic configuration file:

    anaconda-enterprise-cli spark-config --config /etc/krb5.conf krb5.conf --config /opt/continuum/.sparkmagic/config.json config.json
    

This creates a yaml file—anaconda-config-files-secret.yaml—with the data converted for Anaconda Enterprise.

Use the following command to upload the yaml file to the server:

sudo kubectl replace -f anaconda-config-files-secret.yaml

To update the Anaconda Enterprise server with your changes, run the following command to identify the pod associated with the workspace services:

kubectl get pods

Restart the workspace services by running:

kubectl delete pod anaconda-enterprise-ap-workspace-<unique ID>

Now, whenever a new project is created, /etc/krb5.conf will be populated with the appropriate data.


Configuring Livy to use HTTPS

If you want to use Sparkmagic to communicate with Livy via HTTPS, you need to do the following to configure Livy as a secure endpoint:

  • Generate a keystore file, certificate, and truststore file for the Livy server—or use a third-party SSL certificate.

  • Update Livy with the keystore details.

  • Update your Sparkmagic configuration.

  • Restart the Livy server.


If you’re using a self-signed certificate:

  1. Generate a keystore file for Livy server using the following command:

    keytool -genkey -alias <host> -keyalg RSA -keysize 1024 –dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us –keypass <keyPassword> -keystore <keystore_file> -storepass <storePassword>
    
  2. Create a certificate:

    keytool -export -alias <host> -keystore <keystore_file> -rfc –file <cert_file> -storepass <StorePassword>
    
  3. Create a truststore file:

    keytool -import -noprompt -alias <host> -file <cert_file> -keystore <truststore_file> -storepass <truststorePassword>
    
  4. Update livy.conf with the keystore details. For example:

    livy.keystore = /home/centos/livy-0.5.0-incubating-bin/keystore.jks
    livy.keystore.password = anaconda
    livy.key-password = anaconda
    
  5. Update ~/.sparkmagic/config.json. For example:

     "kernel_python_credentials" : {
        "username": "",
        "password": "",
        "url": "https://35.172.121.109:8998",
        "auth": "None"
      },
    "ignore_ssl_errors": true,
    

Note

In this example, ignore_ssl_errors is set to true because this configuration uses self-signed certificates. Your production cluster setup may be different.

Caution

If you misconfigure a .json file, all Sparkmagic kernels will fail to launch. You can test your Sparkmagic configuration by running the following Python command in an interactive shell: python -m json.tool config.json.

If you have formatted the JSON correctly, this command will run without error. Additional edits may be required, depending on your Livy settings.

  1. Restart the Livy server.

The Livy server should now be accessible over https. For example, https://<livy host>:<livy port>.

To test your SSL-enabled Livy server, run the following Python code in an interactive shell to create a session:

livy_url = "https://<livy host>:<livy port>/sessions"
data = {'kind': 'spark', 'numExecutors': 1}
headers = {'Content-Type': 'application/json'}
r = requests.post(livy_url, data=json.dumps(data), headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED,     sanitize_mutual_error_response=False), verify=False)
r.json()

Run the following Python code to verify the status of the session:

session_url = "https://<livy host>:<livy port>/sessions/0"
headers = {'Content-Type': 'application/json'}
r = requests.get(session_url, headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED,  sanitize_mutual_error_response=False), verify=False)
r.json()

Then submit the following statement:

session_url = "https://<livy host>:<livy port>/sessions/0/statements"
data ={"code": "sc.parallelize(1 to 10).count()"}
headers = {'Content-Type': 'application/json'}
r = requests.get(session_url, headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False), verify=False)
r.json()

If you’re using a third-party certificate:

Note

Ensure that Java JDK is installed on the Livy server.

  1. Create the keystore.p12 file using the following command:

    openssl pkcs12 -export -in [path to certificate] -inkey [path to private key] -certfile [path to certificate ] -out keystore.p12
    
  2. Use the following command to create the keystore.jks file:

    keytool -importkeystore -srckeystore keystore.p12 -srcstoretype pkcs12 -destkeystore keystore.jks -deststoretype JKS
    
  3. If you don’t already have the rootca.crt, you can run the following command to extract it from your Anaconda Enterprise installation:

    kubectl get secrets anaconda-enterprise-certs -o jsonpath="{.data[`rootca\.crt`]}" | base64 -d > /ext/share/rootca.crt
    
  4. Add the rootca.crt to the keystore.jks file:

    keytool -importcert -keystore keystore.jks -storepass <password> -alias rootCA -file rootca.crt
    
  5. Add the keystore.jks file to the livy.conf file. For example:

    livy.keystore = /home/centos/livy-0.5.0-incubating-bin/keystore.jks
    livy.keystore.password = anaconda
    livy.key-password = anaconda
    
  6. Restart the Livy server.

  7. Run the following command to verify that you can connect to the Livy server (using your actual host and port):

    openssl s_client -connect anaconda.example.com:8998 -CAfile rootca.crt
    

    If running this command returns 0, you’ve successfully configured Livy to use HTTPS.


To add the trusted root certificate to the AE server, do the following:

  1. Install the ca-certificates package:

    yum install ca-certificates
    
  2. Enable dynamic CA configuration:

    update-ca-trust force-enable
    
  3. Add your rootca.crt as a new file:

    cp rootca.crt /etc/pki/ca-trust/source/anchors
    
  4. Update the certificate authority trust:

    update-ca-trust extract
    

To connect to Livy within a session, open the project and run the following command in an interactive shell:

import os
os.environ['REQUESTS_CA_BUNDLE'] = /path/to/root.ca

You can also edit the anaconda-project.yml file for the project and set the environment variable there. See Hadoop / Spark for more information.