Configuring Livy server for Hadoop Spark access#
Review the Apache Livy requirements before you begin the configuration process. There are three main configuration settings you must update on your Apache Livy server to allow Data Science & AI Workbench users access to Hadoop/Spark clusters:
If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to allow Livy to access the services. Additionally, you can configure Livy as a secure endpoint. For more information, see Configuring Livy to use HTTPS below.
Configuring Livy impersonation#
To enable users to run Spark sessions within Workbench, they need to be able to log in to each machine in the Spark cluster. The easiest way to accomplish this is to configure Livy impersonation as follows:
1. Add `hadoop.proxyuser.livy` to your authenticated hosts, users, or groups.
2. Check the option to **Allow Livy to impersonate users** and set the value to all (`*`), or a list of specific users or groups.
If impersonation is not enabled, the user executing the livy-server (`livy`) must exist on every machine. You can add this user to each machine by running the following command on each node:

```shell
sudo useradd -m livy
```
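On clusters managed outside a management console, the impersonation settings above map to `core-site.xml` properties. A minimal sketch, assuming you want to allow the `livy` user to proxy requests from any host on behalf of any group (the `*` values are examples; restrict them for production):

```xml
<!-- core-site.xml: allow the livy user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
```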
Note

If you have any problems configuring Livy, try setting the log level to `DEBUG` in the `conf/log4j.properties` file.
Configuring cluster access#
Livy server enables users to submit jobs from any remote machine or analytics cluster—even where a Spark client is not available—without requiring you to install Jupyter and Anaconda directly on an edge node in the Spark cluster.
To configure Livy server, put the following environment variables into a user's `.bashrc` file, or the `conf/livy-env.sh` file that's used to configure the Livy server.
These values are accurate for a Cloudera install of Spark with Java version 1.8:

```shell
SPARK_HOME=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p0.15945976/lib/spark
LIVY_LOG_DIR=/var/log/livy2
LIVY_PID_DIR=/var/run/livy2
JAVA_HOME=/usr/java/jdk1.8.0_232-cloudera/
SPARK_CONF_DIR=/etc/spark/conf
HADOOP_HOME=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p0.15945976/lib/hadoop
HADOOP_CONF_DIR=/etc/hadoop/conf
```
Note that the port defined as `livy.server.port` in `conf/livy.conf` is the same port that will generally appear in the Sparkmagic user configuration.
The minimum required parameter is `livy.spark.master`. Possible values include the following:

- `local[*]`: for testing purposes
- `yarn-cluster`: for use with the YARN resource allocation system
- a full Spark URI like `spark://masterhost:7077`: if the Spark scheduler is on a different host
Example with YARN:
livy.spark.master = yarn-cluster
The YARN deployment mode is set to `cluster` for Livy. The `livy.conf` file, typically located in `$LIVY_HOME/conf/livy.conf`, may include settings similar to the following:
```
# What host address to start the server on. By default, Livy will bind to all network interfaces.
livy.server.host = 0.0.0.0

# What port to start the server on.
livy.server.port = 8998

# What spark master Livy sessions should use.
livy.spark.master = yarn

# What spark deploy mode Livy sessions should use.
livy.spark.deploy-mode = cluster
```
Restart Livy server once configuration is complete.
Anaconda recommends using a process control mechanism to restart your Livy server to ensure that it’s reliably restarted in the event of a failure.
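For example, a minimal systemd unit sketch that restarts Livy on failure (the install path, the `livy` user, and the unit name are all assumptions; adjust them for your environment):

```ini
# /etc/systemd/system/livy.service (hypothetical path)
[Unit]
Description=Apache Livy server
After=network.target

[Service]
User=livy
# Run the server in the foreground so systemd can supervise it
ExecStart=/opt/livy/bin/livy-server
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```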
Note
The above example is to be used as a template only. Anaconda cannot assist with configuring Apache Livy for your organization.
Using Livy with Kerberos authentication#
If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to do the following to allow Livy to access the services:
1. Generate two keytabs for Apache Livy using `kadmin.local`.

   Caution

   The keytab principals for Livy must match the hostname that the Livy server is deployed on, or you'll see the following exception: `GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentials)`.

   These are hostname and domain dependent, so edit the following example according to your Kerberos settings:

   ```shell
   $ sudo kadmin.local
   kadmin.local: addprinc livy/<HOSTNAME>
   kadmin.local: xst -k livy-<HOSTNAME>.keytab livy/<HOSTNAME>@<REALM>
   ...
   kadmin.local: addprinc HTTP/<HOSTNAME>
   kadmin.local: xst -k HTTP-<HOSTNAME>.keytab HTTP/<HOSTNAME>@<REALM>
   ...
   ```

   This will generate two files: `livy-<HOSTNAME>.keytab` and `HTTP-<HOSTNAME>.keytab`.

2. Change the permissions of these two files so they can be read by `livy-server`.

3. Enable Kerberos authentication and reference these two keytab files in the `conf/livy.conf` configuration file, as shown:

   ```
   # Kerberos settings
   # Authentication support for Livy server
   # Livy has built-in SPNEGO authentication support for HTTP requests with the configurations below.
   livy.server.auth.type = kerberos
   livy.server.auth.kerberos.principal = HTTP/<HOSTNAME>@<REALM>
   livy.server.auth.kerberos.keytab = <FILEPATH>/<KEYTAB>
   livy.server.auth.kerberos.name-rules = DEFAULT
   livy.server.launch.kerberos.principal = livy/<HOSTNAME>@<REALM>
   livy.server.launch.kerberos.keytab = <FILEPATH>/<KEYTAB>
   ```
Note
The hostname and domain are not the same—verify that they match your Kerberos configuration.
Note
The above example is to be used as a template only. Anaconda cannot assist with configuring Kerberos for your organization.
Configuring Livy to use HTTPS#
If you want to use Sparkmagic to communicate with Livy via HTTPS, you need to do the following to configure Livy as a secure endpoint:

1. Generate a keystore file, certificate, and truststore file for the Livy server, or use a third-party SSL certificate.
2. Update Livy with the keystore details.
3. Update your Sparkmagic configuration.
4. Restart the Livy server.
If you’re using a self-signed certificate#
Generate a keystore file for the Livy server using the following command:

```shell
keytool -genkey -alias <host> -keyalg RSA -keysize 1024 -dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us -keypass <keyPassword> -keystore <keystore_file> -storepass <storePassword>
```
Create a certificate:

```shell
keytool -export -alias <host> -keystore <keystore_file> -rfc -file <cert_file> -storepass <StorePassword>
```
Create a truststore file:

```shell
keytool -import -noprompt -alias <host> -file <cert_file> -keystore <truststore_file> -storepass <truststorePassword>
```
Update `livy.conf` with the keystore details. For example:

```
livy.keystore = <FILEPATH>/keystore.jks
livy.keystore.password = anaconda
livy.key-password = anaconda
```
Update `~/.sparkmagic/config.json`. For example:

```
"kernel_python_credentials" : {
  "username": "",
  "password": "",
  "url": "https://<IP>:8998",
  "auth": "None"
},
"ignore_ssl_errors": true,
```
Note

In this example, `ignore_ssl_errors` is set to `true` because this configuration uses self-signed certificates. Your production cluster setup may be different.

Caution

If you misconfigure a `.json` file, all Sparkmagic kernels will fail to launch. You can test your Sparkmagic configuration by running the following command in an interactive shell:

```shell
python -m json.tool config.json
```

If you have formatted the JSON correctly, this command will run without error. Additional edits may be required, depending on your Livy settings.
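That check can also be scripted. A small sketch (the helper name is ours, not part of Sparkmagic) that parses the file and exits with a readable message pointing at the offending line when the JSON is malformed:

```python
import json


def validate_sparkmagic_config(path):
    """Parse a Sparkmagic config file, exiting with a readable message
    that points at the offending line if the JSON is malformed."""
    with open(path) as f:
        try:
            return json.load(f)
        except json.JSONDecodeError as e:
            raise SystemExit(
                f"{path}: invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}"
            )
```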
Restart the Livy server.

The Livy server should now be accessible over HTTPS. For example, `https://<livy host>:<livy port>`.

To test your SSL-enabled Livy server, run the following Python code in an interactive shell to create a session:

```python
import json

import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

livy_url = "https://<livy host>:<livy port>/sessions"
data = {'kind': 'spark', 'numExecutors': 1}
headers = {'Content-Type': 'application/json'}
r = requests.post(livy_url, data=json.dumps(data), headers=headers,
                  auth=HTTPKerberosAuth(mutual_authentication=REQUIRED,
                                        sanitize_mutual_error_response=False),
                  verify=False)
r.json()
```
Run the following Python code to verify the status of the session:

```python
session_url = "https://<livy host>:<livy port>/sessions/0"
headers = {'Content-Type': 'application/json'}
r = requests.get(session_url, headers=headers,
                 auth=HTTPKerberosAuth(mutual_authentication=REQUIRED,
                                       sanitize_mutual_error_response=False),
                 verify=False)
r.json()
```
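The session JSON includes a `state` field, which starts at `starting` and moves to `idle` once the session is ready. A small polling helper, sketched here with the fetch function injected so it can be exercised without a live server (this helper is our illustration, not part of Livy or Sparkmagic):

```python
import time


def wait_for_session(fetch_state, timeout=60, interval=2):
    """Poll fetch_state() -- a callable returning the Livy session's
    'state' string -- until it leaves 'starting' or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_state()
        if state != "starting":
            return state
        time.sleep(interval)
    raise TimeoutError("Livy session did not become ready in time")
```

Against a live server, `fetch_state` could be a small lambda that issues the GET request above and returns `r.json()['state']`.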
Then submit the following statement:

```python
session_url = "https://<livy host>:<livy port>/sessions/0/statements"
data = {"code": "sc.parallelize(1 to 10).count()"}
headers = {'Content-Type': 'application/json'}
r = requests.post(session_url, data=json.dumps(data), headers=headers,
                  auth=HTTPKerberosAuth(mutual_authentication=REQUIRED,
                                        sanitize_mutual_error_response=False),
                  verify=False)
r.json()
```
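Statement execution is asynchronous: the response carries a `state`, and the result text appears under `output.data['text/plain']` once the state is `available`. A small helper to unpack that response dict (a sketch; the function name is ours):

```python
def statement_output(statement_json):
    """Return the plain-text result of a finished Livy statement,
    None if it is still running, or raise if the statement failed."""
    if statement_json.get("state") != "available":
        return None
    output = statement_json.get("output", {})
    if output.get("status") == "ok":
        return output.get("data", {}).get("text/plain")
    raise RuntimeError(output.get("evalue", "statement failed"))
```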
If you’re using a third-party certificate#
Note
Ensure that Java JDK is installed on the Livy server.
Create the `keystore.p12` file using the following command:

```shell
openssl pkcs12 -export -in [path to certificate] -inkey [path to private key] -certfile [path to certificate] -out keystore.p12
```
Use the following command to create the `keystore.jks` file:

```shell
keytool -importkeystore -srckeystore keystore.p12 -srcstoretype pkcs12 -destkeystore keystore.jks -deststoretype JKS
```
If you don't already have the `rootca.crt`, you can run the following command to extract it from your Workbench installation:

```shell
kubectl get secrets anaconda-enterprise-certs -o jsonpath="{.data['rootca\.crt']}" | base64 -d > /ext/share/rootca.crt
```
Add the `rootca.crt` to the `keystore.jks` file:

```shell
keytool -importcert -keystore keystore.jks -storepass <password> -alias rootCA -file rootca.crt
```
Add the `keystore.jks` file to the `livy.conf` file. For example:

```
livy.keystore = <FILEPATH>/keystore.jks
livy.keystore.password = anaconda
livy.key-password = anaconda
```
Restart the Livy server.
Run the following command to verify that you can connect to the Livy server (using your actual host and port):
```shell
openssl s_client -connect anaconda.example.com:8998 -CAfile rootca.crt
```

If the output includes `Verify return code: 0 (ok)`, you've successfully configured Livy to use HTTPS.
To add the trusted root certificate to the Workbench server#
1. Install the `ca-certificates` package:

   ```shell
   yum install ca-certificates
   ```

2. Enable dynamic CA configuration:

   ```shell
   update-ca-trust force-enable
   ```

3. Add your `rootca.crt` as a new file:

   ```shell
   cp rootca.crt /etc/pki/ca-trust/source/anchors
   ```

4. Update the certificate authority trust:

   ```shell
   update-ca-trust extract
   ```
To connect to Livy within a session#
Open the project and run the following commands in an interactive shell:

```python
import os
os.environ['REQUESTS_CA_BUNDLE'] = '/path/to/root.ca'
```
You can also edit the `anaconda-project.yml` file for the project and set the environment variable there. See Hadoop / Spark for more information.
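A sketch of the corresponding `anaconda-project.yml` entry, following the same `variables:` format used elsewhere in project configuration (the path is a placeholder for your actual CA bundle location):

```yaml
variables:
  REQUESTS_CA_BUNDLE:
    description: Root CA bundle used to verify the Livy server certificate
    default: /path/to/root.ca
```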
Configuring project access#
After you’ve configured Livy for cluster access, you must configure your project to connect to the remote Hadoop Spark cluster.
1. Log in to Workbench as an admin user.

2. Open the project you want to connect to the remote cluster.

3. If the Hadoop installation used Kerberos authentication, place the `krb5.conf` file in the `/tools` mounted directory.

4. If you're using Sparkmagic, include your `config.json` file in the `/tools` mounted directory.

5. Open the project's `anaconda-project.yml` file.

6. Find the `variables:` section of the file and add the paths to the configuration files you've just placed in the `/tools` directory. For example:

   ```yaml
   variables:
     KRB5_CONFIG:
       description: Location of config file for Kerberos authentication
       default: /tools/krb5.conf
     SPARKMAGIC_CONF_DIR:
       description: Location of Sparkmagic configuration file
       default: /tools
     SPARKMAGIC_CONF_FILE:
       description: Name of Sparkmagic configuration file
       default: config.json
   ```
Note
If you want users to use this project’s environment as their main method for accessing the Hadoop Spark cluster, consider making it a template. For more information, see Providing your environment to users.