Creating a Kerberized EMR cluster for use with AE 5

Objective

In this exercise, an AWS EMR cluster with Kerberos authentication is set up and configured for use with Anaconda Enterprise v5, in order to test Spark, Hive, and HDFS access.

Applicability

  • Amazon EMR
  • Anaconda Enterprise 5.1.x on AWS

Due to the way public and private addresses are handled in AWS, on-premises installs and installs on other cloud providers will require somewhat different configuration.

Implementation

The figure below illustrates the main components of this solution.

../../_images/emr-architecture.png

Consequences

  • This setup uses the cluster-dedicated (internal) KDC for Kerberos. It is also possible to configure cross-realm trust with Active Directory.
  • This setup requires local user accounts to be defined on all nodes in the cluster so that HDFS delegation tokens can be obtained.
  • The configuration described here is scoped per project. It can be extended site-wide.

Creating the AWS EMR Cluster

Set up AWS EMR security configuration

Go to the Amazon EMR Console: https://console.aws.amazon.com/elasticmapreduce

Confirm that your default home region is selected.

In the left hand pane select “Security configurations”.

Create a security configuration with these options:

  • Name: my-emr-seccfg
  • Encryption: Leave all of these blank.
  • Authentication: Select Kerberos, leave Ticket lifetime at 24 hours and leave Cross-realm trust blank.
  • IAM roles for EMRFS: Leave blank. The default roles will be used for the purposes of this exercise.
../../_images/emr-security-cfg.png

Click Create. Confirm that the new security configuration is shown in the list.
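
If you prefer to script this step, the same security configuration can be created with boto3. The sketch below is illustrative only: it assumes AWS credentials and a default region are already configured, and the JSON body mirrors what the EMR console generates for a cluster-dedicated KDC with the 24 hour ticket lifetime selected above (verify it against a console-created configuration):

import json
import boto3

# Illustrative sketch: create the same Kerberos-only security configuration
# programmatically. Assumes AWS credentials/region are already configured.
emr = boto3.client("emr")

security_configuration = {
    "AuthenticationConfiguration": {
        "KerberosConfiguration": {
            "Provider": "ClusterDedicatedKdc",
            "ClusterDedicatedKdcConfiguration": {
                "TicketLifetimeInHours": 24
            }
        }
    }
}

emr.create_security_configuration(
    Name="my-emr-seccfg",
    SecurityConfiguration=json.dumps(security_configuration),
)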

Create the AWS EMR cluster

Go to the Amazon EMR Console: https://console.aws.amazon.com/elasticmapreduce

Confirm that your default home region is selected.

In the left hand pane select “Clusters”.

Select “Create Cluster” and “Advanced Options”.

Software

Release: emr-5.15.0

Applications: Hadoop 2.8.3, Hive 2.3.3, Hue 4.2.0, Spark 2.3.0, Livy 0.4.0

Under “Edit software settings” insert this JSON snippet:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.proxyuser.yarn.groups": "*",
      "hadoop.proxyuser.yarn.hosts": "*",
      "hadoop.proxyuser.livy.groups": "*",
      "hadoop.proxyuser.livy.hosts": "*"
    }
  },
  {
    "Classification": "hadoop-kms-site",
    "Properties": {
      "hadoop.kms.proxyuser.livy.users": "*",
      "hadoop.kms.proxyuser.livy.hosts": "*",
      "hadoop.kms.proxyuser.livy.groups": "*"
    }
  }
]
../../_images/emr-software.png

Click Next.

Hardware

Instance group configuration: Leave as Uniform instance groups.

Network: Select your preferred VPC. Note which VPC you choose; the security group that allows AE5 access will be created in the same VPC later.

EC2 Subnet: Select a subnet within the selected VPC.

Root device EBS volume size: Leave at 10GB. This cluster is for test purposes only and its storage is in S3, so large volumes are not required.

Master configuration: Leave as the default. The instance type is calculated based on the applications selected.

Core configuration: Reduce to 1 instance. This cluster is primarily for connectivity testing.

../../_images/emr-hardware.png

Click Next.

General cluster settings

General Options: Set the cluster name to “my-emr-cluster”.

Leave all other options at their defaults.

../../_images/emr-options.png

Click Next.

Security

Security Options: Specify your EC2 key pair and leave the “Cluster visible to all IAM users in account” option enabled.

Permissions: Set to “Default” which automatically configures the following roles:

  • EMR role: EMR_DefaultRole
  • EC2 instance profile: EMR_EC2_DefaultRole
  • Auto Scaling role: EMR_AutoScaling_DefaultRole

Authentication and encryption: Enter the name of the security configuration defined in Set up AWS EMR security configuration above, which was “my-emr-seccfg”.

Realm: MYEMRREALM.ORG

KDC admin password: <kdcpassword>. This password is required to add end users to Kerberos in the next step.

Create a security group. This group will be used for inbound Livy access on port 8998 (the default Livy port).

Name tag: emr-livy-sg

Group name: emr-livy

Description: Inbound access for Livy from AE5

VPC: Same as in Hardware section above.

Create the following Inbound Rules:

  • Type: Custom TCP Rule
  • Protocol: TCP
  • Port Range: 8998
  • Source: This can be either an IP address range (CIDR block) or a security group. In this case the security group of the AE5 cluster was used.
  • Description: Inbound access for Livy from AE5

Save the security group.
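
Alternatively, the security group and its inbound rule can be created with boto3. A minimal sketch, in which the VPC ID and the AE5 cluster's security group ID are placeholders for your environment:

import boto3

# Minimal sketch: create the emr-livy security group and allow inbound
# Livy traffic (TCP 8998) from the AE5 cluster's security group.
VPC_ID = "vpc-xxxxxxxx"      # placeholder: VPC selected in the Hardware step
AE5_SG_ID = "sg-xxxxxxxx"    # placeholder: security group of the AE5 cluster

ec2 = boto3.client("ec2")

sg = ec2.create_security_group(
    GroupName="emr-livy",
    Description="Inbound access for Livy from AE5",
    VpcId=VPC_ID,
)
ec2.create_tags(Resources=[sg["GroupId"]],
                Tags=[{"Key": "Name", "Value": "emr-livy-sg"}])

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8998,
        "ToPort": 8998,
        "UserIdGroupPairs": [{"GroupId": AE5_SG_ID}],
    }],
)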

Attach the emr-livy-sg security group to the Additional security groups section for the Master only.

../../_images/emr-security.png

Click Create cluster. This will take 10 minutes or so.

Add end users to Kerberos

In this step user principals will be configured.

Launch an SSH console to the master instance created above.

Add the new principal: sudo kadmin.local add_principal -pw password rbarthelmie

Check that the user can log in:

kinit rbarthelmie
Password for rbarthelmie@MYEMRREALM.ORG:
klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: rbarthelmie@MYEMRREALM.ORG
Valid starting       Expires              Service principal
07/09/2018 15:48:05  07/10/2018 01:48:05  krbtgt/MYEMRREALM.ORG@MYEMRREALM.ORG
         renew until 07/10/2018 15:48:05

Repeat these steps for each user to add.
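
If several users need to be added, the same kadmin.local call can be scripted. A minimal sketch, run as root on the master; the user names and passwords below are placeholders:

import subprocess

# Minimal sketch: add a batch of user principals (run as root on the master).
# The user/password pairs are placeholders.
users = {
    "rbarthelmie": "password",
    "anotheruser": "anotherpassword",
}

for user, password in users.items():
    # kadmin.local -q runs a single kadmin query non-interactively.
    subprocess.check_call(
        ["kadmin.local", "-q", "add_principal -pw {} {}".format(password, user)]
    )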

Create user home directories

Change to the hdfs account (as root): su hdfs

Create the user’s home directory: hdfs dfs -mkdir /user/rbarthelmie

Change the ownership on the user’s home directory: hdfs dfs -chown rbarthelmie:rbarthelmie /user/rbarthelmie

Check the home directories for the correct access:

hdfs dfs -ls -q /user
Found 9 items
drwxrwxrwx - hadoop      hadoop      0 2018-07-09 13:06 /user/hadoop
drwxr-xr-x - mapred      mapred      0 2018-07-09 13:06 /user/history
drwxrwxrwx - hdfs        hadoop      0 2018-07-09 13:06 /user/hive
drwxrwxrwx - hue         hue         0 2018-07-09 13:06 /user/hue
drwxrwxrwx - livy        livy        0 2018-07-09 13:06 /user/livy
drwxrwxrwx - oozie       oozie       0 2018-07-09 13:06 /user/oozie
drwxr-xr-x - rbarthelmie rbarthelmie 0 2018-07-09 15:52 /user/rbarthelmie
drwxrwxrwx - root        hadoop      0 2018-07-09 13:06 /user/root
drwxrwxrwx - spark       spark       0 2018-07-09 13:06 /user/spark

Add local Linux accounts

This step is required because the YARN ApplicationMaster requests HDFS delegation tokens on behalf of the user, and YARN needs a matching local Linux identity on each node for that user. The credentials themselves are provided by Kerberos; only the local identity needs to exist.

As root, on both the master and the slave node: useradd rbarthelmie

Test Spark and Livy

At this point Spark and Livy can be tested. If these tests do not complete successfully, do not proceed further until the failures are resolved.

Test 1 - Spark submit

This is the simplest test. It uses the master's Hadoop configuration along with the Kerberos credentials of the user created earlier (in this case rbarthelmie). After a successful kinit (checked with klist if required), run the following command on the newly created master: spark-submit --name SparkPi --master yarn --deploy-mode cluster /usr/lib/spark/examples/src/main/python/pi.py

If successful, one of the final blocks of output should read:

client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
       diagnostics: N/A
       ApplicationMaster host: 172.31.28.18
       ApplicationMaster RPC port: 0
       queue: default
       start time: 1531152504474
       final status: SUCCEEDED
       tracking URL: http://ip-172-31-20-241.ec2.internal:20888/proxy/application_1531141630439_0005/
       user: rbarthelmie

Test 2 - PySpark

In this test the user obtains a Kerberos ticket, uploads a file to HDFS, and then reads it back with PySpark.

Assume the identity of rbarthelmie and kinit:

su rbarthelmie
kinit rbarthelmie

Upload a file to hdfs:

hdfs dfs -put /etc/hadoop/conf/core-site.xml /user/rbarthelmie

Check for its presence:

hdfs dfs -ls /user/rbarthelmie
Found 2 items
drwxr-xr-x   - rbarthelmie rbarthelmie    0 2018-07-09 16:15 /user/rbarthelmie/.sparkStaging
-rw-r--r--   1 rbarthelmie rbarthelmie 4836 2018-07-09 16:16 /user/rbarthelmie/core-site.xml

Start PySpark:

pyspark

At the >>> prompt type the following commands:

>>> textfile = spark.read.text("core-site.xml")
18/07/09 16:23:06 WARN FileStreamSink: Error while looking for metadata directory.
>>> textfile.count()
193
>>> textfile.first()
Row(value=u'<?xml version="1.0"?>')
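
As an optional extension of this test, the same pyspark session can be used to run a small distributed job and write the result back to the user's HDFS home directory (the output path below is just an example):

>>> squares = sc.parallelize(range(1000)).map(lambda x: x * x)
>>> squares.sum()
332833500
>>> squares.saveAsTextFile("/user/rbarthelmie/squares")

The written output can then be confirmed from the shell with hdfs dfs -ls /user/rbarthelmie/squares.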

Test 3 - Livy retrieve sessions

This test ensures that the user can connect to Livy and then onward to the YARN Resource Manager:

/usr/bin/curl -X GET --negotiate -u :  http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998/sessions  | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   300  100   300    0     0   4323      0 --:--:-- --:--:-- --:--:--  4347
100    34  100    34    0     0     66      0 --:--:-- --:--:-- --:--:--   112
{
  "from": 0,
  "sessions": [],
  "total": 0
}
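
The same check can be made from Python, which is useful later when debugging from inside an AE5 session. A minimal sketch, assuming the requests and requests-kerberos packages are available and that kinit has already been run:

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Minimal sketch: list Livy sessions using SPNEGO/Kerberos authentication.
LIVY_URL = "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998"

resp = requests.get(LIVY_URL + "/sessions",
                    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
resp.raise_for_status()
print(resp.json())   # e.g. {'from': 0, 'total': 0, 'sessions': []}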

Test 4 - Livy Create Sessions

This test ensures that the user can connect to Livy and the YARN Resource Manager and then start an application on the YARN Application Master:

/usr/bin/curl -X POST --negotiate -u : --data '{"kind":"pyspark"}' -H "Content-Type:application/json" http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998/sessions | python -m json.tool

If successful the ID of the new application is returned:

{
  "appId": null,
  "appInfo": {
       "driverLogUrl": null,
       "sparkUiUrl": null
  },
  "id": 1,
  "kind": "pyspark",
  "log": [
       "stdout: ",
       "\nstderr: ",
       "\nYARN Diagnostics: "
  ],
  "owner": "rbarthelmie",
  "proxyUser": "rbarthelmie",
  "state": "starting"
}

The session can then be tracked using the Livy retrieve-sessions request from Test 3 above.
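
The create-and-track workflow can also be scripted end to end, which is close to what the SparkMagic kernel does on your behalf later. A minimal sketch, again assuming requests, requests-kerberos, and a valid Kerberos ticket:

import time
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Minimal sketch: create a PySpark session via Livy, wait for it to become
# available, run one statement, then clean up.
LIVY_URL = "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998"
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

session = requests.post(LIVY_URL + "/sessions",
                        json={"kind": "pyspark"}, auth=auth).json()
session_url = LIVY_URL + "/sessions/{}".format(session["id"])

# Poll until the session has finished starting.
while requests.get(session_url, auth=auth).json()["state"] == "starting":
    time.sleep(5)

# Submit a trivial statement and wait for its result.
stmt = requests.post(session_url + "/statements",
                     json={"code": "1 + 1"}, auth=auth).json()
stmt_url = session_url + "/statements/{}".format(stmt["id"])
while requests.get(stmt_url, auth=auth).json()["state"] != "available":
    time.sleep(2)
print(requests.get(stmt_url, auth=auth).json()["output"])

# Delete the session when finished so it does not linger on YARN.
requests.delete(session_url, auth=auth)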

Configure and test AE5 Kerberos

In this step, AE5 connectivity to Kerberos is configured from a user's perspective. (See the documentation for how an administrator can apply Hadoop cluster settings for every user.) The Hadoop-Spark project template is used because it includes support for connecting to the Livy server through the SparkMagic kernel.

In Anaconda Enterprise 5 create a new project.

  • Name: project-spark-emr
  • Select Project Type: Hadoop-Spark
  • Create

In the Projects tab edit the Variable “KRB5_CONFIG” to read:

  • Name: KRB5_CONFIG
  • Description: Location of config file for Kerberos authentication
  • Default: /opt/continuum/project/krb5.conf

Note that the out-of-the-box configuration is /etc/krb5.conf. That location is not editable and will be overwritten if a site-wide configuration is set by an administrator. Saving the variable should create or amend the environment variable within the Terminal, which can be checked with:

echo $KRB5_CONFIG
/opt/continuum/project/krb5.conf

Using a text editor (either the Text Editor from the Launcher window or vi from the Terminal), create the file /opt/continuum/project/krb5.conf. The KDC entries should point to the internal EC2 address of the EMR master (assuming both clusters are in the same VPC) and the realm configured in Security above. You can also obtain a working copy of this file from /etc/krb5.conf on your EMR master:

[libdefaults]
    default_realm = MYEMRREALM.ORG
    dns_lookup_realm = false
    dns_lookup_kdc = false
    rdns = true
    ticket_lifetime = 24h
    forwardable = true
    udp_preference_limit = 1000000
    default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1
    default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1
    permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96 des3-cbc-sha1

[realms]
    MYEMRREALM.ORG = {
        kdc = ip-aaa-bbb-ccc-ddd.ec2.internal:88
        admin_server = ip-aaa-bbb-ccc-ddd.ec2.internal:749
        default_domain = ec2.internal
    }

[domain_realm]
    .ec2.internal = MYEMRREALM.ORG
    ec2.internal = MYEMRREALM.ORG

[logging]
    kdc = FILE:/var/log/kerberos/krb5kdc.log
    admin_server = FILE:/var/log/kerberos/kadmin.log
    default = FILE:/var/log/kerberos/krb5lib.log

With that file in place test your Kerberos configuration with:

kinit rbarthelmie
klist

There should be a result similar to:

Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: rbarthelmie@MYEMRREALM.ORG

Valid starting       Expires              Service principal
07/09/2018 20:14:33  07/10/2018 06:14:33  krbtgt/MYEMRREALM.ORG@MYEMRREALM.ORG
         renew until 07/10/2018 20:14:32

Configure and Test AE5 Livy

In this step we will configure the SparkMagic kernel installed in the AE5 project template “Hadoop-Spark” using a configuration file and two environment variables.

In the Projects tab edit the Variable “SPARKMAGIC_CONF_DIR” to read:

  • Name: SPARKMAGIC_CONF_DIR
  • Description: Location of sparkmagic configuration file
  • Default: /opt/continuum/project

Note that if the default location /opt/continuum/.sparkmagic is used, it will be overwritten by any site-wide configuration applied by an administrator. Also, because .sparkmagic is a hidden directory, its configuration file will not be committed with the other files during a project commit, so it would be lost when a session is restarted.

Confirm in a Terminal that the variable has been set:

echo $SPARKMAGIC_CONF_DIR
/opt/continuum/project

Copy the SparkMagic configuration file template into the SPARKMAGIC_CONF_DIR location: cp /opt/continuum/project/spark/sparkmagic_conf.example.json /opt/continuum/project/config.json

Edit the configuration file: vi config.json

Change the targets for the Livy Server to the EMR Master and the authentication types to read:

"kernel_python_credentials" : {
    "url": "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998",
    "auth": "Kerberos"
},
"kernel_python3_credentials" : {
    "url": "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998",
    "auth": "Kerberos"
},
"kernel_scala_credentials" : {
    "url": "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998",
    "auth": "Kerberos"
},
"kernel_r_credentials": {
    "url": "http://ip-aaa-bbb-ccc-ddd.ec2.internal:8998",
    "auth": "Kerberos"
},

From the Launcher open a Notebook with the PySpark3 template.

In the first cell, type sc and then press Shift-Enter. If successful, the following will be seen in the output cell:

../../_images/emr-sc.png
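
Beyond inspecting sc, a quick way to confirm that the notebook is reaching the cluster is to read back the file uploaded to HDFS in Test 2; for example, in the next cell:

# Runs on the EMR cluster via Livy: read back the file uploaded in Test 2.
lines = sc.textFile("/user/rbarthelmie/core-site.xml")
print(lines.count())
print(lines.first())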

If that is working correctly, you can continue to explore Spark interactively. If something is not right, see the next section for details on how to identify and remediate errors.

Errors and Resolutions

Looking for Errors

Look for more detail on errors in the following locations:

  • Notebooks - You can retrieve the log information for the current Livy session using the special “magic” by typing %%logs into a cell and then executing the cell.
  • Terminal - The log directory is shown under the “handlers” section of the SparkMagic configuration file. The default location is /opt/continuum/.sparkmagic-logs/logs.
  • Livy Server - The default log location is /var/log/livy/livy-livy-server.out. The most relevant errors can be found with the tag “stdout: ERROR:”.
  • Kerberos Server - The default log location is /var/log/kerberos/krb5kdc.log.
  • S3 Log URI - This link can be found in the cluster summary on the Amazon EMR Console and is of the format s3://aws-logs-nnnnnnnnnnnn-us-east-1/elasticmapreduce/. The cluster ID is also displayed on the cluster summary if you have multiple clusters. Use the following locations:
    • cluster/node/(master instance id)/applications/hadoop-yarn/yarn-yarn-resourcemanager
    • cluster/node/(slave instance id)/applications/hadoop-yarn/yarn-yarn-nodemanager
  • Application history - This can be found in the Amazon EMR console under the Applications history tab. Only successfully submitted applications will be logged here.

Common Errors

Error

User: livy/ip-aaa-bbb-ccc-ddd.ec2.internal@MYEMRREALM.ORG is not allowed to impersonate rbarthelmie

Location

%%logs and Livy Server

Remediation

Check /etc/hadoop-kms/conf.empty/kms-site.xml for these entries:

<property>
    <name>hadoop.kms.proxyuser.livy.users</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.kms.proxyuser.livy.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.kms.proxyuser.livy.groups</name>
    <value>*</value>
</property>

And /etc/hadoop/conf/core-site.xml for these:

<property>
    <name>hadoop.proxyuser.livy.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.livy.groups</name>
    <value>*</value>
</property>

Correct any omissions or mistakes and restart the impacted services on both the master and the slave nodes.

Error

INFO LineBufferedStream: stdout: main : requested yarn user is rbarthelmie
INFO LineBufferedStream: stdout: User rbarthelmie not found

Location

Livy Server

Remediation

The Linux user does not exist on the node. As root, run useradd <user> on all nodes.

Error

HTTP ERROR 401

Location

%%logs

Livy Server

Remediation

Usually this is a failure to kinit prior to attempting to start a session.

Error

Server not found in Kerberos database

Location

Kerberos Server

Remediation

This indicates a mismatch between the hostname used in the SparkMagic configuration and the service principals registered in Kerberos (list them with kadmin.local list_principals). Check whether AWS internal or external hostnames are being used, and ensure the hostnames used in each configuration resolve and are reachable from both sides.

Error

GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Location

%%logs

Livy Server

Remediation

HDFS cannot obtain a delegation token because its Kerberos ticket (obtained from the keytab) has expired. Renew it with:

kinit hdfs/ip-aaa-bbb-ccc-ddd.ec2.internal@MYEMRREALM.ORG -k -t /etc/hdfs.keytab

Error

INFO LineBufferedStream: stdout: Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=rbarthelmie, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x

Location

%%logs

Livy Server

Remediation

This occurs because user rbarthelmie has no HDFS home directory to write results to. Fix it with:

su hdfs
hdfs dfs -mkdir /user/rbarthelmie
hdfs dfs -chown -R rbarthelmie:rbarthelmie /user/rbarthelmie
hdfs dfs -ls /user

Restarting Services

If configuration changes are made, it will be necessary to restart the Hadoop services that are impacted. A list of services can be obtained with the initctl list command.

Slave Node Services:

sudo stop hadoop-hdfs-datanode
sudo start hadoop-hdfs-datanode
sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

Master Node Services:

sudo stop hive-server2
sudo start hive-server2
sudo stop hadoop-mapreduce-historyserver
sudo start hadoop-mapreduce-historyserver
sudo stop hadoop-yarn-timelineserver
sudo start hadoop-yarn-timelineserver
sudo stop hive-hcatalog-server
sudo start hive-hcatalog-server
sudo stop livy-server
sudo start livy-server
sudo stop hadoop-yarn-resourcemanager
sudo start hadoop-yarn-resourcemanager
sudo stop hadoop-kms
sudo start hadoop-kms
sudo stop hue
sudo start hue
sudo stop hadoop-httpfs
sudo start hadoop-httpfs
sudo stop oozie
sudo start oozie
sudo stop hadoop-yarn-proxyserver
sudo start hadoop-yarn-proxyserver
sudo stop spark-history-server
sudo start spark-history-server
sudo stop hadoop-hdfs-namenode
sudo start hadoop-hdfs-namenode

Further Reading