Provide Spark with cross-account access

cover photo: Emma Döbken 2018.

If you need to give Spark access to resources in a different AWS account, the setup can be tricky to figure out. Here is the approach I ended up with.

Let’s assume you have two AWS accounts: the alpha account, where you run Python with IAM role alpha-role and have access to the Spark cluster; and the beta account, which holds the S3 bucket you want to access. You could grant S3 read access directly to the alpha-role, but it is more durable and easier to manage to create an access role in the beta account that the alpha-role can assume.
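Before setting anything up, it helps to confirm which identity you are actually running under in the alpha environment. A quick check with the AWS CLI (the output should show the alpha account ID and the alpha-role):

```shell
# print the account and ARN of the credentials currently in use
aws sts get-caller-identity
```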

1. Create the access role

Start with creating a policy, e.g. named “read-access-beta-bucket-dataset” with the following JSON content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::beta-bucket-dataset-eu-west-1"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::beta-bucket-dataset-eu-west-1/*"
        }
    ]
}
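A detail that is easy to miss here: ListBucket and GetBucketLocation apply to the bucket ARN itself, while GetObject applies to the objects inside it, which is why the second statement needs the /* suffix. As a local sanity check before uploading, the document can be parsed with nothing but the standard library (a sketch, using the same policy as above):

```python
import json

# the policy document from above, parsed locally before uploading
policy = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::beta-bucket-dataset-eu-west-1"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::beta-bucket-dataset-eu-west-1/*"
        }
    ]
}
""")

bucket_stmt, object_stmt = policy["Statement"]
# bucket-level actions must target the bucket ARN itself ...
assert not bucket_stmt["Resource"].endswith("/*")
# ... while object-level actions need the /* suffix
assert object_stmt["Resource"].endswith("/*")
print("policy document is consistent")
```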

Now create the IAM role beta-role and attach the above created policy. To establish trust between the alpha-role and this newly created access role, include the alpha-role as a trusted entity to the trust relationship of the access role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::xxxxxxxxxxxx:role/alpha-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
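If you prefer the CLI over the console, the same setup can be scripted. A sketch, assuming the policy and trust documents above are saved as policy.json and trust.json, and with the account ID as a placeholder:

```shell
# create the managed policy from the JSON above
aws iam create-policy \
    --policy-name read-access-beta-bucket-dataset \
    --policy-document file://policy.json

# create the role with the trust relationship as its assume-role policy
aws iam create-role \
    --role-name beta-role \
    --assume-role-policy-document file://trust.json

# attach the policy to the role
aws iam attach-role-policy \
    --role-name beta-role \
    --policy-arn arn:aws:iam::xxxxxxxxxxxx:policy/read-access-beta-bucket-dataset
```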

2. Test the access

From a terminal in the alpha environment you can now test the bucket access by assuming the beta-role. Make sure the AWS CLI is installed (pip install awscli). Use the following command to get temporary credentials for the beta-role:

aws sts assume-role --role-arn "arn:aws:iam::xxxxxxxxxxxx:role/beta-role" --role-session-name "test-access"

Export the returned credentials as three environment variables to assume the role:

export AWS_ACCESS_KEY_ID=RoleAccessKeyID
export AWS_SECRET_ACCESS_KEY=RoleSecretKey
export AWS_SESSION_TOKEN=RoleSessionToken
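Copying the three values by hand gets old quickly. The export can be scripted with the CLI's --query option, which here returns the three values tab-separated on one line (a sketch; the role ARN is the same placeholder as above):

```shell
# fetch the temporary credentials as a single tab-separated line
creds=$(aws sts assume-role \
    --role-arn "arn:aws:iam::xxxxxxxxxxxx:role/beta-role" \
    --role-session-name "test-access" \
    --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
    --output text)

# split the line into the three environment variables
export AWS_ACCESS_KEY_ID=$(echo "$creds" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$creds" | cut -f3)
```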

You should now be able to list the contents of the bucket:

aws s3 ls s3://beta-bucket-dataset-eu-west-1
aws s3 ls s3://beta-bucket-dataset-eu-west-1 --recursive

You can revert to the original role by unsetting the environment variables:

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

3. Spark access

Finally, we can give Spark cross-account read access using temporary credentials. Note that the specific AWS and Hadoop jars are really brittle across versions; in my case this worked with Spark 2.4.6 and Hadoop 2.9.2.

from pyspark.sql import SparkSession
from pyspark import SparkConf
import boto3

role_session_name = "s3read"
role_arn = "arn:aws:iam::xxxxxxxxxxxx:role/beta-role"
duration_seconds = 60 * 15  # duration of the session in seconds

# obtain the temporary credentials
credentials = boto3.client("sts").assume_role(
    RoleArn=role_arn,
    RoleSessionName=role_session_name,
    DurationSeconds=duration_seconds
)['Credentials']


conf = SparkConf()
# use the S3A filesystem implementation for s3a:// paths
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# pull in the AWS SDK and the hadoop-aws jar matching the cluster's Hadoop version
conf.set(
    "spark.jars.packages",
    "com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:2.9.2",
)

# temporary (STS) credentials require this provider, which also reads the session token
conf.set(
    "spark.hadoop.fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
)
conf.set(
    "spark.hadoop.fs.s3a.access.key",
    credentials.get("AccessKeyId"),
)
conf.set(
    "spark.hadoop.fs.s3a.secret.key",
    credentials.get("SecretAccessKey"),
)
conf.set(
    "spark.hadoop.fs.s3a.session.token",
    credentials.get("SessionToken"),
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv("s3a://beta-bucket-dataset-eu-west-1/*.csv")

df.show()