
Introduction
In this post we will show how to deploy a “stateless” Apache Spark cluster on Kubernetes. Spark is a fast analytics engine designed for large-scale data processing. Furthermore, we will then run analytics queries against data sitting in S3, in our case StorageGRID Webscale.
We use S3 as the data source/target because it is an elegant way to decouple the analytics platform from its data. In contrast to HDFS, where data is “siloed” in the Hadoop nodes, this enables a very broad set of commercial and self-written applications to read and write data from and to S3.
In this post we will use Spark, but in the future upcoming machine learning frameworks could be used. Especially in scenarios where large quantities of data are stored for a very long time, S3 opens up a lot more doors to do new things with it.
Architecture Overview
Our setup assumes a running Kubernetes cluster on which we’ll deploy a Spark master node, as well as two worker nodes. In the backend, we will use StorageGRID Webscale via its S3 API for loading data into the workers’ memory, as well as write the results back to S3.
Deploying Spark on Kubernetes
Before starting, clone the corresponding GitHub repository with the example from here. Most noteworthy, many steps are based off the “official” Spark example, which also contains a few more details on the deployment steps.
In order to start, create a new namespace:
$ kubectl create -f namespace-spark-cluster.yaml
Then, configure kubectl to work with the new namespace:
$ CURRENT_CONTEXT=$(kubectl config view -o jsonpath='{.current-context}') $ USER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.user}') $ CLUSTER_NAME=$(kubectl config view -o jsonpath='{.contexts[?(@.name == "'"${CURRENT_CONTEXT}"'")].context.cluster}') $ kubectl config set-context spark --namespace=spark-cluster --cluster=${CLUSTER_NAME} --user=${USER_NAME} $ kubectl config use-context spark
Now, we can go ahead and deploy the Spark master Replication Controller and Service:
$ kubectl create -f spark-master-controller.yaml $ kubectl create -f spark-master-service.yaml
$ kubectl create -f spark-worker-controller.yaml
It might take a few minutes until the container image is pulled, so let’s wait until everything is up and running:
$ kubectl get all NAME READY STATUS RESTARTS AGE po/spark-master-controller-5rgz2 1/1 Running 0 9m po/spark-worker-controller-0pts6 1/1 Running 0 9m po/spark-worker-controller-cq6ng 1/1 Running 0 9m NAME DESIRED CURRENT READY AGE rc/spark-master-controller 1 1 1 9m rc/spark-worker-controller 2 2 2 9m NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE svc/spark-master 10.108.94.160 <none> 7077/TCP,8080/TCP 9m
Looks like everything is up and running. Now we can continue and configure the connection to S3.
Configuring S3 Access in Spark
First, let’s fire up a Spark shell by connecting to the master:
$ kubectl exec spark-master-controller-5rgz2 -it spark-shell [...] Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111) Type in expressions to have them evaluated. Type :help for more information. scala>
As the subsequent step, we need to let Spark know to which S3 endpoint it needs to connect to. Hence we need to configure the S3 endpoint address with its port (SSL is enabled by default), the Access Key and the Secret Access Key. Additionally, we can enable the “fast upload” feature of the s3a connector.
scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.company.com:8082") scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "94IMPM0VXXXXXXXX") scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "L+3B2xXXXXXXXXXXX") scala> sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")
If you are using a self-signed certificate (and you haven’t put it in the JVM truststore), you can disable SSL certificate verification. However, please don’t do this in production.
scala> System.setProperty("com.amazonaws.sdk.disableCertChecking", "1")
Running queries against data in S3
In our example, I’ve already created a bucket called “spark” and uploaded a 200MB text file. We can load this object into a Resilient Distributed Dataset (RDD) with the following commands:
scala> val movies = sc.textFile("s3a://spark/movies.txt") movies: org.apache.spark.rdd.RDD[String] = s3a://spark/movies.txt MapPartitionsRDD[1] at textFile at <console>:24 scala> movies.count() res6: Long = 4245028
scala> val godfather_movies = movies.filter(line => line.contains("Godfather")) scala> godfather_movies.saveAsTextFile("s3a://spark/godfather.txt")
Let’s see what Spark wrote to our S3 bucket:
$ sgws s3 ls s3://spark/godfather.txt/ --profile spark

Further notes
This setup is a just an initial introduction on getting StorageGRID S3 working with Apache Spark on Kubernetes. Getting insights out of your data is the next step, but also optimizing performance is an important topic. For example, using Spark’s parallelize
call to execute object reads in parallel can yield massive performance improvements over using a simple sc.textFiles(s3a://spark/*)
as used in this example.
Summary
In this post we have shown a simple way to run a Spark cluster on Kubernetes and consume data sitting in StorageGRID Webscale S3. There are more topics to cover like more sophisticated queries and performance tuning, but those are topics for another post. The full source code can be found on GitHub.