Deploying Apache Spark on Kubernetes with S3 Support

Introduction

In this post we will show how to deploy a “stateless” Apache Spark cluster on Kubernetes. Spark is a fast analytics engine designed for large-scale data processing. We will then run analytics queries against data sitting in S3, in our case StorageGRID Webscale.

We use S3 as the data source/target because it is an elegant way to decouple the analytics platform from its data. In contrast to HDFS, where data is “siloed” in the Hadoop nodes, this enables a very broad set of commercial and self-written applications to read and write data from and to S3.

In this post we will use Spark, but in the future other tools, such as upcoming machine learning frameworks, could work against the same data. Especially in scenarios where large quantities of data are stored for a very long time, S3 opens up many more ways to do new things with that data.

Architecture Overview

Our setup assumes a running Kubernetes cluster on which we’ll deploy a Spark master node and two worker nodes. In the backend, we will use StorageGRID Webscale via its S3 API, both for loading data into the workers’ memory and for writing the results back to S3.

Figure: Spark Kubernetes Architecture

Deploying Spark on Kubernetes

Before starting, clone the corresponding GitHub repository containing the example. Many steps are based on the “official” Spark example for Kubernetes, which also contains a few more details on the deployment steps.

In order to start, create a new namespace:


Then, configure kubectl to work with the new namespace:


Now, we can go ahead and deploy the Spark master Replication Controller and Service:

After those are running, we can start our Spark workers. The corresponding containers already include the s3a connector, so there is no need to deploy any additional libraries. If you want more workers, feel free to edit the replica count in the YAML.
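The deployment steps up to this point can be sketched as a small Python helper that assembles the kubectl invocations. The namespace name `spark-cluster` and the manifest file names are assumptions for illustration, so use the names from the cloned repository instead. Only the `kubectl` subcommands themselves are standard.

```python
import subprocess

# Hypothetical names: adjust to the namespace and manifests in your checkout.
NAMESPACE = "spark-cluster"
MANIFESTS = [
    "spark-master-controller.yaml",  # master Replication Controller
    "spark-master-service.yaml",     # master Service
    "spark-worker-controller.yaml",  # worker Replication Controller
]


def deployment_commands(namespace=NAMESPACE, manifests=MANIFESTS):
    """Return the kubectl commands for the deployment, in order."""
    commands = [
        ["kubectl", "create", "namespace", namespace],
        # Make the new namespace the default for the current context.
        ["kubectl", "config", "set-context", "--current",
         f"--namespace={namespace}"],
    ]
    commands += [["kubectl", "apply", "-f", manifest] for manifest in manifests]
    return commands


def deploy(dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in deployment_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling `deploy()` only prints the commands, which is a convenient way to review them. `deploy(dry_run=False)` runs them against the cluster that kubectl currently points to.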

It might take a few minutes until the container image is pulled, so let’s wait until everything is up and running:
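If you prefer to script the waiting step, a minimal sketch is to parse the JSON output of `kubectl get pods` and check the phase of every pod. It assumes kubectl is already configured for the Spark namespace. The function names are our own.

```python
import json
import subprocess


def pods_not_running(pod_list_json):
    """Return the names of pods whose phase is not 'Running'."""
    pods = json.loads(pod_list_json)["items"]
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod["status"].get("phase") != "Running"
    ]


def all_pods_running():
    """Ask kubectl for the pods in the current namespace and check them.

    Note: an empty namespace also counts as "all running".
    """
    result = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return not pods_not_running(result.stdout)
```

Polling `all_pods_running()` every few seconds until it returns `True` replaces watching the pod list by hand.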

Once all pods report as running, we can continue and configure the connection to S3.

Configuring S3 Access in Spark

First, let’s fire up a Spark shell by connecting to the master:

Next, we need to tell Spark which S3 endpoint to connect to. To do so, we configure the S3 endpoint address with its port (SSL is enabled by default), the Access Key, and the Secret Access Key. Additionally, we can enable the “fast upload” feature of the s3a connector.
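As a minimal sketch, and assuming the shell started above is a PySpark shell, the settings map to the following Hadoop s3a properties. The property names are the ones the s3a connector reads. The helper functions and the placeholder endpoint in the usage note are our own.

```python
def s3a_settings(endpoint, access_key, secret_key, fast_upload=True):
    """Build the Hadoop s3a properties for an S3 endpoint as a plain dict."""
    return {
        "fs.s3a.endpoint": endpoint,  # host:port of the S3 endpoint
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.fast.upload": "true" if fast_upload else "false",
    }


def apply_settings(hadoop_conf, settings):
    """Copy each property onto a Hadoop Configuration-like object."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

Inside the PySpark shell this could be used as `apply_settings(sc._jsc.hadoopConfiguration(), s3a_settings("s3.example.com:8082", "<access key>", "<secret key>"))`. Note that `sc._jsc` is an internal attribute of the PySpark context, so treat this as a common idiom rather than a stable API.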

If you are using a self-signed certificate (and you haven’t put it in the JVM truststore), you can disable SSL certificate verification. However, please don’t do this in production.
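One possible way to do this, assuming the s3a connector in your image is built on the AWS SDK for Java v1, is to pass that SDK's `com.amazonaws.sdk.disableCertChecking` system property to the driver and executor JVMs when launching the shell. The helper below is only a sketch that builds the corresponding `--conf` arguments. Whether it applies to your setup depends on the Hadoop and SDK versions in the container.

```python
# Java system property read by the AWS SDK for Java v1 (an assumption
# about the SDK version bundled with the s3a connector in your image).
DISABLE_CERT_CHECK = "-Dcom.amazonaws.sdk.disableCertChecking=true"


def insecure_tls_args():
    """Return --conf arguments that disable certificate checks.

    For test setups with self-signed certificates only.
    """
    return [
        "--conf", f"spark.driver.extraJavaOptions={DISABLE_CERT_CHECK}",
        "--conf", f"spark.executor.extraJavaOptions={DISABLE_CERT_CHECK}",
    ]
```

The returned list can be appended to the command line that starts the Spark shell. Importing the certificate into the JVM truststore remains the cleaner option.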

Running Queries Against Data in S3

In our example, I’ve already created a bucket called “spark” and uploaded a 200 MB text file. We can load this object into a Resilient Distributed Dataset (RDD) with the following commands:
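A minimal sketch of the loading step, again assuming a PySpark shell where `sc` is the Spark context. The bucket name “spark” comes from this example. The object key is a placeholder, since it depends on what you uploaded.

```python
def s3a_uri(bucket, key=""):
    """Build an s3a:// URI for a bucket and an optional object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def load_lines(sc, bucket, key):
    """Return an RDD with one element per line of the given object."""
    return sc.textFile(s3a_uri(bucket, key))
```

In the shell, `rdd = load_lines(sc, "spark", "<your object key>")` creates the RDD lazily. Calling `rdd.count()` then triggers the actual read and returns the number of lines.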

Looks like it is working! Let’s try some filtering and then write the results back to StorageGRID S3:
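The filter-and-write step could look like the sketch below. It uses the standard RDD methods `filter` and `saveAsTextFile`. The search term and the output location are hypothetical and up to you.

```python
def filter_and_save(rdd, needle, output_uri):
    """Keep the lines containing `needle` and write them under `output_uri`."""
    matches = rdd.filter(lambda line: needle in line)
    # saveAsTextFile fails if the target path already exists.
    matches.saveAsTextFile(output_uri)
    return matches
```

For example, `filter_and_save(rdd, "<search term>", "s3a://spark/<output prefix>")` writes all matching lines back to the “spark” bucket.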

Let’s see what Spark wrote to our S3 bucket:

As you can see, Spark didn’t write a single object, but rather chunked the output over multiple objects. For small objects (like in this example), this makes limited sense. On the other hand, this can greatly improve overall throughput for writing large datasets to S3 as all workers write in parallel. Concatenating all objects would yield the complete dataset as a single text file.
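To illustrate the concatenation, here is a small sketch in plain Python. It relies on Spark's usual output naming, where each partition is written as `part-00000`, `part-00001`, and so on, next to a `_SUCCESS` marker. The `fetch` callable is a stand-in for whatever S3 client you use to download an object.

```python
import re

# Spark names its output objects part-00000, part-00001, ...
PART_NAME = re.compile(r"part-\d+")


def part_keys(keys):
    """Select the part objects (skipping markers such as _SUCCESS), sorted."""
    parts = [k for k in keys if PART_NAME.match(k.rsplit("/", 1)[-1])]
    # Zero-padded numbers sort correctly as plain strings.
    return sorted(parts)


def concatenate(keys, fetch):
    """Join the part objects in order; `fetch(key)` must return bytes."""
    return b"".join(fetch(key) for key in part_keys(keys))
```

Passing the listing of the output prefix together with a download function yields the complete dataset as a single byte string.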

Further Notes

This setup is just an initial introduction to getting StorageGRID S3 working with Apache Spark on Kubernetes. Getting insights out of your data is the next step, and optimizing performance is an important topic as well. For example, using Spark’s parallelize call to execute object reads in parallel can yield massive performance improvements over a simple sc.textFile("s3a://spark/*") as used in this example.
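The parallelize idea can be sketched as follows: distribute the list of object keys across the cluster and let every task fetch its own objects. This is only an outline under assumptions. `fetch_lines` is a hypothetical function you would write yourself, for instance on top of an S3 client library that is installed on the workers.

```python
def read_objects_in_parallel(sc, keys, fetch_lines, num_slices=None):
    """Distribute the object keys and let each task fetch its own objects.

    `fetch_lines(key)` must return an iterable of lines and must be
    picklable, because Spark ships it to the workers.
    """
    # Default to one slice per object, and never fewer than one slice.
    slices = max(1, num_slices or len(keys))
    return sc.parallelize(keys, slices).flatMap(fetch_lines)
```

The resulting RDD holds the lines of all objects, just like the wildcard read, but the number of parallel reads is controlled explicitly through `num_slices`.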

Summary

In this post we have shown a simple way to run a Spark cluster on Kubernetes and consume data sitting in StorageGRID Webscale S3. There are more topics to cover like more sophisticated queries and performance tuning, but those are topics for another post. The full source code can be found on GitHub.

Clemens Siebler
Manager Solution Architects EMEA
Clemens is leading a technical team of Solution Architects in EMEA. In his current role, he and his team are evangelizing upcoming market trends like Containers, Object Storage, OpenStack, and NFV. His current passion is enabling customers to transition their large scale workloads to Object Storage. Before, he worked as a Software Engineer on NetApp’s software products, where he published multiple patents on plug-in frameworks.
