ONTAP AI and NetApp Trident: Streamlining and simplifying AI workloads

I listened to a lot of Nirvana as a teenager, usually at full volume. “Nevermind” was pretty much on repeat for a time there, and my mom often worried that I’d lose my hearing. One of the things I loved about their music was the energy, but also how raw and simple it was. People who didn’t appreciate Nirvana’s music would lament how “they only knew three chords.” My response was, if three chords work, why do you need to use more?

Over the years I’ve come to appreciate other types of music, including progressive rock, which is known for how intricate the music can be (think Yes, Dream Theater, Rush, the Mars Volta). I’ve come to recognize the value in intricate skill, but often I yearn for the simple, frenetic sounds in a song that spans a few minutes rather than a majestic, orchestral 20-minute epic.

I believe there is a parallel with IT operations here. In this case, artificial intelligence (AI)/machine learning workloads will be the example.

As Santosh Rao pointed out in his first blog on AI, “enterprises are eager to take advantage of AI technologies as a means to introduce new services and enhance insights from company data.” But what enterprises are finding is that these workloads are not “Come as You Are,” but instead are often massively complex – featuring multiple moving parts and applications. This kind of complexity can drive people away from a technology – recall the initial adoption of OpenStack. While adoption rates for OpenStack have steadily increased over recent years, early adoption of the platform was slow, partly due to the complexity around setup and management.

NetApp® also learned lessons about overcomplicating things from early releases of ONTAP®, and since then has been applying those lessons to the release of new products and architectures. The objective now is to make the technology as simple as possible so that using our products doesn’t “Drain You.”

Simplicity through Convergence

Recently, NetApp announced NetApp ONTAP® AI, a new converged infrastructure that helps enterprises quickly and easily stand up an AI architecture at scale. It combines NetApp AFF all-flash storage running ONTAP with NVIDIA DGX-1 servers.

With this solution, you can deploy a new AI infrastructure without all the configuration and guesswork, as well as scale the solution out as the AI workload demands. With NetApp’s cloud-driven strategy, you can also leverage an edge/core/cloud mentality, powered by ONTAP anywhere and everywhere and designed to address bottlenecks.

As ONTAP Senior VP Octavian Tanase put it in this blog, “ONTAP AI is designed to address AI bottlenecks wherever they occur. NVIDIA and NetApp technologies eliminate performance issues and enable secure, non-disruptive data access, delivering AI performance at scale. The NetApp Data Fabric enables you to integrate diverse, dynamic, and distributed data sources with complete control and protection.”

There’s just “Something in the Way” NetApp’s AI solution takes a hard problem and makes it easy. (OK, I’ll stop with the Nirvana references now.)

Simplicity through Storage Management

AI workloads can chew through a lot of data and a high number of files. When you buy a multi-node ONTAP cluster, you want peace of mind that all your hardware resources are being used to process the workload as efficiently as possible.

In AI/Deep Learning workflows, data will travel through several places using a data pipeline. Santosh Rao covered it here, but let’s recap:

  • Data ingest – ingestion usually occurs at the edge, for example capturing data streaming from cars or point-of-sale devices. Depending on the use case, IT infrastructure might be needed at or near the ingestion point. For instance, a retailer might need a small footprint in each store, consolidating data from multiple devices.
  • Data prep – preprocessing is necessary to normalize data before training. Preprocessing takes place in a data lake, possibly in the cloud in the form of an S3 tier, or on premises as a file store or object store.
  • Training – for the critical training phase of deep learning, data is typically copied from the data lake into the training cluster at regular intervals. Servers used in this phase use GPUs to parallelize operations, creating a tremendous appetite for data. Raw I/O bandwidth is crucial.
  • Deployment – the resulting model is pushed out to be tested and then moved to production. Depending on the use case, the model might be deployed back to edge operations. Real-world results of the model are monitored, and feedback in the form of new data flows back into the data lake, along with new incoming data to iterate on the process.
  • Archive – cold data from past iterations may be saved indefinitely. Many AI teams want to archive cold data to object storage, in either a private or public cloud.

For many of the use cases in the pipeline, storage administrators can provision ONTAP’s newest NAS storage container – the FlexGroup volume – to deliver high performance, massive capacity, and simple management of a volume that spans multiple nodes in a cluster.

For example, a FlexGroup volume is great for ingesting large amounts of data. Why not stand one up at the edge?

FlexGroup volumes also can handle massive amounts of data (20PB and beyond), so if you’re planning on hosting your data lake as an on-premises file store, why not use a FlexGroup volume there?

Training clusters require data ingest at regular intervals, with many parallel operations being thrown at the storage system, often to a single namespace. This is where the biggest bottleneck can occur. Can you guess which ONTAP container performs very well under these conditions? (hint: It’s a FlexGroup volume)
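
Provisioning one is a single command from the ONTAP CLI. Here’s a minimal sketch, assuming an SVM named svm1 and aggregates named aggr1 and aggr2 (the names, member multiplier, and sizing are illustrative):

cluster::> volume create -vserver svm1 -volume fg_data -aggr-list aggr1,aggr2 -aggr-list-multiplier 4 -size 100TB -junction-path /fg_data -security-style unix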

In addition, for archiving and cold-data tiering, ONTAP provides FabricPool, which tiers cold data to an S3 object store, either on premises (StorageGRID, for example) or in the public cloud. While FabricPool is not currently available for FlexGroup volumes, you can still leverage it with FlexVol volumes today.
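
Attaching a FabricPool target is just a couple of ONTAP commands. A rough sketch, assuming a StorageGRID bucket and an aggregate named aggr1 (the object store name, endpoint, bucket, and keys are placeholders, and exact options can vary by ONTAP release):

cluster::> storage aggregate object-store config create -object-store-name sgws01 -provider-type SGWS -server sgws.example.com -container-name fabricpool-bucket -access-key <access-key> -secret-password <secret-key>
cluster::> storage aggregate object-store attach -aggregate aggr1 -object-store-name sgws01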

Santosh also goes into more detail on where bottlenecks can occur in the data pipeline here:

Accelerate I/O for Your Deep Learning Pipeline

FlexGroup volumes are increasingly prevalent in these high-traffic workloads, and the story only gets better as ONTAP releases new versions with additional features.
But ONTAP isn’t the only NetApp piece of this puzzle.

Simplicity through Automation

Trident is an open source storage provisioner and orchestrator for the NetApp portfolio.

You can read more about it here:

https://netapp.io/2016/12/23/introducing-trident-dynamic-persistent-volume-provisioner-kubernetes/

You can also read about how we use it for AI/ML here:

https://www.theregister.co.uk/2018/08/03/netapp_a800_pure_airi_flashblade/

When you’re using Trident, you can create Docker-ready NFS exported volumes in ONTAP and present them to any of your containers simply by specifying the -v option in your “docker run” commands.

For example, here’s an NFS exported volume created using Trident:
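
(A minimal sketch, assuming the Trident Docker volume plugin is installed under the alias “netapp” and configured against an ONTAP NAS backend; the volume name and size are illustrative.)

docker volume create -d netapp --name ai_dataset -o size=100g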

Here’s what shows up on the ONTAP system:
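
(The plugin names its volumes with a storage prefix, “netappdvp_” by default, so you can check for them from the cluster shell; the SVM name below is a placeholder.)

cluster::> volume show -vserver svm1 -volume netappdvp_*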

Then, I can just start up the container using that volume:
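
(This example reuses the volume created above with the DIGITS image referenced later in this post; the mount path is arbitrary.)

docker run --rm -it -v ai_dataset:/data nvidia/digits:6.0 /bin/bash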

There are many use cases for having a centralized NFS storage volume for your containers to rely on. A centralized system provides read and write access to the same location across a network, on a high-performing storage system with all sorts of data protection capabilities to ensure high availability and resiliency.

NetApp Trident added support for FlexGroup volumes in the 18.07 release, which provides easy and repeatable automation of volume deployment in AI/ML/DL environments. As an added bonus, because Trident integrates seamlessly with Docker and Kubernetes, the entire AI/ML/DL data pipeline can be deployed as pods, enabling end-to-end automation of the workflow.

To use FlexGroup volumes with the Docker plugin, edit the /etc/netappdvp/config.json file to use the FlexGroup driver.
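
A minimal sketch of that config, assuming the ontap-nas-flexgroup driver (the LIF addresses, SVM name, and credentials are placeholders; see the Trident documentation for the full set of options):

{
    "version": 1,
    "storageDriverName": "ontap-nas-flexgroup",
    "managementLIF": "10.0.0.1",
    "dataLIF": "10.0.0.2",
    "svm": "svm1",
    "username": "admin",
    "password": "secret"
}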

When a new job needs to kick off, the operations team doesn’t have to do anything, because they’ve already programmed Kubernetes (correctly, hopefully) to configure and deploy the necessary resources to the applications. Those resources include storage (such as a FlexGroup volume or FlexVol® volume), containers that run the application, NFS bind mounts into the containers, and even tagging of NVIDIA GPUs to the workloads. This is simplicity built on top of multiple layers of simplicity, all to solve some complex problems when deploying AI/ML/DL workflows.

Simplicity through Configuration

Now that I have your attention, let’s talk about how we can configure Kubernetes to use Trident to provision storage for AI/ML/DL pods that can leverage NVIDIA GPUs.

First of all, you need to build your Kubernetes cluster. There are a variety of ways to do this, so I won’t cover that here. But understand that I’d never set up Kubernetes before, and NetApp Trident developer Jonathan Rippy helped me get a running single-node instance going in about an hour.

The complete steps for getting Trident going in Kubernetes can be found here:

https://netapp-trident.readthedocs.io/en/latest/kubernetes/deploying.html

Once you have a Kubernetes cluster, you’ll need to validate it for use with Trident, download the Trident installer, and then configure and install it. I ran my Kubernetes node on CentOS 7.x and ran into a couple of one-off issues. Basically, make sure you do the following (a rough sketch of the commands appears after the list):

• Set the hostname properly (needed for security certificates, node lookup, and so on)
• Create a symlink for Docker runc (https://access.redhat.com/solutions/2876431)
• Turn swap off (swapoff… and disable for future reboots)
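
On CentOS 7, that prep looks roughly like this (the hostname is a placeholder, and the runc symlink path comes from the Red Hat article linked above; verify the paths on your system):

hostnamectl set-hostname k8s-node1.example.com
ln -s /usr/libexec/docker/docker-runc-current /usr/bin/docker-runc
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab   # comment out swap so it stays off after reboots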

Once Trident is installed, you can start to add storage classes, backends, and so on. You can add backends for multiple SVMs, multiple clusters, and multiple volume types to create a robust Kubernetes deployment model for your storage needs.
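
For instance, a Kubernetes StorageClass that maps to ONTAP NAS backends might look something like this (the class name is arbitrary; netapp.io/trident is the provisioner string used by Trident 18.x, and backends themselves are added with “tridentctl create backend -f backend.json -n trident”):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: basic
provisioner: netapp.io/trident
parameters:
  backendType: "ontap-nas"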

You can then use persistent volume claims (PVCs) to dynamically provision and attach volumes to pods for your NVIDIA GPU-accelerated AI/ML/DL workloads, as well as your traditional stateful workloads.
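
A basic claim against that storage class might look like this (the claim name and size are arbitrary; this is the kind of object referenced by claimName in the pod spec later in this post):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: basic
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: basic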

To add NVIDIA-specific bits to Docker, you can use this reference:

https://github.com/NVIDIA/nvidia-docker

NVIDIA has also provided Kubernetes-specific steps for adding GPU support:

https://github.com/NVIDIA/k8s-device-plugin

Here’s a cheat sheet of sorts to add the plugin and configure it (thanks, Rippy!).

First, confirm that the nvidia runtime is the default runtime on all of your Docker nodes. To do this, edit the Docker daemon config file, which is usually found at /etc/docker/daemon.json.

Add this line:
"default-runtime": "nvidia",

For example:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

Then restart Docker:
systemctl restart docker

 

Once you have enabled this option on all of the GPU-capable nodes you wish to use, you can then enable GPU support in your cluster by deploying the following DaemonSet:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

 

Here are the extra bits you need to request GPUs for the pods:

$ cat gpu-pod-trident.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/digits:6.0
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "1"
      volumeMounts:
        - name: nfsvol
          mountPath: /data
        - name: nfsvol2
          mountPath: /data2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  volumes:
    - name: nfsvol
      persistentVolumeClaim:
        claimName: pvc-nfs
    - name: nfsvol2
      persistentVolumeClaim:
        claimName: basic

Naturally, you’d need NVIDIA-specific hardware, such as DGX-1 servers. As I mentioned before, NetApp has a solution for that! Be sure to look into the ONTAP AI converged architecture and the build orchestration around it for a truly simple way to deploy and manage AI/ML/DL workflows.

Justin Parisi
Justin Parisi is an ONTAP veteran, with over 10 years of experience at NetApp. He is co-host of the NetApp Tech ONTAP Podcast, Technical Advisor with the NetApp A-Team and works as a Technical Marketing Engineer at NetApp, focusing on FlexGroup volumes and all things NAS-related.
