Starting in v1.2.1 the NetApp Data Science Toolkit (DSTK) for Kubernetes officially supports BeeGFS. While some parallel file systems get a bad reputation for being overly complicated to manage, BeeGFS was designed to deliver a simpler experience without compromising on performance. Today I’m going to demonstrate how to set up BeeGFS, then use it with the DSTK to create a containerized JupyterLab workspace in Kubernetes.

Prerequisites

While this post will highlight BeeGFS, if your Kubernetes cluster already has Trident set up and working with one of the supported Trident backend types configured as a storage class, you can skip the BeeGFS setup and jump straight to where we deploy JupyterLab workspaces (Step 5).

Otherwise, to follow along you’ll need a few things:

  • A host (VM or bare metal) where we can set up BeeGFS, or an existing BeeGFS file system that is accessible from the Kubernetes cluster.
  • A Kubernetes cluster (one node is fine) where we can deploy the BeeGFS CSI driver and create JupyterLab workspaces. Note: to keep this article focused on BeeGFS, CSI, and Jupyter, we won’t provide guidance on deploying Kubernetes itself. If you don’t already have a cluster, you might want to check out Kubespray.
  • A machine with Python 3.6+ and git installed, along with kubectl access to the Kubernetes cluster.

You’ll also need your favorite text editor and perhaps a cup of coffee, but that’s about it. Ready? Let’s go. In and out. Twenty-minute adventure.

Step 1: Set up a BeeGFS Parallel File System

To try things out we’ll set up a single node BeeGFS file system roughly following the BeeGFS Quick Start Guide (you can refer to the guide for more details on each step). Everything in this section should be completed on the host where you wish to set up BeeGFS:

  1. Download the BeeGFS repository file for your distribution, for example if you’re running RedHat 7/CentOS 7:
    wget -O /etc/yum.repos.d/beegfs_rhel7.repo https://www.beegfs.io/release/beegfs_7.2.1/dists/beegfs-rhel7.repo
  2. Install all the BeeGFS server packages:
    yum install beegfs-mgmtd beegfs-meta beegfs-storage
    Note: If you want RDMA support, also install libbeegfs-ib.
  3.  Next, we’ll create a directory where BeeGFS can store data:
    mkdir -p /data/beegfs/
  4. Then we’ll run the following commands to set up each service:
    /opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/beegfs_mgmtd
    /opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/beegfs_meta -s 1 -m localhost
    /opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/beegfs_storage -s 1 -i 101 -m localhost
    Note: In production use cases, for each metadata/storage target represented by “-p” we’d want to use a dedicated drive, or more likely multiple drives presented as a logical volume from an enterprise storage system that protects data using RAID or erasure coding (like NetApp E-Series – but I may be slightly biased).
  5. If you are running a firewall you will need to open ports so the BeeGFS services can communicate. Please see the ports listed here.
  6. Using your favorite text editor, create a file “/etc/beegfs/connInterfaceFile.conf” and add at least one network interface that is accessible from the Kubernetes nodes, for example:
    # cat connInterfaceFile.conf
    eth0
    eth1
  7. In each “/etc/beegfs/beegfs-[mgmtd|meta|storage].conf” file, update the connInterfacesFile parameter to point at “/etc/beegfs/connInterfaceFile.conf” (a one-liner for this is sketched after this list).
    Note: This ensures the Kubernetes nodes (BeeGFS clients) don’t try to connect over a path that is unreachable, which can otherwise be a source of frustration.
  8. Run the following commands to start the BeeGFS services then enable them so they start when the system reboots:
    systemctl start beegfs-mgmtd beegfs-meta beegfs-storage
    systemctl enable beegfs-mgmtd beegfs-meta beegfs-storage
  9. Use “systemctl status <service>” to verify everything started up properly.
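If you’d like items 7 and 9 as a copy/paste block, here’s a minimal sketch. It assumes the stock config files shipped with the packages above (which declare an empty connInterfacesFile setting in recent 7.x releases) and that you’re running as root:

    # Point each server service at the interface file from step 6
    for svc in mgmtd meta storage; do
      sed -i 's|^connInterfacesFile.*|connInterfacesFile = /etc/beegfs/connInterfaceFile.conf|' /etc/beegfs/beegfs-${svc}.conf
    done

    # After starting the services (step 8), confirm they are active and listening
    systemctl status beegfs-mgmtd beegfs-meta beegfs-storage --no-pager
    ss -tlnp | grep beegfs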

Step 2: Configure Kubernetes Nodes as BeeGFS Clients

The BeeGFS Container Storage Interface (CSI) driver allows us to use BeeGFS with Kubernetes. While the driver does most of the heavy lifting, we do have to install the BeeGFS DKMS client. The following steps should be taken on each Kubernetes node that needs BeeGFS access:

  1. As with the server setup you’ll need to download/install the BeeGFS repository file for your Linux distribution.
  2. Install the BeeGFS Client DKMS using your package manager:
    • If on RedHat/CentOS you’ll need to activate the EPEL repository then run:
      yum install kernel-devel-$(uname -r) beegfs-client-dkms beegfs-helperd beegfs-utils
    • If on Ubuntu run:
      sudo apt-get install linux-headers-$(uname -r) beegfs-client-dkms beegfs-helperd beegfs-utils
  3. Load the module by running:
    sudo modprobe beegfs
  4. To ensure the module is loaded automatically at boot create the file “/etc/modules-load.d/beegfs-client-dkms.conf” with the following contents:
    # Load the BeeGFS client module at boot
    beegfs
  5. Run the following to start the BeeGFS helper daemon (facilitates logging and hostname resolution):
    systemctl start beegfs-helperd && systemctl enable beegfs-helperd

Note that mounting and unmounting BeeGFS will be handled by the CSI driver.
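Before moving on, a quick sanity check on each node doesn’t hurt. Something like the following confirms the DKMS build succeeded and the helper daemon is up:

    # Confirm the BeeGFS client kernel module built and loads
    modinfo beegfs | head -n 3
    lsmod | grep beegfs

    # Confirm the helper daemon is running
    systemctl status beegfs-helperd --no-pager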

Step 3: Deploy the BeeGFS CSI Driver to Kubernetes

To install the BeeGFS CSI driver run the following from a terminal on a machine (could be the master node, jumphost, or your local machine) that has kubectl access to the Kubernetes cluster where you want to deploy the driver:

  1. Run this one-liner to clone the driver repository, deploy the driver to your Kubernetes cluster, and check on its Pods:
    git clone https://github.com/NetApp/beegfs-csi-driver.git && cd beegfs-csi-driver && kubectl apply -k deploy/prod && kubectl get pods -n kube-system | grep csi-beegfs
  2. Once “kubectl get pods -n kube-system | grep csi-beegfs” shows that all Pods are running, the driver is ready for use.

If you wish to customize the BeeGFS client configuration see the full Deployment guide for a comprehensive overview including sample outputs.
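If you want confirmation beyond the Pods being Running, a couple of standard kubectl queries help. The CSIDriver object name below matches the provisioner string we’ll use in the next step; confirm against the deployment guide if yours differs:

    # The CSIDriver object registered by the deployment
    kubectl get csidriver beegfs.csi.netapp.com

    # The controller and per-node driver Pods
    kubectl get pods -n kube-system -o wide | grep csi-beegfs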

Step 4: Tell Kubernetes about the BeeGFS File System

The BeeGFS CSI driver enables both static and dynamic storage provisioning. With static provisioning, an existing directory in BeeGFS can be exposed to applications in Kubernetes. With dynamic provisioning, applications can create and access new directories in BeeGFS on demand. This blog post focuses on the dynamic workflow to provide on-demand storage for JupyterLab workspaces.
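As a quick illustration of the static side (we won’t use it again in this post), a statically provisioned volume is just a PersistentVolume that points the driver at an existing BeeGFS directory. The sketch below is from memory rather than from the driver’s docs, so treat the volumeHandle format (“beegfs://<sysMgmtdHost>/<path>”) as an assumption and check the driver’s static provisioning examples before relying on it:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: csi-beegfs-static-pv          # illustrative name
    spec:
      capacity:
        storage: 10Gi                     # informational for static volumes
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: beegfs.csi.netapp.com
        # Assumed handle format: beegfs://<sysMgmtdHost>/<existing/dir/in/beegfs>
        volumeHandle: beegfs://10.113.73.172/existing/dir
    EOF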

To tell Kubernetes about the BeeGFS file system we set up, we have to create a storage class. On the machine with kubectl access perform the following steps:

  1. Create a file called “beegfs-dyn-sc.yaml” with the following contents, replacing the value for “sysMgmtdHost” with the IP (or hostname) of the node where you installed the BeeGFS server services. If desired, update “volDirBasePath” to reflect the path where you want the driver to create on-demand directories (this directory will be created automatically if needed):
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: csi-beegfs-dyn-sc # Note: This will be needed when we create a JupyterLab workspace.
    provisioner: beegfs.csi.netapp.com
    parameters:
      sysMgmtdHost: 10.113.73.172
      volDirBasePath: k8s/
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    allowVolumeExpansion: false
    Note: Additional BeeGFS Storage Class parameters are available to control data placement and optimize performance using stripe patterns. Check out this blog post if you’re interested in learning more.
  2. Then run the following to apply the new storage class:
    kubectl apply -f beegfs-dyn-sc.yaml
    storageclass.storage.k8s.io/csi-beegfs-dyn-sc created

To avoid conflicting with any existing storage classes you have configured, we aren’t setting the new storage class as the default. If that’s desired, see the Kubernetes documentation on changing the default Storage Class for more details.
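For reference, this is roughly the kind of PersistentVolumeClaim that dynamic provisioning expects against the new storage class. You won’t need to create it yourself in this walkthrough (the toolkit generates an equivalent PVC in the next step), and the claim name, access mode, and size below are just illustrative:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: csi-beegfs-dyn-pvc        # illustrative name
    spec:
      accessModes:
        - ReadWriteMany               # BeeGFS is a shared file system, so RWX is a natural fit
      storageClassName: csi-beegfs-dyn-sc
      resources:
        requests:
          storage: 1Gi
    EOF

When the claim binds, the driver creates a new directory for it under the volDirBasePath we configured (k8s/) on the BeeGFS file system.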

Step 5: Launch JupyterLab Workspaces

There are two “flavors” of the NetApp Data Science Toolkit:

  • The Traditional (sometimes known as the “Classic”) flavor of the toolkit provides a Python-based command line utility and library of functions that simplify interacting with NetApp storage systems. This version is typically used in bare-metal environments, for example if you just SSH to your GPU servers or train on your local machine.
  • The Kubernetes flavor provides similar functionality but can up-level management of storage resources and Kubernetes workloads to the data science workspace level using JupyterLab.

The remainder of this blog post will focus on using the Kubernetes flavor of the toolkit. On the machine where you have kubectl access to Kubernetes perform the following steps:

  1. Run the following one-liner to clone the repository, create and activate a new virtual environment, install the required dependencies, set up a .gitignore file, and cd to the Kubernetes directory:
    git clone https://github.com/NetApp/netapp-data-science-toolkit.git && cd netapp-data-science-toolkit && python3 -m venv venv/ && source venv/bin/activate && pip3 install ipython kubernetes pandas tabulate && echo -e ".gitignore\nvenv/" >> .gitignore && cd Kubernetes
  2. Now we can use the toolkit to create a new JupyterLab workspace using the BeeGFS storage class we set up earlier. Run the following from the Kubernetes directory, replacing <workspace> with the desired name:
    ./ntap_dsutil_k8s.py create jupyterlab --storage-class=csi-beegfs-dyn-sc --workspace-name=<workspace> --size=1Gi
    Sample Output:
    ./ntap_dsutil_k8s.py create jupyterlab --storage-class=csi-beegfs-dyn-sc --workspace-name=joe --size=1Gi
    Set workspace password (this password will be required in order to access the workspace):

    Re-enter password:
    Creating persistent volume for workspace...
    Creating PersistentVolumeClaim (PVC) 'ntap-dsutil-jupyterlab-joe' in namespace 'default'.
    PersistentVolumeClaim (PVC) 'ntap-dsutil-jupyterlab-joe' created. Waiting for Kubernetes to bind volume to PVC.
    Volume successfully created and bound to PersistentVolumeClaim (PVC) 'ntap-dsutil-jupyterlab-joe' in namespace 'default'.
    Creating Service 'ntap-dsutil-jupyterlab-joe' in namespace 'default'.
    Service successfully created.
    Creating Deployment 'ntap-dsutil-jupyterlab-joe' in namespace 'default'.
    Deployment 'ntap-dsutil-jupyterlab-joe' created.
    Waiting for Deployment 'ntap-dsutil-jupyterlab-joe' to reach Ready state.
    Deployment successfully created.
    Workspace successfully created.

    To access workspace, navigate to http://10.113.72.26:30076

    Note: When creating JupyterLab workspaces there are a number of additional options, including the ability to request GPUs or specify the memory/CPU available to the JupyterLab workspace. Those parameters are omitted for brevity, but documented here.
  3. At this point we can navigate to the new JupyterLab server in a browser and verify it is indeed backed by our BeeGFS file system:
Output of the df command run in a JupyterLab notebook, showing the available BeeGFS capacity. If you’re wondering why df shows 462G when we only requested 1Gi, see the BeeGFS CSI driver’s documentation on capacity.
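If you’d rather verify from the workspace itself, a quick check like the following works from a JupyterLab terminal (or with a leading “!” in a notebook cell). The exact mount path of the workspace volume depends on the toolkit’s JupyterLab image, so filtering on the file system type is the portable approach:

    # Show only BeeGFS-backed mounts visible inside the workspace
    df -hT | grep -i beegfs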

It’s really that simple! And of course, this is just where the fun begins. If you had a full-blown BeeGFS file system, you would have access to a JupyterLab workspace that can scale to PBs of capacity with hundreds of GB/s of performance.

Step 6: List Running Workspaces/Volumes

  1. We can list all JupyterLab workspaces as follows:
    ./ntap_dsutil_k8s.py list jupyterlabs
    Workspace Name    Status    Size    StorageClass       Access URL                 Clone    Source Workspace    Source VolumeSnapshot
    ----------------  --------  ------  -----------------  -------------------------  -------  ------------------  -----------------------
    joe               Ready     1Gi     csi-beegfs-dyn-sc  http://10.113.72.26:30076  No
  2. If we want to see a list of volumes regardless of whether they’re associated with a JupyterLab workspace, we can use the following:
    ./ntap_dsutil_k8s.py list volumes
    PersistentVolumeClaim (PVC) Name    Status    Size    StorageClass       Clone    Source PVC    Source VolumeSnapshot
    ----------------------------------  --------  ------  -----------------  -------  ------------  -----------------------
    ntap-dsutil-jupyterlab-joe          Bound     1Gi     csi-beegfs-dyn-sc  No
  3. Lastly, if we no longer need the JupyterLab workspace, we can delete it:
    ./ntap_dsutil_k8s.py delete jupyterlab --workspace-name=joe
    Warning: All data associated with the workspace will be permanently deleted.
    Are you sure that you want to proceed? (yes/no): yes
    Deleting workspace 'joe' in namespace 'default'.
    Deleting Deployment...
    Deleting Service...
    Deleting PVC...
    Deleting all VolumeSnapshots associated with PersistentVolumeClaim (PVC) 'ntap-dsutil-jupyterlab-joe' in namespace 'default'...
    Deleting PersistentVolumeClaim (PVC) 'ntap-dsutil-jupyterlab-joe' in namespace 'default' and associated volume.
    PersistentVolumeClaim (PVC) successfully deleted.
    Workspace successfully deleted.

Step 7: Clean Up (Optional)

  1. After deleting any JupyterLab workspaces and BeeGFS persistent volumes from your Kubernetes cluster, if you want to uninstall the driver simply run “kubectl delete -k deploy/prod” from the directory where you cloned the BeeGFS CSI Driver repository.
  2. If you want to uninstall the BeeGFS file system, stop the services on the host where they’re installed in the order beegfs-storage, beegfs-meta, beegfs-mgmtd, then uninstall them using your package manager (e.g., yum remove <package>) and delete the /data/beegfs directories. Optionally delete /etc/beegfs. A copy/paste sketch follows this list.
  3. Uninstall the BeeGFS DKMS client from your Kubernetes nodes by running “sudo rmmod beegfs”, uninstalling the beegfs-client-dkms, beegfs-helperd, and beegfs-utils packages using your package manager, and deleting the “/etc/modules-load.d/beegfs-client-dkms.conf” file. Optionally delete /etc/beegfs.
    Note: If you don’t need them, you can also uninstall the kernel-devel or linux-headers packages.
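If you’d like item 2 as a copy/paste block, here’s the gist, assuming the single-node layout from Step 1 (swap yum for apt on Debian/Ubuntu, and only remove the data directories once you’re sure nothing in them is needed):

    # Stop and disable the server services: storage first, then metadata, then management
    systemctl stop beegfs-storage beegfs-meta beegfs-mgmtd
    systemctl disable beegfs-storage beegfs-meta beegfs-mgmtd

    # Remove the packages, then the data (and optionally config) directories
    yum remove beegfs-storage beegfs-meta beegfs-mgmtd
    rm -rf /data/beegfs
    # rm -rf /etc/beegfs   # optional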

Now What?

That’s up to you! (Did I mention you’d be driving this rocket ship?)

To learn more about the latest features of the NetApp Data Science Toolkit or the BeeGFS CSI Driver check out their respective repositories on GitHub. And a few plugs for some of my other blog posts:

  • If you’re wondering how BeeGFS fits into AI and modern analytics I’ve got a post for you.
  • While everything we talked about is available/open source (it’s not a trap!), you might want a fully supported solution with NetApp E-Series providing six-nines reliability before moving to production.

And as always, we’d love to hear about your challenges around data. To continue the conversation drop us a line at ng-ai-inquiry@netapp.com.

Joe McCormick
Software Engineer at NetApp
Joe develops storage solutions for high performance computing, focusing in particular on solving challenges for AI at scale. From parallel file systems to Kubernetes, he enjoys tackling big storage problems in a cloud-native world. Outside work you may find him at the lake, buried in a book, or re-watching The Office.
