The BeeGFS CSI Driver enables data scientists to build Kubernetes AI pipelines that provision and consume BeeGFS file system storage. But applications in different pipeline steps often use different I/O patterns and require different storage characteristics. Administrators can prepare their environments for these needs with a simple combination of Kubernetes storage classes, BeeGFS stripe patterns, and BeeGFS storage pools.

When a file is stored in BeeGFS, chunks of a configurable size are placed on a configurable number of targets. A target can be anything from a directory within a server's root filesystem, to a partition on a single drive, to a RAID volume on a high-performance NetApp storage array. Not all targets are created equal, and each target is highly configurable in and of itself. For example, a RAID volume may consist of SSDs or HDDs, use one of a variety of RAID levels, or use a particular segment size. BeeGFS provides the ability to assign targets to different storage pools that reflect their underlying properties.
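For instance, an administrator might group flash-backed targets into their own pool with commands like the following (the target IDs and pool description here are illustrative):

# Group four flash-backed targets into a new storage pool
beegfs-ctl --addstoragepool --desc="nvme-ssd" --targets=101,102,103,104

# List all pools and their target assignments
beegfs-ctl --liststoragepools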

A BeeGFS administrator can use the beegfs-ctl --setpattern command with its --chunksize, --numtargets, and --storagepoolid parameters when creating a directory to control how BeeGFS stripes files within that directory, and thus how a particular I/O pattern will perform. These BeeGFS configuration parameters can make a real difference, but no organization wants its data scientists spending time and energy understanding the minutiae of storage configuration. Fortunately, the BeeGFS CSI Driver makes it so that they don't have to.
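For the administrator, setting such a pattern might look like this (the mount path and values are illustrative):

# Stripe new files in this directory across 4 targets from pool 2 in 4m chunks
beegfs-ctl --setpattern --chunksize=4m --numtargets=4 --storagepoolid=2 /mnt/beegfs/large_sequential_data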

When deploying and configuring the BeeGFS CSI Driver, administrators create one or more Kubernetes storage classes. A storage class carries the essential details about how a Kubernetes persistent volume will be provisioned, and a storage class referencing the BeeGFS CSI Driver can carry all the configuration information described above. For example, as a storage administrator, you might create a BeeGFS file system that includes multiple low-latency, high-performance NVMe SSD RAID targets from one or more NetApp EF600 arrays and multiple cost-effective NL-SAS HDD RAID targets from one or more NetApp E5700 arrays. You place the targets into separate storage pools and create the following storage classes:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: small-file-random-multi-process
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "1"  # pool with SSD NVMe targets
  stripePattern/chunkSize: 512k
  stripePattern/numTargets: "4"

---

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: small-file-random-single-process
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "1"  # pool with SSD NVMe targets
  stripePattern/chunkSize: 512k
  stripePattern/numTargets: "1"

---

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: large-file-sequential
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "2"  # pool with HDD NLSAS targets
  stripePattern/chunkSize: 4m
  stripePattern/numTargets: "4"

As a storage administrator, you have a good understanding of how these parameters affect performance. For example, four HDD-based targets and a large chunk size can support highly parallel streams of sequential data, while a single target with a small chunk size is better for a single random writer. Your data scientists may not understand these parameters, and that’s fine! When a data scientist queries the cluster to see what storage is available, they see the following:

--> kubectl get sc
NAME                               PROVISIONER             RECLAIMPOLICY   AGE
large-file-sequential              beegfs.csi.netapp.com   Delete          2d
small-file-random-multi-process    beegfs.csi.netapp.com   Delete          2d
small-file-random-single-process   beegfs.csi.netapp.com   Delete          2d

Of course, they could examine each storage class to get more information, but they don't need to. Each pod they provision in an AI pipeline includes a reference to one or more Kubernetes persistent volume claims. Each persistent volume claim is bound to a Kubernetes persistent volume, which describes a brand-new directory created on your BeeGFS file system, striped across your intended storage pool with your intended striping parameters. For example, if a data scientist needs to transform some large external dataset into several large files that will eventually be accessed by future pipeline steps, they can submit the following claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transformed-data-for-future-steps
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: large-file-sequential

If that same scientist needs a shared scratch space that can handle random I/O from many Kubernetes pods simultaneously, they can submit the following claim instead of (or in addition to) the one above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: multi-pod-scratch-space
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: small-file-random-multi-process
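To consume a claim, a pod simply references it by name. As a minimal sketch (the container image and mount path are placeholders), a pipeline step using the scratch space might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: pipeline-step
spec:
  containers:
  - name: worker
    image: my-pipeline-image:latest  # placeholder image
    volumeMounts:
    - name: scratch
      mountPath: /scratch  # placeholder mount path
  volumes:
  - name: scratch
    persistentVolumeClaim:
      claimName: multi-pod-scratch-space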

Kubernetes and the BeeGFS CSI Driver ensure that the required storage is provisioned and that the appropriate pods can access it. If a name is not enough to help data scientists distinguish between classes, you can use Kubernetes labels to provide additional context, as in the sketch below. If the level of detail in these example names is too much or doesn't apply to your particular set of use cases, you can choose whatever names make the most sense (e.g., gold, silver, and bronze, or something else entirely).
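For example, you might attach a hypothetical workload label to each class (the label key and value here are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: large-file-sequential
  labels:
    workload: large-sequential  # hypothetical label; choose keys that fit your environment
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "2"  # pool with NL-SAS HDD targets
  stripePattern/chunkSize: 4m
  stripePattern/numTargets: "4"

Data scientists could then run kubectl get sc -l workload=large-sequential to list only the classes suited to their job.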

The flexibility provided by BeeGFS and the BeeGFS CSI Driver ensures that your NetApp-backed BeeGFS storage will always meet your data scientists’ needs. As always, remember to visit netapp.com/ai to learn more about this and other NetApp AI and HPC solutions.

Eric Weber
Software Engineer at NetApp
Eric is a relative newcomer to the industry, having joined NetApp four years ago after an early career as a high school science and engineering teacher. He took an immediate interest in both high performance computing and cloud technologies, and spends most of his time developing solutions that make use of the latter to serve the needs of the former. Eric, his wife, and his two children enjoy many things about their life in Wichita, KS, but they take every opportunity to get away to the mountains, where hiking and backpacking are their favorite pastimes.
