The BeeGFS CSI Driver enables data scientists to build Kubernetes AI pipelines that provision and consume BeeGFS file system storage. But the applications in different pipeline steps often exhibit different IO patterns and require different storage characteristics. Administrators can prepare their environments for these needs with a simple combination of Kubernetes storage classes, BeeGFS stripe patterns, and BeeGFS storage pools.
When a file is stored in BeeGFS, chunks of a configurable size are placed on a configurable number of targets. A target can be anything from a directory within a server’s root filesystem, to a partition on a single drive, to a RAID volume on a high-performance NetApp storage array. Not all targets are created equal, and each target is highly configurable in and of itself. For example, a RAID volume may consist of SSDs or HDDs, use one of a variety of RAID levels, or use a particular segment size. BeeGFS allows administrators to assign targets to different storage pools representative of their underlying properties.
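As a sketch of how an administrator might group targets into pools, consider the commands below. The pool description and target IDs are illustrative, and exact flags may vary by BeeGFS version:

```shell
# Create a storage pool for a set of NVMe SSD targets and assign
# targets to it (the description and target IDs are illustrative)
beegfs-ctl --addstoragepool --desc="nvme-ssd" --targets=101,102,103,104

# Confirm the resulting pool layout
beegfs-ctl --liststoragepools
```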
A BeeGFS administrator can use the beegfs-ctl --setpattern command with the --chunksize, --numtargets, and --storagepoolid parameters when creating a directory to control how BeeGFS will stripe files within that directory, and thus how a particular IO pattern will perform. These BeeGFS configuration parameters can make a real difference, but no organization wants its data scientists spending time and energy understanding the minutiae of storage configuration. Fortunately, the BeeGFS CSI Driver makes it so that they don’t have to.
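For instance, directory-level striping might be configured as follows. The mount point, path, and parameter values are illustrative:

```shell
# Stripe files created under this directory across four targets in
# storage pool 2 with a 1m chunk size (path and values illustrative)
beegfs-ctl --setpattern --chunksize=1m --numtargets=4 \
    --storagepoolid=2 /mnt/beegfs/projects/large-sequential

# Verify the stripe pattern that new files in the directory will use
beegfs-ctl --getentryinfo /mnt/beegfs/projects/large-sequential
```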
When deploying and configuring the BeeGFS CSI Driver, administrators create one or more Kubernetes storage classes. A storage class carries the essential details about how a Kubernetes persistent volume will be provisioned, and a storage class referencing the BeeGFS CSI Driver can carry all the configuration information described above. For example, as a storage administrator, you might create a BeeGFS file system that includes multiple low latency, high performance NVMe SSD RAID targets from one or more NetApp EF600 arrays and multiple cost-effective NL-SAS HDD RAID targets from one or more NetApp E5700 arrays. You place the targets into separate storage pools and create the following storage classes:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: small-file-random-multi-process
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "1" # pool with SSD NVMe targets
  stripePattern/chunkSize: 512k
  stripePattern/numTargets: "4"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: small-file-random-single-process
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "1" # pool with SSD NVMe targets
  stripePattern/chunkSize: 512k
  stripePattern/numTargets: "1"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: large-file-sequential
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "2" # pool with HDD NLSAS targets
  stripePattern/chunkSize: 4m
  stripePattern/numTargets: "4"
As a storage administrator, you have a good understanding of how these parameters affect performance. For example, four HDD-based targets and a large chunk size can support highly parallel streams of sequential data, while a single target with a small chunk size is better for a single random writer. Your data scientists may not understand these parameters, and that’s fine! When a data scientist queries the cluster to see what storage is available, they see the following:
--> kubectl get sc
NAME                               PROVISIONER             RECLAIMPOLICY   AGE
small-file-random-multi-process    beegfs.csi.netapp.com   Delete          2d
small-file-random-single-process   beegfs.csi.netapp.com   Delete          2d
large-file-sequential              beegfs.csi.netapp.com   Delete          2d
Of course, they could examine each storage class to get more information, but they don’t need to. Each pod they provision in an AI pipeline includes a reference to one or more Kubernetes persistent volume claims. Each persistent volume claim is bound to a Kubernetes persistent volume which describes a brand new directory created on your BeeGFS file system, striped across your intended storage pool with your intended striping parameters. For example, if a data scientist needs to transform some large external dataset into several large files which will eventually be accessed by future pipeline steps, they can submit the following claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transformed-data-for-future-steps
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: large-file-sequential
If that same scientist needs a shared scratch space that can handle random IO from many Kubernetes pods simultaneously, they can submit the following claim instead of (or in addition to) the one above:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: multi-pod-scratch-space
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: small-file-random-multi-process
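A pod then consumes the claim like any other Kubernetes volume. A minimal sketch, in which the pod name, container image, and mount path are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: preprocessing-step          # hypothetical pod name
spec:
  containers:
    - name: worker
      image: my-registry/preprocess:latest   # hypothetical image
      volumeMounts:
        - name: scratch
          mountPath: /scratch       # hypothetical mount path
  volumes:
    - name: scratch
      persistentVolumeClaim:
        claimName: multi-pod-scratch-space   # the claim defined above
```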
Kubernetes and the BeeGFS CSI Driver ensure that required storage is provisioned and the appropriate pods can access it. If a name is not enough to help data scientists distinguish between classes, you can use Kubernetes labels to provide additional context. If the level of detail in these example names is too much or doesn’t apply to your particular set of use cases, you can choose whatever makes the most sense (e.g., gold, silver, and bronze or something else entirely).
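For example, labels on a storage class can surface intent without requiring anyone to read its parameters. The label keys and values below are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: large-file-sequential
  labels:                                  # illustrative labels
    beegfs.example.com/media: hdd
    beegfs.example.com/access-pattern: sequential
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.10.10.10
  volDirBasePath: /kubernetes_vols/
  stripePattern/storagePoolID: "2"
  stripePattern/chunkSize: 4m
  stripePattern/numTargets: "4"
```

A data scientist could then filter with a label selector, e.g. kubectl get sc -l beegfs.example.com/media=hdd, to find the classes suited to a workload.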
The flexibility provided by BeeGFS and the BeeGFS CSI Driver ensures that your NetApp-backed BeeGFS storage will always meet your data scientists’ needs. As always, remember to visit netapp.com/ai to learn more about this and other NetApp AI and HPC solutions.