The NetApp Data Science Toolkit is a Python program that makes it simple for data scientists and data engineers to perform advanced data management tasks. It can function as either a command line utility or a library of functions that can be imported into any Python program or Jupyter notebook. This enables data scientists and data engineers to augment the typical AI training workflow with advanced data management capabilities and greatly accelerate it. In this post, I will outline the typical workflow and provide details on how the NetApp Data Science Toolkit can help.
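Because the toolkit can be used as a library, every operation shown as a CLI command in this post can also be scripted. As a minimal sketch, assuming the functions are importable from a module named after the ntap_dsutil.py script (check the toolkit's documentation for the exact import path in your version):

# Minimal sketch: importing the toolkit's library functions into a notebook or
# Python program. The module name 'ntap_dsutil' is an assumption based on the
# CLI script name; refer to the toolkit's README for the exact import path.
from ntap_dsutil import createVolume, cloneVolume, createSnapshot, deleteVolume

Once imported, these functions can be called exactly as shown in the examples that follow.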

Step 1 – Identify training dataset

The first step in any AI training workflow is identifying the training dataset. This is the set of data that is going to be used to train the AI model.

Step 2 – Retrieve training dataset

Once the dataset has been identified, the data needs to be brought into the AI training environment. Typically, this is an environment that pairs AI-focused compute with high performance storage. This could be an on-premises NetApp ONTAP AI system or on-demand cloud-based GPU compute instances that are paired with NetApp Cloud Volumes ONTAP.

At this point in the process, the NetApp Data Science Toolkit can be used to provision a new data volume on which the training dataset will reside throughout the training process. The following CLI command can be executed from the training host (for example, a GPU compute instance) in order to provision a new data volume named ‘project1_gold’, 2TB in size, and mount the newly created volume at ‘~/project1_gold’ on the training host.

sudo -E ./ntap_dsutil.py create volume --name=project1_gold --size=2TB --mountpoint=~/project1_gold

Alternatively, the following function can be invoked from within a Jupyter notebook or Python program in order to perform the same operation.

createVolume(
    volumeName="project1_gold", 
    volumeSize="2TB", 
    mountpoint="~/project1_gold", 
    printOutput=True
)

After the new volume has been provisioned, the training dataset can be retrieved from the source location and placed on the new volume. This typically involves pulling the data from an S3 or Hadoop data lake, often with help from a data engineer. NetApp offers tools such as Cloud Sync and XCP that can help with this.
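As a purely illustrative sketch (not a replacement for Cloud Sync or XCP, which are much better suited to large datasets), a small dataset could also be pulled from an S3 bucket straight onto the newly mounted volume with a few lines of Python. The bucket name and prefix below are hypothetical.

# Hypothetical sketch: copy a small training dataset from an S3 bucket onto the
# newly provisioned volume mounted at ~/project1_gold. For large datasets, use
# NetApp Cloud Sync or XCP instead; this only illustrates where the data lands.
import os
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"       # hypothetical bucket name
prefix = "datasets/project1/"      # hypothetical prefix within the bucket
dest_dir = os.path.expanduser("~/project1_gold")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):   # skip "directory" placeholder objects
            continue
        local_path = os.path.join(dest_dir, os.path.relpath(obj["Key"], prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, obj["Key"], local_path)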

Step 3 – Prepare training dataset

A training dataset usually needs to be reformatted and normalized before it can be used to train a model. Data scientists often experiment with different dataset formats and normalization techniques in order to determine which combination produces the most accurate model. This usually means maintaining several versions of the same dataset and adds a significant amount of time to the training process. In typical setups, the data scientist has to sit and wait while each new dataset version is copied.

With the NetApp Data Science Toolkit, however, a data scientist can near-instantaneously create a new data volume that is an exact copy of an existing volume, even if the existing volume contains TBs or even PBs worth of data. This enables data scientists to rapidly create clones of datasets that they can reformat, normalize, and manipulate as needed, while preserving the original gold-source dataset. The following CLI command can be executed from the training host in order to create a volume named ‘project1_exp1’ that is an exact copy of the contents of volume ‘project1_gold’ and mount the newly created volume at ‘~/project1_exp1’ on the training host.

sudo -E ./ntap_dsutil.py clone volume --name=project1_exp1 --source-volume=project1_gold --mountpoint=~/project1_exp1

Alternatively, the following function can be invoked from within a Jupyter notebook or Python program in order to perform the same operation.

cloneVolume(
    newVolumeName="project1_exp1", 
    sourceVolumeName="project1_gold", 
    mountpoint="~/project1_exp1", 
    printOutput=True
)
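
As a purely illustrative example of the kind of preparation described above, and assuming the dataset consists of CSV files, the clone could then be normalized in place without touching the gold-source volume:

# Illustrative sketch: min-max normalize the numeric columns of CSV files on
# the cloned volume (~/project1_exp1), leaving the gold-source copy untouched.
# The CSV layout and pandas-based approach are assumptions for this example.
import glob
import os

import pandas as pd

clone_dir = os.path.expanduser("~/project1_exp1")

for csv_path in glob.glob(os.path.join(clone_dir, "*.csv")):
    df = pd.read_csv(csv_path)
    numeric_cols = df.select_dtypes(include="number").columns
    col_min = df[numeric_cols].min()
    col_max = df[numeric_cols].max()
    # Scale each numeric column to [0, 1]; columns with a single constant
    # value are left unchanged to avoid division by zero.
    col_range = (col_max - col_min).replace(0, 1)
    df[numeric_cols] = (df[numeric_cols] - col_min) / col_range
    df.to_csv(csv_path, index=False)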

Step 4 – Determine model architecture and train model

Finally, the data scientist is ready to actually train a model. They will determine the model architecture that they want to use and then train the model using the training dataset.
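
As an illustrative sketch only (the framework, file layout, and architecture below are assumptions for this example, not part of the toolkit), training might read directly from the cloned volume:

# Illustrative sketch: train a simple model on the prepared dataset that lives
# on the cloned volume. PyTorch, the CSV layout (last column = label), and the
# architecture are all assumptions made for this example.
import glob
import os

import pandas as pd
import torch
from torch import nn

clone_dir = os.path.expanduser("~/project1_exp1")

frames = [pd.read_csv(p) for p in glob.glob(os.path.join(clone_dir, "*.csv"))]
data = pd.concat(frames, ignore_index=True)
features = torch.tensor(data.iloc[:, :-1].values, dtype=torch.float32)
labels = torch.tensor(data.iloc[:, -1].values, dtype=torch.float32).unsqueeze(1)

# A deliberately simple architecture; in practice this is whatever the data
# scientist has chosen for the experiment.
model = nn.Sequential(nn.Linear(features.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), os.path.expanduser("~/project1_model.pt"))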

Step 5 – Repeat the process

Data scientists will typically repeat steps 3 and 4 many times, comparing model accuracy across different dataset formats and model architectures.

Step 6 – Identify best performing model

Of the different model versions that were produced in steps 3-5, the data scientist will identify the best-performing model and move forward with that version. This is the model that will eventually be deployed to production.

In regulated industries, it is necessary to implement dataset-to-model traceability for any model that is deployed to production. This usually involves yet another time-consuming dataset copy process, as the dataset that was used to train the chosen model is copied to a location where it will be kept.

With the NetApp Data Science Toolkit, however, a data scientist can near-instantaneously save a space-efficient, read-only copy of an existing data volume. This functionality, which uses NetApp’s famed Snapshot technology under the hood, can be used to save read-only versions of a dataset. The following CLI command can be executed from the training host in order to create a snapshot named ‘final_dataset’ for the volume named ‘project1_exp1’.

./ntap_dsutil.py create snapshot --volume=project1_exp1 --name=final_dataset

Alternatively, the following function can be invoked from within a Jupyter notebook or Python program in order to perform the same operation.

createSnapshot(
    volumeName="project1_exp1", 
    snapshotName="final_dataset", 
    printOutput=True
)

The data volume name and snapshot name can then be saved in the model repository as attributes of the model in order to quickly and easily implement dataset-to-model traceability.
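
For example (the model repository API varies by organization, so this sketch simply writes the attributes to a metadata file that could be stored alongside the model):

# Illustrative sketch: record dataset-to-model traceability attributes.
# The model identifier and metadata file location are assumptions; in practice
# these attributes would be stored in the team's model repository of choice.
import json
import os

traceability = {
    "model": "project1_model_v1",        # hypothetical model identifier
    "dataset_volume": "project1_exp1",   # volume the model was trained on
    "dataset_snapshot": "final_dataset", # read-only snapshot of that dataset
}

with open(os.path.expanduser("~/project1_model_metadata.json"), "w") as f:
    json.dump(traceability, f, indent=2)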

Step 7 – Clean up training environment

At this point in the process, any unneeded dataset clones can be deleted in order to free up valuable high-performance storage space for other projects. All of the clones that were created as step 3 was repeated, aside from the clone containing the dataset that was used to train the chosen model, can be deleted. If dataset-to-model traceability is implemented in a different environment, or if traceability isn't required, then all of the clones, as well as the original dataset volume that was created in step 2 (assuming that it isn't going to be used for another project), can be deleted. The following CLI command can be executed from the training host in order to delete an existing data volume named ‘project1_exp2’.

./ntap_dsutil.py delete volume --name=project1_exp2

Alternatively, the following function can be invoked from within a Jupyter notebook or Python program in order to perform the same operation.

deleteVolume(volumeName="project1_exp2", printOutput=True)
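
If step 3 was repeated several times, the remaining experiment clones can be removed in a loop as well (the additional volume names here are hypothetical):

# Illustrative sketch: delete all experiment clones that are no longer needed.
# 'project1_exp3' and 'project1_exp4' are hypothetical names for additional
# clones created while repeating step 3.
for volume in ["project1_exp2", "project1_exp3", "project1_exp4"]:
    deleteVolume(volumeName=volume, printOutput=True)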

Step 8 – Deploy model

Finally, the model will be deployed to production. Typically, this involves the data scientist handing the model off to an application development team that will deploy it as part of a production application. Some more agile organizations have even implemented CI/CD processes that enable their data scientists to commit models directly to production: the CI/CD process automatically picks up any model that a data scientist commits, runs some tests against it, and, if all required tests pass, deploys it to production.

As you can see, the NetApp Data Science Toolkit greatly accelerates the typical AI training process by eliminating bottlenecks related to dataset management. With NetApp’s help, data scientists spend less time waiting for data and more time actually doing data science and delivering business value.

Mike Oglesby
Technical Marketing Engineer at NetApp
Mike is a Technical Marketing Engineer at NetApp focused on MLOps and Data Pipeline solutions. He architects and validates full-stack AI/ML/DL data and experiment management solutions that span a hybrid cloud. Mike has a DevOps background and a strong knowledge of DevOps processes and tools. Prior to joining NetApp, Mike worked on a line of business application development team at a large global financial services company. Outside of work, Mike loves to travel. One of his passions is experiencing other places and cultures through their food.