A data scientist will often modify their training dataset many times as they iterate through different training runs, applying different normalization techniques, hyperparameters, and so on. This can become a problem if the data scientist ever needs to revert to a previous dataset structure or format. If they didn't save a copy of the dataset in that structure or format, they will need to re-create it, which can be time-consuming and cumbersome.
Data scientists operating in regulated industries face an additional challenge: in these industries, it is necessary to implement dataset-to-model traceability for any model that is deployed to production. Traceability is also a key component of "Explainable AI," a concept that is becoming increasingly important as AI finds its way into ever more facets of society. At a high level, "Explainable AI" describes the need to be able to explain the decisions that an AI algorithm is making. This article from Forbes explores the topic in greater depth. Explainability and transparency are necessary if we are ever to trust AI algorithms, and dataset-to-model traceability is an integral part of providing them. The following chart from IDC highlights the importance of "Explainable AI."
If a data scientist is required to implement traceability, then each time they wish to reformat or modify a dataset, they will need to save a copy of the existing dataset so that the current version is not lost. If the model that was trained on that version of the dataset is the one that ends up being deployed to production, the data scientist must have retained that dataset version in order to satisfy traceability requirements.
In typical setups, when a data scientist needs to save a copy of an existing dataset, they must wait while the dataset is copied to a backup location. Assuming a standard, non-performance-optimized corporate network on which a 1 GB file takes roughly 6 seconds to copy, copying a 10 TB dataset would take about 17 hours! That's 17 hours the data scientist can't spend experimenting with different dataset formats and improving the accuracy of their model.
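The 17-hour figure follows directly from the stated assumptions. A quick back-of-the-envelope calculation (using the article's rates; the 6-seconds-per-GB figure is illustrative, not a measurement):

```python
# Estimate dataset copy time over a typical corporate network,
# using the article's assumption of ~6 seconds per 1 GB.
seconds_per_gb = 6
dataset_size_gb = 10 * 1024  # 10 TB expressed in GB (binary units)

total_seconds = dataset_size_gb * seconds_per_gb
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # roughly 17 hours
```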
The NetApp Data Science Toolkit can reduce this wait time from hours to seconds. With the NetApp Data Science Toolkit, a data scientist can near-instantaneously save a space-efficient, read-only copy of an existing data volume. This functionality, which uses NetApp's battle-tested Snapshot technology under the hood, makes it quick and easy to preserve read-only versions of a dataset. For example, a data scientist could execute the following CLI command to create a snapshot named 'run1' for the volume named 'project1'.
./ntap_dsutil.py create snapshot --volume=project1 --name=run1
Alternatively, a data scientist could invoke the following function from within a Jupyter notebook or Python program to perform the same operation.
createSnapshot( volumeName="project1", snapshotName="run1", printOutput=True )
Implementing dataset-to-model traceability with MLflow and the NetApp Data Science Toolkit
If a data scientist is using an experiment tracking server and/or model repository tool to track training runs and model versions, then implementing dataset-to-model traceability with the NetApp Data Science Toolkit is incredibly simple. The data scientist creates a snapshot of the dataset volume using the NetApp Data Science Toolkit and then saves the snapshot name as an attribute of the training run or model.
For example, a data scientist who is using MLflow, a popular open source AI lifecycle management platform, can execute the following lines of Python code to implement traceability. This code saves the data volume name and snapshot name as tags associated with the specific training run.
with mlflow.start_run():
    ...
    dataVolumeName = "project1"  # Name of the volume containing the dataset
    snapshotName = "run1"        # Name of the new snapshot

    # Create the snapshot
    createSnapshot(volumeName=dataVolumeName, snapshotName=snapshotName, printOutput=True)

    # Log the data volume name and snapshot name as "tags"
    # associated with this training run in MLflow.
    mlflow.set_tag("data_volume_name", dataVolumeName)
    mlflow.set_tag("snapshot_name", snapshotName)
    ...
The data volume name and snapshot name will be visible in the MLflow tracking UI as shown in the following screenshot.
Although this example uses MLflow, the same concept is applicable to other model repositories and tracking servers.
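To illustrate how the pattern transfers to other tools, here is a minimal sketch that uses a plain Python dictionary as a stand-in for any tracking backend (the dictionary, the `log_run_traceability` helper, and its parameter names are illustrative inventions, not part of the NetApp toolkit or MLflow). The essential step is always the same: create the snapshot, then record the volume name and snapshot name alongside the run's metadata.

```python
# Sketch of the traceability pattern, with a dict standing in for
# any experiment tracking backend. In practice, the tracker's own
# tag/metadata API (e.g., mlflow.set_tag) replaces the dict writes.

def log_run_traceability(tracker: dict, run_id: str, volume: str, snapshot: str) -> None:
    """Record which data volume and snapshot a training run used."""
    run_tags = tracker.setdefault(run_id, {})
    run_tags["data_volume_name"] = volume
    run_tags["snapshot_name"] = snapshot

runs = {}
# After calling createSnapshot(...), record the names against the run:
log_run_traceability(runs, run_id="run1", volume="project1", snapshot="run1")
print(runs["run1"])  # {'data_volume_name': 'project1', 'snapshot_name': 'run1'}
```

Because the snapshot name is stored as ordinary run metadata, any tool that supports key-value tags on runs or model versions can hold it.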
As you can see, the NetApp Data Science Toolkit can greatly simplify and accelerate the process of implementing dataset-to-model traceability for AI projects. With NetApp's help, data scientists spend less time making copies of datasets and more time actually doing data science and delivering business value. To get started with the NetApp Data Science Toolkit, visit the toolkit's GitHub repository.