Jupyter Notebook is an open source web application targeted towards data scientists and data engineers that enables users to create wiki-like documents, called notebooks, that contain blocks of live code paired with descriptive text. These code blocks can be executed on demand from within the interactive Jupyter Notebook web interface, which facilitates experimentation and rapid iteration. Jupyter Notebooks are widely used in the artificial intelligence (AI), machine learning (ML), and deep learning (DL) communities as a means of documenting, sharing, and storing projects.

When the datasets used for ML/DL experimentation are stored on NetApp ONTAP volumes that are mounted into the Jupyter Notebook workspace, users can easily and instantly trigger NetApp Snapshots to save read-only copies of the dataset. These Snapshots can be used to version datasets and implement dataset-to-model traceability in a self-service manner directly from within the Jupyter Notebook interface. This versioning and traceability mechanism is incredibly storage-efficient, as a Snapshot only consumes the storage capacity needed to preserve incremental changes to the source volume. This post walks through the process of creating a Snapshot from within a Jupyter Notebook.

First, ensure that you have the ‘netapp_ontap’ Python library installed within your Jupyter Notebook environment by executing the following:

%pip install --user netapp_ontap

Next, import all needed functions and classes:

from netapp_ontap import config as netappConfig
from netapp_ontap.host_connection import HostConnection as NetAppHostConnection
from netapp_ontap.resources import Volume, Snapshot
from datetime import datetime
import json

Now you will need to set up the connection to your ONTAP cluster. Replace the variable values with the connection details for your ONTAP cluster, and execute the following:

## Enter connection details for your ONTAP cluster/instance
ontapClusterMgmtHostname = ''
ontapClusterAdminUsername = 'admin'
ontapClusterAdminPassword = 'NetApp!23'
verifySSLCert = False

netappConfig.CONNECTION = NetAppHostConnection(
    host = ontapClusterMgmtHostname,
    username = ontapClusterAdminUsername,
    password = ontapClusterAdminPassword,
    verify = verifySSLCert
)
You will need to set the ‘volumeName’ variable. If you are running your Jupyter Notebook on Kubernetes, using NetApp Trident to orchestrate persistent storage within the Kubernetes cluster, and you did not specify a custom storagePrefix when creating your Trident backend, you can use the following code to convert a Kubernetes PersistentVolume (PV) name to an ONTAP volume name. Simply set the ‘pvName’ variable equal to the name of the PV for which you wish to create a Snapshot, and execute this code.

## Enter the name of pv for which you are creating a snapshot
##   Note: To get the name of the pv, you can run `kubectl -n <namespace>
##         get pvc`. The name of the pv that corresponds to a given pvc 
##         can be found in the 'VOLUME' column.
pvName = 'pvc-db963a53-abf2-4ffa-9c07-8815ce78d506'

# The following will not work if you specified a custom storagePrefix when
#   creating your Trident backend.
volumeName = 'trident_%s' % pvName.replace("-", "_")
print('pv name:', pvName)
print('ONTAP volume name:', volumeName)

If you are not running your Jupyter Notebook on Kubernetes, are not using NetApp Trident to orchestrate persistent storage, specified a custom storagePrefix, or simply chose not to execute the above code, you will need to directly set the ‘volumeName’ variable equal to the name of the ONTAP volume for which you wish to create a Snapshot, as shown in the following code.

volumeName = 'ml_dataset_vol'

You can now create a Snapshot of the specified volume by executing the following code.

volume = Volume.find(name = volumeName)
timestamp = datetime.today().strftime("%Y%m%d_%H%M%S")
snapshot = Snapshot.from_dict({
    'name': 'jupyter_%s' % timestamp,
    'comment': 'Snapshot created from within a Jupyter Notebook',
    'volume': volume.to_dict()
})
response = snapshot.post()
print("API Response:")
print(response.http_response.text)

If you wish to retrieve all details for the Snapshot that you just created, then simply execute the following code.

print(json.dumps(snapshot.to_dict(), indent=2))

Lastly, you can retrieve a list of all Snapshots that exist for a particular volume by executing the following code. Note that this code will retrieve a maximum of 256 Snapshots.

totalSnapshots = 0

for volumeSnapshot in Snapshot.get_collection(volume.uuid, max_records = 256):
    totalSnapshots += 1
    print("Snapshot #%s:" % totalSnapshots)
    print(json.dumps(volumeSnapshot.to_dict(), indent=2), "\n")
print("Total Snapshots: %s" % totalSnapshots)

These code snippets can be used to implement efficient on-demand dataset and model versioning directly from within a Jupyter Notebook. All saved versions will be read-only and storage-efficient. When you consistently save dataset and model versions as part of your experimentation and training processes, you will be able to trace every single trained model back to the exact dataset(s) that the model was trained and/or validated with.
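If you find yourself repeating these steps across many notebooks, the naming conventions used in the snippets above can be factored out into a couple of small, pure helper functions. The following is a minimal sketch; the function names (‘pv_to_volume_name’ and ‘build_snapshot_name’) are illustrative and are not part of the netapp_ontap library, and the PV-name conversion assumes the default Trident storagePrefix.

```python
from datetime import datetime

def pv_to_volume_name(pv_name, storage_prefix='trident_'):
    # Convert a Kubernetes PV name to the ONTAP volume name that Trident
    # assigns when the default storagePrefix is in use (hyphens become
    # underscores and the prefix is prepended).
    return storage_prefix + pv_name.replace('-', '_')

def build_snapshot_name(prefix='jupyter', now=None):
    # Build a timestamped Snapshot name, e.g. 'jupyter_20200102_030405'.
    now = now or datetime.today()
    return '%s_%s' % (prefix, now.strftime('%Y%m%d_%H%M%S'))

print(pv_to_volume_name('pvc-db963a53-abf2-4ffa-9c07-8815ce78d506'))
print(build_snapshot_name())
```

The output of ‘pv_to_volume_name’ can be passed to Volume.find, and the output of ‘build_snapshot_name’ can be used as the ‘name’ field when constructing a Snapshot, exactly as in the snippets above.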

For more information on integrating NetApp Data Management functions into your AI/ML/DL environment, I suggest taking a look at the NetApp AI Control Plane.

For information on all of NetApp’s AI/ML/DL solutions, refer to netapp.com/ai.

For more information on the netapp_ontap Python library, refer to https://devnet.netapp.com/restapi.

About Mike Oglesby

Mike is a Technical Marketing Engineer at NetApp focused on MLOps and Data Pipeline solutions. He architects and validates full-stack AI/ML/DL data and experiment management solutions that span a hybrid cloud. Mike has a DevOps background and a strong knowledge of DevOps processes and tools. Prior to joining NetApp, Mike worked on a line of business application development team at a large global financial services company. Outside of work, Mike loves to travel. One of his passions is experiencing other places and cultures through their food.
