Cinder Replication with NetApp: The perfect cheesecake recipe!

Warning: Some of this content may be outdated. For an updated perspective, see There and Back Again: An OpenStack Tale.

 

Written by Goutham Pacha Ravi & Sumit Kumar

53%! Yup, that’s the percentage of organizations that can tolerate less than an hour of downtime before significant revenue loss! [1]

Here comes Cheesecake to the rescue! No, we’re not talking about the kind that you can eat and forget all your problems (sorry!). Cheesecake is the codename given to Cinder replication for Disaster Recovery (DR) use-cases by the OpenStack community. Here’s a link to the design specification: https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/cheesecake.html

‘Wait, I thought I could already have replication with Cinder?!’ – Well, yes – while you did have the option to set up pool-level (NetApp FlexVol) replication with the NetApp driver for Cinder, Cheesecake enables you to implement a backend-level disaster recovery mechanism. Thus, instead of failing over on a per-pool (FlexVol) basis, you can now fail over on a backend* (SVM) basis, which significantly reduces administrative complexity and service interruption!

*  A cinder backend for cDOT is a set of FlexVols on a given Vserver. These FlexVols are identified with the NetApp option “netapp_pool_name_search_pattern”.

[Figure: Cinder Cheesecake replication setup]

Why Cheesecake?

  • Business environments have always required 24/7 data availability. An enterprise’s storage must deliver the base building block for IT infrastructures, providing data storage for all business applications and objectives. Constant data availability therefore begins with architecting storage systems that facilitate nondisruptive or minimal-downtime operations. This functionality is desired in three principal areas: hardware resiliency, hardware and software lifecycle operations, and hardware and software maintenance operations.
  • Cheesecake provides a way for you to configure one or more disaster recovery partner storage systems to your cinder backend. So, if your cinder backend fails, you may, via a cinder API, flip a switch to continue your operations from one of the disaster recovery partner storage systems without losing access to your critical data for long periods of time.

How do I set it up?

We do realize that this section is a little long, but configuration is simple and straightforward – we promise!

  1. Set up your NetApp backend to enable replication

    If you’re setting up a new NetApp backend for Cinder, configure it as required – the complete guide to configuring a NetApp backend with Cinder is available in our Deployment and Operations Guide. Once that’s done (or if you already have a NetApp backend), go ahead and add these two new parameters to your backend stanza in the cinder.conf file to enable Cinder replication.

    • replication_device = backend_id:target_cmodeiSCSI
      This parameter allows you to set the backend that you want to use as your replication target, identified by its backend ID. Here, target_cmodeiSCSI denotes the name of another NetApp backend section that you may want to use as your replication target. Please note that while you can add this secondary / target backend to the “enabled_backends” parameter in cinder.conf, we highly recommend NOT doing so. Setting your target backend up as an enabled backend in Cinder may cause the Cinder scheduler to place volumes on it, thus reducing the space available for your replicas.
    • netapp_replication_aggregate_map = backend_id:target_cmodeiSCSI,source_aggr_1:destination_aggr_1,source_aggr_2:destination_aggr_2

      As the name suggests, this parameter allows you to create a source-to-destination aggregate map for your replicated FlexVols. It is recommended that, on your target backend, you match the characteristics of the containing aggregates for all the FlexVols that make up your cinder backend. Please note that the storage efficiency properties of the source FlexVol will be preserved in the target FlexVol.
      [Figure: source FlexVol properties]
      [Figure: target FlexVol properties]
      NetApp does support one-to-many target relationships. Both the “replication_device” and “netapp_replication_aggregate_map” parameters are repeatable, so if you don’t want to rely on a single target and want to replicate to multiple locations, you can easily do so. Here’s an example:


      Example cinder.conf:

      Below is an example of what your cinder.conf file might look like with replication enabled. Please note that each replication target needs to have its own configuration stanza / section in the same cinder.conf. This is necessary because the driver addresses replication targets by name, i.e., by the replication_device’s backend_id parameter.
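
      A rough sketch of such a configuration follows; the backend names, SVM names, aggregates, addresses, and credentials are purely illustrative, so adjust them to your environment:

      [DEFAULT]
      enabled_backends = cmodeiSCSI

      [cmodeiSCSI]
      volume_backend_name = cmodeiSCSI
      volume_driver = cinder.volume.drivers.netapp.common.NetAppDriver
      netapp_storage_family = ontap_cluster
      netapp_storage_protocol = iscsi
      netapp_server_hostname = primary.cluster.example.com
      netapp_server_port = 80
      netapp_login = admin
      netapp_password = <password>
      netapp_vserver = svm_primary
      netapp_pool_name_search_pattern = (.*)
      replication_device = backend_id:target_cmodeiSCSI
      netapp_replication_aggregate_map = backend_id:target_cmodeiSCSI,src_aggr_1:dst_aggr_1,src_aggr_2:dst_aggr_2
      # To replicate to multiple locations, repeat the replication_device and
      # netapp_replication_aggregate_map lines, once per target.

      # The target backend gets its own stanza but is intentionally NOT listed
      # under enabled_backends, so the scheduler never places new volumes on it.
      [target_cmodeiSCSI]
      volume_backend_name = target_cmodeiSCSI
      volume_driver = cinder.volume.drivers.netapp.common.NetAppDriver
      netapp_storage_family = ontap_cluster
      netapp_storage_protocol = iscsi
      netapp_server_hostname = secondary.cluster.example.com
      netapp_server_port = 80
      netapp_login = admin
      netapp_password = <password>
      netapp_vserver = svm_secondary
      netapp_pool_name_search_pattern = (.*)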

  2. Enable Vserver (SVM) peering                                                                                                                                                            
    You can read more about cluster and Vserver peering in our Express Guide.
  3. Restart Cinder service
  4. Ensure that you have everything set up properly                                                                                                                                                            
    You may use “cinder service-list --withreplication” to check the service list with replication-related information: replication_status, active_backend_id, etc. The active_backend_id field for the cinder-volume service should currently have no value; it will be populated with the target backend name after a failover has been initiated. The driver sets up all the SnapMirror relationships while initializing, and performs a periodic check to ensure that the SnapMirrors are healthy and updating. Any unexpected errors in this process are logged in the cinder-volume log file.

    [Figure: SnapMirror relationships created by the driver]
  5. Create a new volume type

    We’re almost there! In order to specify the volumes that you want to replicate, create a new volume type with the following extra-spec:

    replication_enabled = ‘<is> True’

    Please note that the value of this extra-spec is case-sensitive. Here’s an example:
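
    A minimal sketch using the python-cinderclient CLI (the type name “replicated-iscsi” is just an illustrative choice):

    $ cinder type-create replicated-iscsi
    $ cinder type-key replicated-iscsi set replication_enabled='<is> True'
    $ cinder extra-specs-list

    The last command simply lists volume types with their extra specs so you can verify that the spec was applied.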


    Now this is pretty obvious, but in case you have multiple back-ends, any volume created with the above extra spec will get created on your replication-enabled backend. If you don’t set the extra-spec, the volume may or may not end up on the replication backend, depending on the Cinder scheduler.

    In case you want to ensure that a specific Cinder volume does not get replicated, create it with a volume type whose replication_enabled extra spec is set to False: replication_enabled='<is> False'
    Please note that if the SVM has other FlexVols that are accessible and match the netapp_pool_name_search_pattern parameter in the cinder.conf file, they will get replicated as well.

Failing Over

Ok now that you have everything set up, we’re ready to fail over! Before we begin though, please note that the Cheesecake implementation allows only a one-time failover option. Failing back is not as simple as failing over and requires some additional steps and considerations – we’ll cover more details in another blog post at a later time.

Also, it’s good practice to use System Manager to monitor vital details like SnapMirror health, when it was last updated, etc. As of now, Cinder has no way to check these details for you.

Setting up Nova VM to test for fail-over success:

In order to test the failover operation, we will now boot a Nova VM from a Cinder volume on the replication backend. You may skip this section.

First, let’s take a look at the Cinder backends and volumes:
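
One way to pull that up from the CLI (output omitted here):

$ cinder service-list --withreplication
$ cinder get-pools
$ cinder list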

Next, let’s boot the Nova VM:
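
A minimal sketch of that sequence, assuming the illustrative “replicated-iscsi” volume type from earlier (the image ID, network ID, flavor, and names are placeholders):

$ cinder create --image-id <image-id> --volume-type replicated-iscsi --name cheesecake-bootvol 1
$ nova boot --flavor m1.small --boot-volume <volume-id-of-cheesecake-bootvol> --nic net-id=<net-id> cheesecake-vm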

Failing over to a target:

You can fail over by using the $cinder failover-host <hostname> --backend_id <failover target> command. If you have just one failover target, you can skip the --backend_id part of the above command, but specifying it is good practice anyway.

Here’s an example:
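
Assuming the cinder-volume host is named openstack@cmodeiSCSI (the host and backend names are illustrative):

$ cinder failover-host openstack@cmodeiSCSI --backend_id target_cmodeiSCSI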

After receiving the command, Cinder will disable the service and send a call to the driver to initiate the failover process. The driver then breaks all SnapMirror relationships, and the FlexVols under consideration become Read/Write. The driver also marks the primary site as dead, and starts to proxy the target (secondary) site as the primary.
So if you run $cinder service-list --withreplication again, you’ll notice that the service has been disabled.

 

[Figure: SnapMirror relationships broken off after failover]

 

If you need to re-enable the service so that new volumes can be created on the backend, you may do so using the $cinder service-enable command:
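
For example (the host name is illustrative):

$ cinder service-enable openstack@cmodeiSCSI cinder-volume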

NOTE:

For NFS, if you are using shares.conf to specify FlexVol mount paths, ensure that the NFS data LIFs of the actual active_backend_id are reflected in the file, and that the cinder-volume service is restarted after a failover.

Please note that since Cinder is proxying the secondary site (backend) as the primary, any new volumes that are created will have the backend-id (and other properties) of the first (primary) site.

To ensure that our services are still up after failing over, let’s try to attach our failed-over volume to a VM.
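
With the illustrative names used earlier, that would look something like:

$ nova volume-attach cheesecake-vm <failed-over-volume-id>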

 

[Figure: Cheesecake setup after failover]

Limitations & Considerations

One of the limitations is that the failover process for the Cinder backends needs to be initiated manually.

Also, since Nova does not know about the primary site (backend) going down, you will most likely end up with Zombie volumes because of unresolved connections to the primary site! This can potentially cause some service-level disruption.

To get around this, we recommend that you reset the Cinder volume state using the $cinder reset-state command, and have a script re-attach volumes to your Nova VMs. You may also do it manually using the $nova volume-detach and $nova volume-attach commands.
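
A sketch of that manual recovery for a single volume (the server name and volume ID are illustrative; adapt the ordering to your situation):

$ nova volume-detach cheesecake-vm <volume-id>
$ cinder reset-state --state available <volume-id>
$ nova volume-attach cheesecake-vm <volume-id>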

Your snapshot policies will also need to be added again on the secondary backend since they are not preserved during the failover process.

Even though this is planned to change in the Ocata release of OpenStack, the default RPO as of today is 60 minutes, i.e., SnapMirror updates only take place after an hour has passed since the last successful replication. Please keep that in mind as you put together your Disaster Recovery strategy.

Resources

[1] ESG Research Review: Data Protection Survey

Sumit Kumar
Sumit joined NetApp as a Technical Marketing Engineer for OpenStack in May 2015. He has been an active participant in various OpenStack meetups, and has presented sessions on various topics in multiple OpenStack forums. He is very excited about the future and potential of OpenStack, Docker containers, and various infrastructure and DevOps tools!
