There and Back Again: An OpenStack Tale

By Chad Morgenstern and Jenny Yang

Disasters happen for many reasons, ranging from site failures to regional failures to human error. Remember the hyperscaler site failures of the past few years? If it happens to you, are you prepared? Being at a storage company, living and working in the 3rd platform, we are particularly interested in how data is recovered in a cloudy world: what are the options, how do they work, and how well do they work?

Our current focus is on OpenStack, particularly Cinder and Manila. We will start with Cinder and move on from there. By the time this study is over, we should have a pretty good understanding of our options. The plan is simple: focus on individual use cases and poke and prod at them until there is nothing left to learn.

Before diving into Cinder, let’s take a minute to explore some disaster recovery concepts that everyone ought to know:

  • Recovery point objective: The RPO refers to the amount of data at risk. It is determined by the amount of time between data protection events and reflects the maximum amount of data that could be lost during disaster recovery (Vellante, 2008).
  • Recovery time objective: The RTO is the targeted duration of time, and a service level, within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity (Singh, 2008).

Understanding these concepts defines the working space within which all decisions must be made. For example, does the technology you are investigating have a fixed replication window? If so, is that window small enough to meet the RPO defined by business needs? As another example, if you agree to a specific recovery time objective, how will you develop confidence that it can be met? Does the technology allow for failover testing? An RTO cannot realistically be agreed upon without testing.
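To make the RPO question concrete, here is a minimal sketch in Python of the worst-case data loss implied by a periodic replication schedule. The interval and transfer time are hypothetical numbers for illustration, not measurements from any particular backend.

    # Worst-case data at risk with periodic (asynchronous) replication.
    # Both values below are hypothetical, purely for illustration.
    replication_interval_min = 30   # minutes between replication updates
    transfer_time_min = 10          # minutes for one update to finish copying

    # If the primary site is lost just before an update completes, the last
    # usable recovery point is the previous completed update, so the exposure
    # is roughly one interval plus one transfer time.
    worst_case_loss_min = replication_interval_min + transfer_time_min

    business_rpo_min = 60           # RPO agreed upon with the business

    if worst_case_loss_min <= business_rpo_min:
        print("OK: worst case of %d minutes fits the %d minute RPO"
              % (worst_case_loss_min, business_rpo_min))
    else:
        print("Gap: worst case of %d minutes exceeds the %d minute RPO"
              % (worst_case_loss_min, business_rpo_min))

The same back-of-the-envelope math works in reverse: given a business RPO, it bounds how long the replication interval plus transfer time is allowed to be.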

Related to RPO and RTO, you must also take into account what type of DR planning needs to be done in the first place. Consider the following questions:

  • Is DR necessary for physical disasters? Does your application control its own high availability, or does it have catastrophic disaster recovery built in at the application layer? For example, is your application built for the 3rd platform, like MongoDB or Cassandra, which, when replicated across different locales, addresses essentially everything but human error on its own?
  • Is DR necessary for logical failures? Can the data be recovered from backup in the event of human error, and if so, is the recovery from backup fast enough to meet RTO commitments? 3rd platform applications are not immune to this scenario; say, for example, a MongoDB DBA deletes the wrong set of documents (the MongoDB analog of database rows), and the deletion propagates across replicas.
  • How about 2nd platform applications and DR? Some 2nd platform applications have DR capabilities of their own. Take, for example, Oracle with Data Guard or Active Data Guard. Oracle can fend for itself with these capabilities; however, the Data Guard technologies are not cheap, as the licensing model is (or at least was) per core.
  • Is DR necessary at all? Not all data is created equal. Some data can be thrown away, and other data can be regenerated quickly enough to meet RPO and RTO objectives. If this is your scenario, formal DR may not be needed at all.

In short, when approaching DR, you need to start by thinking about three lines of questions:

  1. What are you trying to achieve (recover from physical or logical failure)?
  2. How much time do you have to achieve it (RPO and RTO)?
  3. Do you have the budget to achieve business resilience (DR) in the manner you are exploring?

Over the next few months we will spend time primarily on the capabilities afforded by Cinder and by Manila. Though interesting, the capabilities of 3rd platform applications are beyond the scope of this series. With that said, we may dive into applications from time to time as need or interest demands. We may also dive into other OpenStack DR projects, and potentially 3rd party products as well.

This is what we are currently considering; we will go deep, but not too deep. We will start the series off with Cinder failover (a brief preview follows the list below) and drive forward from there:

  • Recovery by Cinder failover
  • Recovery from NetApp ONTAP Storage Virtual Machine (SVM) DR following the loss of an NFS backend while the root volumes remain accessible
  • Recovery from ONTAP SVM DR following the loss of an NFS backend
  • Recovery from ONTAP SVM DR following the loss of an iSCSI backend
  • Then onto Manila and DR…
  • Onto SolidFire…
  • A study of project Karbor
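As a preview of that first item: Cinder's replication failover is driven from the control plane, for example with the cinder failover-host CLI command. The sketch below shows roughly what the same action might look like from Python. It assumes credentials in the environment, a replicated backend with the made-up name cinder-host@ontap-nfs, and that python-cinderclient exposes the failover action as services.failover_host, so treat it as an illustration rather than a tested procedure.

    # Hypothetical sketch: fail a replicated Cinder backend over to its
    # secondary. The backend name and backend_id below are made-up examples.
    import os
    from cinderclient import client as cinder_client

    cinder = cinder_client.Client(
        '3',                            # Block Storage API v3
        os.environ['OS_USERNAME'],
        os.environ['OS_PASSWORD'],
        os.environ['OS_PROJECT_NAME'],
        os.environ['OS_AUTH_URL'],
    )

    # Assumption: this client call maps to the same action as the
    # `cinder failover-host` command; backend_id names the replication
    # target defined for this backend in cinder.conf.
    cinder.services.failover_host(host='cinder-host@ontap-nfs',
                                  backend_id='secondary')

How well this works in practice, and what it means for RPO and RTO, is exactly what the upcoming posts will explore.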
Chad Morgenstern
Chad is a 10-year veteran at NetApp, having held positions in escalations, on reference architecture teams, and most recently on the workload performance team, where he contributed to significant understanding of AFF performance for things such as VDI environments, the SPC-1 benchmark, and more. In addition, Chad has spent time building automated tools and frameworks for performance testing, a few of them based on open-source technologies such as cloud and containers.

Chad is happily married, the father of four daughters, and soon-to-be owner of two ferrets.
