StorageGRID Webscale 11 now supports native integration with the AWS cloud. This includes cross-region replication to AWS S3 using CloudMirror, notifications through the AWS Simple Notification Service (SNS), and metadata streaming into Elasticsearch for search.
With Amazon’s Simple Storage Service (S3) being “the internet’s hard drive”, many younger companies already rely on S3 to store massive data sets. In contrast, long-standing companies have large data sets stored in traditional on-premises infrastructure. These customers are looking at ways to drive new value from their existing data and to operate with greater efficiency. Moving their data to the cloud seems to be the obvious answer. This usually means thinking about how to bridge from on-premises to the public cloud to build powerful data processing pipelines.
Unfortunately, this usually means making a difficult decision:
- Leverage on-premises solutions and be more cost-effective at scale, but potentially miss out on new analytics and apps offered by public cloud or face the complexity and investment of implementing the same apps on-premises, or
- Rely on public clouds and their compelling analytics and machine learning feature sets, but potentially receive huge bills for moving data between on-premises and the public cloud as well as for storing large data sets in the public cloud
To take advantage of the upsides of both options, we introduced native integration with Amazon Web Services (AWS) in StorageGRID Webscale 11.0. StorageGRID is now able to replicate data into AWS S3, trigger the AWS Simple Notification Service (SNS), and stream metadata into Elasticsearch for indexing. This makes the data easy to consume in AWS when building powerful data processing pipelines. Customers can choose precisely which data to replicate to the cloud, or even trigger AWS functions to run against data stored on premises. This creates cost advantages: customers can keep colder, petabyte-scale data sets on premises to control costs while still taking full advantage of the ever-increasing features of a public cloud.
In the remainder of this post, we’ll show how to configure CloudMirror replication in StorageGRID and explain how it works.
CloudMirror replication copies objects from a source S3 bucket to a destination S3 bucket. The replication runs asynchronously in near real time and is independent of the geographical location of the buckets. Furthermore, any combination of versioned and non-versioned buckets can be used. For example, the source could be non-versioned while the destination is versioned.
With StorageGRID 11.0, you can replicate on-premises data to one or more AWS regions. You can use this to create an additional copy of your data for superior protection. More importantly, this enables using Amazon’s analytics or machine learning services to derive new insights from your data. Let’s look at this in more detail!
Use Case Examples
- Extract new insights from existing data – By replicating objects from an on-premises StorageGRID to AWS S3, we can leverage the rich analytics services from AWS. For example, we can trigger Amazon EMR to run MapReduce jobs on large datasets that have been replicated. Alternatively, we can use AWS Lambda to feed replicated data into data processing pipelines using Kinesis and other services. With the rise of machine learning, we can leverage Amazon Rekognition or directly feed data into machine learning algorithms for training new classifiers or classifying data. The data flow couldn’t be any simpler:
  - Write data to StorageGRID
  - Replicate to AWS S3
  - Trigger EMR, Rekognition, Lambda or feed data into Kinesis, machine learning, etc.
- Secondary disaster copy off-site – Replicating objects from an on-premises StorageGRID to AWS S3 allows us to potentially rely on a more cost-efficient data protection scheme. This is because we now have an additional disaster copy in Amazon. Furthermore, we could also replicate data to endpoints other than AWS, such as a secondary, stand-alone StorageGRID. The secondary could be another S3 endpoint operated by a service provider or a StorageGRID instance run by a different entity within the same organization.
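To make the first use case a little more concrete, here is a minimal sketch of the last step in that data flow. It assumes the destination bucket in AWS has an S3 event notification that invokes a Lambda function, which is AWS-side setup this post does not cover. The function names `replicated_objects` and `handler` are hypothetical, and the hand-off to EMR, Rekognition or Kinesis is left as a comment.

```python
from urllib.parse import unquote_plus


def replicated_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context):
    """Hypothetical Lambda entry point: list the objects that just arrived."""
    # A real pipeline would hand each object to EMR, Rekognition, Kinesis, etc.
    return [f"s3://{bucket}/{key}" for bucket, key in replicated_objects(event)]
```

Because replication is asynchronous, a function like this would run shortly after the object lands in the destination bucket, not at the moment it is written to StorageGRID.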
Many more use cases are possible, limited only by what you can do with AWS and your imagination! Next, let’s look at how to configure it.
Configuring CloudMirror Replication
First, let’s create a bucket in AWS S3 that we’ll use as the replication destination:
$ aws s3 mb s3://sgws11-rocks --profile aws-s3
make_bucket: sgws11-rocks
As a next step, we need to log in to the StorageGRID tenant UI and configure our new S3 bucket as a replication endpoint. To do this, navigate to S3 –> Endpoints –> Create and fill out the fields with the URI of the AWS S3 region endpoint, the URN in the format arn:aws:s3:::<bucket_name>, and your access key and secret access key. The AWS S3 region endpoint URLs can be found in the AWS documentation.
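As a small illustration of the values that go into the endpoint form, the sketch below builds the URI and URN for our example bucket. The helper names and the `us-east-1` region are assumptions for illustration only. The endpoint pattern shown applies to the standard commercial AWS regions, so check the AWS documentation for your region.

```python
def s3_region_endpoint(region):
    """Regional S3 endpoint URL (pattern for standard commercial AWS regions)."""
    return f"https://s3.{region}.amazonaws.com"


def s3_bucket_urn(bucket_name):
    """URN of a destination bucket in the arn:aws:s3:::<bucket_name> format."""
    return f"arn:aws:s3:::{bucket_name}"


# Hypothetical values for the endpoint form; the region is an assumption.
endpoint_fields = {
    "uri": s3_region_endpoint("us-east-1"),
    "urn": s3_bucket_urn("sgws11-rocks"),
}
print(endpoint_fields)
```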
Let’s create a bucket in StorageGRID on-premises that we’ll use as a replication source:
$ aws s3 mb s3://sgws11-rocks-on-prem/ --endpoint-url https://s3.mycompany.com:8082 --profile sgws11
make_bucket: sgws11-rocks-on-prem
Lastly, we configure CloudMirror replication from this bucket to AWS. Navigate to S3 –> Buckets, select the desired bucket, then click Configure Replication. The configuration is XML that is fairly self-explanatory and aligned with the AWS definition. Note that we use the URN to specify the replication destination:
<ReplicationConfiguration>
  <Rule>
    <Status>Enabled</Status>
    <Prefix></Prefix>
    <Destination>
      <Bucket>arn:aws:s3:::sgws11-rocks</Bucket>
    </Destination>
  </Rule>
</ReplicationConfiguration>
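If you manage many buckets, you may prefer to generate this XML rather than type it by hand. The following sketch uses only the Python standard library to build the same structure as above. The function name `replication_config` and its `prefix` parameter are hypothetical, and the output is meant to be pasted into the Configure Replication dialog, not sent to any API.

```python
import xml.etree.ElementTree as ET


def replication_config(destination_urn, prefix=""):
    """Build a single-rule replication configuration as an XML string."""
    root = ET.Element("ReplicationConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "Status").text = "Enabled"
    # An empty prefix matches every object in the source bucket.
    ET.SubElement(rule, "Prefix").text = prefix
    destination = ET.SubElement(rule, "Destination")
    ET.SubElement(destination, "Bucket").text = destination_urn
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    # Keep <Prefix></Prefix> instead of collapsing it to <Prefix />.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


print(replication_config("arn:aws:s3:::sgws11-rocks"))
```

Passing a non-empty `prefix` would restrict the rule to keys that start with that prefix, following the AWS replication rule definition.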
Finally, let’s put an object in StorageGRID:
$ aws s3 cp 2017-11-13NYOP.pdf s3://sgws11-rocks-on-prem/ --endpoint-url https://s3.mycompany.com:8082 --profile sgws11
upload: 2017-11-13NYOP.pdf to s3://sgws11-rocks-on-prem/2017-11-13NYOP.pdf
And, surprise, here is our new object in our AWS S3 bucket:
$ aws s3 ls s3://sgws11-rocks --profile aws-s3
2017-11-17 10:54:12 152431 2017-11-13NYOP.pdf
Note that replication took around one second in this example; larger objects might take longer to replicate.
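Since replication is asynchronous, a script that writes to StorageGRID and then reads from AWS should not assume the replica is there immediately. Here is a minimal polling sketch. The `wait_for_replica` helper is hypothetical, and `object_exists` stands in for whatever check you use against the destination bucket, so no AWS call is made in the code itself.

```python
import time


def wait_for_replica(object_exists, timeout_s=30.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll object_exists() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if object_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With boto3, for instance, `object_exists` could wrap a `head_object` call on the destination bucket and return False when the object is not found. The default timeout and interval here are arbitrary and should be tuned to your object sizes.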
StorageGRID Webscale 11.0 introduced tight integration with AWS. StorageGRID is now able to replicate data into AWS S3, trigger AWS SNS, and stream object metadata into Elasticsearch. This makes it easy to build hybrid-cloud workflows and gain new insights from the data that sits in your on-premises data center. In this post we’ve shown how to configure CloudMirror replication between StorageGRID and AWS S3, a solid foundation for building powerful data pipelines!