tl;dr Performing metadata search on billions of objects is now possible with StorageGRID Webscale by streaming object metadata into Elasticsearch.
Introduction
One advantage of object storage over block and file storage is that data can be enriched with metadata and tags. These are small key-value pairs that are attached via the S3 API. As your data volume grows, you will want to index this metadata so that you can search it.
With StorageGRID Webscale 11.0, you have an open and scalable solution for metadata search. By streaming metadata into Elasticsearch, you can search through the tags of billions of objects. Elasticsearch is one of the world's most popular open source search engines, and it has a proven track record of introducing innovative features. This helps you move faster instead of tying yourself to a proprietary solution.
In this post, you will learn how to configure and use metadata search in StorageGRID.
Object Metadata and Object Tags
StorageGRID supports Metadata and Object Tags with S3. While the concepts are similar, they differ in the following ways:
- Object Tags
  - Up to 10 unique tags per object in the form of key-value pairs
  - Up to 128 characters per key and 256 characters per value
  - Can be changed independently of the object
  - Can drive the Policy Engine
- Object Metadata
  - Up to 2 KB of total information per object in the form of key-value pairs
  - Can only be changed by overwriting the object (through copy_from with the same source and destination)
  - Can drive the Policy Engine
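The difference in how tags and metadata are changed can be sketched with boto3-style client calls. The helper names (`validate_tags`, `update_tags`, `replace_metadata`) are hypothetical, and `s3_client` is assumed to be a boto3 S3 client pointed at your StorageGRID endpoint. `put_object_tagging` and `copy_object` are standard S3 client operations, but it is worth checking that your StorageGRID version supports them as used here.

```python
# Limits taken from the list above.
MAX_TAGS = 10
MAX_TAG_KEY_CHARS = 128
MAX_TAG_VALUE_CHARS = 256


def validate_tags(tags):
    """Raise ValueError if a dict of tags exceeds the limits listed above."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for key, value in tags.items():
        if len(key) > MAX_TAG_KEY_CHARS:
            raise ValueError(f"tag key too long: {key[:20]}...")
        if len(value) > MAX_TAG_VALUE_CHARS:
            raise ValueError(f"tag value too long for key {key!r}")


def update_tags(s3_client, bucket, key, tags):
    """Replace the tag set of an existing object without rewriting its data."""
    validate_tags(tags)
    tag_set = [{"Key": k, "Value": v} for k, v in tags.items()]
    s3_client.put_object_tagging(
        Bucket=bucket, Key=key, Tagging={"TagSet": tag_set}
    )


def replace_metadata(s3_client, bucket, key, metadata):
    """Replace user metadata by copying the object onto itself."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=metadata,
        MetadataDirective="REPLACE",  # without this, the old metadata is copied
    )
```

Tags are updated with a dedicated call, while metadata needs a full copy of the object onto itself, which is why tags are the better fit for values that change often.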
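For example, you could attach a user ID and the creation device as metadata:

    import boto3

    # Assumes boto3 is configured with your StorageGRID S3 endpoint and credentials.
    s3 = boto3.resource('s3')

    metadata = {'userid': '12345', 'device': 'phone'}
    obj = s3.Object('metadata-bucket', 'my_unique_object1.txt')
    obj.put(Body='Random Object Data', Metadata=metadata)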
Now let's assume your application stores millions or billions of such data points in StorageGRID. Obviously, you now want to perform search queries like:
- Give me back the most recent objects that were created on a phone
- Give me back all objects from user with id 42
This brings us to the topic of metadata search.
Metadata Search
Metadata search is the challenge of retrieving objects that match certain criteria. Criteria could be object name, size, date, owner, metadata or object tags. However, the AWS S3 API does not offer such an operation. This is a good thing, as modern systems should be loosely coupled: S3 should only take care of storing objects, and another system should provide search. For this reason, we decided to integrate with Elasticsearch to enable metadata search.
In our approach, StorageGRID streams all object creations, updates and deletions into Elasticsearch. For the example object above, it would put the following JSON document into Elasticsearch:

    {
      "bucket": "metadata-bucket",
      "key": "my_unique_object1.txt",
      "versionId": "MTE3OTZCNjAtNjM5MC0xMUU2LTgwMDAtMDAwMDAwQkFBNEM2",
      "accountId": "869284...0126822",
      "size": 42,
      "md5": "3d6c7634a8543...012855",
      "metadata": {
        "userid": "12345",
        "device": "phone"
      },
      "tags": { }
    }

Whenever you overwrite metadata or update the object tags, the Elasticsearch index is updated. If Elasticsearch is unavailable, the update is applied once it comes back. With this, you can perform arbitrary queries on your objects' metadata. So let's configure it!
Configuration of Metadata Search
First of all, create an Elasticsearch cluster using the Elasticsearch Service on AWS. Alternatively, deploy a cluster manually. In our example, we use a t2.small instance for testing and allow access from our own IP address.
Next, let's create an index named "objects" where we store the metadata of our objects:
    $ curl -X PUT https://search-xxxxxxx.us-east-1.es.amazonaws.com/objects/
    {"acknowledged":true,"shards_acknowledged":true}
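If you prefer to script this step, the same request can be sketched in Python with only the standard library. The function names are hypothetical, and the sketch assumes, as in our test setup, that the cluster accepts unsigned requests from your IP address.

```python
import json
import urllib.request


def build_create_index_request(endpoint, index):
    """Build (but do not send) the PUT request that creates an index."""
    url = f"{endpoint.rstrip('/')}/{index}"
    return urllib.request.Request(url, method="PUT")


def create_index(endpoint, index):
    """Send the request and return the decoded JSON response."""
    request = build_create_index_request(endpoint, index)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `create_index(endpoint, "objects")` with your own endpoint URL should return the same acknowledgement that curl prints.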
Then configure the Elasticsearch endpoint in StorageGRID. Once logged into the tenant UI, click S3 --> Endpoints --> Create:
- Paste the Endpoint URL from your AWS console as the URI
- Paste the ARN from the AWS console as the URN and add /objects/metadata at the end
  - Example: arn:aws:es:us-east-1:1234567890:domain/sgws11-example/objects/metadata
In this case, the endpoint connects to your index named "objects" and stores the documents under the type "metadata".
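The structure of the URN can be illustrated with two small helpers. This is only a sketch: the function names are hypothetical, and it assumes the URN always has the shape of the example, that is the domain ARN followed by the index and the type.

```python
def build_destination_urn(domain_arn, index, doc_type):
    """Append the index and type to the ARN of the Elasticsearch domain."""
    return f"{domain_arn.rstrip('/')}/{index}/{doc_type}"


def split_destination_urn(urn):
    """Split a URN into (domain_arn, index, doc_type)."""
    # The last colon separates the ARN prefix from the resource path.
    prefix, resource = urn.rsplit(":", 1)
    parts = resource.split("/")
    if len(parts) != 4 or parts[0] != "domain":
        raise ValueError(f"unexpected URN format: {urn}")
    _, domain, index, doc_type = parts
    return f"{prefix}:domain/{domain}", index, doc_type


urn = build_destination_urn(
    "arn:aws:es:us-east-1:1234567890:domain/sgws11-example", "objects", "metadata"
)
print(urn)
print(split_destination_urn(urn))
```

The last two path segments are what decide where the documents end up, so a typo there sends metadata to a different index or type than the one you query.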
As a last step, configure your bucket to stream all changes to the Elasticsearch endpoint. To do so, go to S3 --> Buckets --> Select your Bucket --> Configure Search Integration. You just need to enter the URN as the destination URN:

    <MetadataNotificationConfiguration>
      <Rule>
        <ID>Rule-1</ID>
        <Status>Enabled</Status>
        <Prefix></Prefix>
        <Destination>
          <Urn>arn:aws:es:us-east-1:1234567890:domain/sgws11-example/objects/metadata</Urn>
        </Destination>
      </Rule>
    </MetadataNotificationConfiguration>
Optionally, you can configure a prefix for filtering or stream to multiple Elasticsearch clusters. Finally, we can start putting objects into our bucket with metadata search enabled!
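When you need several rules, generating the configuration can be less error-prone than editing XML by hand. The following sketch builds the document shown above with Python's standard library. The function name, the second rule, its `logs/` prefix and its URN are hypothetical, and it assumes that one additional `Rule` element per destination is how further clusters are expressed.

```python
import xml.etree.ElementTree as ET


def build_search_config(rules):
    """Build the configuration XML from (rule_id, prefix, urn) tuples."""
    root = ET.Element("MetadataNotificationConfiguration")
    for rule_id, prefix, urn in rules:
        rule = ET.SubElement(root, "Rule")
        ET.SubElement(rule, "ID").text = rule_id
        ET.SubElement(rule, "Status").text = "Enabled"
        # An empty prefix matches every object in the bucket.
        ET.SubElement(rule, "Prefix").text = prefix
        destination = ET.SubElement(rule, "Destination")
        ET.SubElement(destination, "Urn").text = urn
    return ET.tostring(root, encoding="unicode")


config_xml = build_search_config([
    ("Rule-1", "", "arn:aws:es:us-east-1:1234567890:domain/sgws11-example/objects/metadata"),
    # Hypothetical second rule: only objects under logs/ go to another cluster.
    ("Rule-2", "logs/", "arn:aws:es:us-east-1:1234567890:domain/another-example/objects/metadata"),
])
print(config_xml)
```

The output is a single line of XML that can be pasted into the Configure Search Integration dialog.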
Usage Example for Metadata Search
To finish off our example, let's query for objects belonging to the user with id 12:
    $ curl -XGET https://search-sgws11-xxxxxxxxxx.us-east-1.es.amazonaws.com/objects/_search -d \
    '{ "query" : { "term" : { "metadata.userid": "12" } } }'

    Result:

    {
      "took": 4,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 39,
        "max_score": 3.5784557,
        "hits": [
          {
            "_index": "objects",
            "_type": "metadata",
            "_id": "metadata-bucket_object_251",
            "_score": 3.5784557,
            "_source": {
              "bucket": "metadata-bucket",
              "key": "object_251",
              "versionId": "",
              "accountId": "14224026995336954030",
              "size": 18,
              "md5": "45dc95d53bb609387922c676448465f8",
              "metadata": {
                "device": "phone",
                "userid": "12"
              },
              "tags": null
            }
          },
          {
            "_index": "objects",
            "_type": "metadata",
            "_id": "metadata-bucket_object_439",
            "_score": 3.5784557,
            "_source": {
              "bucket": "metadata-bucket",
              "key": "object_439",
              "versionId": "",
              "accountId": "14224026995336954030",
              "size": 18,
              "md5": "45dc95d53bb609387922c676448465f8",
              "metadata": {
                "device": "laptop",
                "userid": "12"
              },
              "tags": null
            }
          },
          ...
        ]
      }
    }
As a result, we get back a set of objects belonging to user 12. As a next step, you could explore the Elasticsearch DSL for building more complex queries.
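As a starting point for more complex queries, here is a sketch that combines several metadata criteria in one `bool` query and extracts the object locations from a response. The function names are hypothetical. Whether `term` matches exactly as shown depends on how your index maps the metadata fields, so treat this as an illustration rather than a reference.

```python
import json
import urllib.request


def build_metadata_query(**criteria):
    """Build a bool query that requires every given metadata key to match."""
    filters = [
        {"term": {f"metadata.{name}": value}}
        for name, value in sorted(criteria.items())
    ]
    return {"query": {"bool": {"filter": filters}}}


def build_search_request(endpoint, index, query):
    """Build (but do not send) the _search request for a query body."""
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def object_locations(response):
    """Return a (bucket, key) pair for every hit in a search response."""
    return [
        (hit["_source"]["bucket"], hit["_source"]["key"])
        for hit in response["hits"]["hits"]
    ]


# All objects that user 12 created on a phone.
query = build_metadata_query(userid="12", device="phone")
print(json.dumps(query, indent=2))
```

The pairs returned by `object_locations` can then be passed to your S3 client to fetch the matching objects from StorageGRID.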
Summary
StorageGRID 11.0 introduces an open and scalable solution for metadata search. You can now stream object metadata for indexing and searching into Elasticsearch. Consequently, you can leverage the innovation coming from Elasticsearch instead of locking yourself into a closed ecosystem. As a result, data scientists or developers can use this data to search through billions of object metadata tags. Last but not least, object creations, deletions, or metadata updates automatically update the Elasticsearch cluster.
If you want to learn more about StorageGRID Webscale, check out our other posts.