Multi-Temperature Data Management Using S3 and Glacier

An introduction to the Storage Classes in S3

Purpose

This blog draws on Clairvoyant’s extensive experience with Amazon S3 to explore how choosing the right S3 storage class can cut your storage costs by more than 90%.

One of our customers in the retail industry wanted to let consumers upload an unlimited number of images. Previously, we had stored all user images in Amazon S3 using the S3 Standard storage class. However, given the client’s business, this is a perfect case for multi-temperature data management, as users won’t need most of the images again. So, we decided to move images that are unlikely to be needed to the Glacier storage class. Read this blog if you want to know more about monitoring measures of S3 storage security.

Identifying which images would still be required was a massive task in itself. We applied data science to solve this challenge, as it involved predicting consumer behavior. That work deserves a detailed blog of its own; here we focus on S3 and Glacier.

If a user asks for archived images back, we retrieve them using an Amazon S3 Batch Operations request.

Multi-temperature data management

In multi-temperature data management, frequently accessed data (hot and warm data) is stored on fast storage devices, while rarely accessed data (cold data) is stored on slower storage devices. Archived data is considered dormant and is stored on cold storage devices.

What is Amazon S3?

Amazon Simple Storage Service (Amazon S3) is storage for the Internet. Using the Amazon S3 web services interface, you can store and retrieve any amount of data, at any time, from anywhere on the web.

Head to this blog to learn about the best security practices of AWS S3 data.

Storage classes in Amazon S3

Amazon S3 offers a range of storage classes for the objects you store. You choose a storage class based on your use case and its performance and access requirements.

S3 Standard Vs. S3 Glacier storage class

S3 Standard is the default storage class and is best suited for performance-sensitive use cases that require millisecond access times. The S3 Glacier and S3 Glacier Deep Archive storage classes are designed for low-cost data archival, with the same durability and resiliency as S3 Standard. These are the two extremes; several other storage classes sit in between. For more details, follow the link.

Amazon S3 Object

Amazon S3 is an object store in which every object is identified by a unique key. The key is the name that you assign to the object, and the value is the content you store.

Lifecycle Management

An S3 lifecycle configuration is a set of rules that define actions Amazon S3 applies to a group of objects. We can use these rules to manage storage cost-effectively.

There are two types of action: transition actions and expiration actions. The former define when an S3 object transitions to another storage class, while the latter define when an S3 object expires.

Using the AWS Java SDK, a lifecycle rule can be added to an S3 bucket to move objects to Glacier after two years.
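As a rough illustration of what such a rule contains, the sketch below builds the XML form of a lifecycle configuration with a single transition rule, using only the Java standard library. It is not the SDK call itself. The rule ID "archive-old-images" and the prefix "images/" are hypothetical, "two years" is approximated as 730 days, and the sketch does not XML-escape its inputs.

```java
class LifecycleRuleSketch {

    /**
     * Builds the XML body of a lifecycle configuration with one rule that
     * transitions objects under the given key prefix to the GLACIER storage
     * class after the given number of days.
     */
    public static String glacierTransitionRule(String ruleId, String prefix, int days) {
        if (days < 0) {
            throw new IllegalArgumentException("days must not be negative");
        }
        return String.join("\n",
                "<LifecycleConfiguration>",
                "  <Rule>",
                "    <ID>" + ruleId + "</ID>",
                "    <Filter><Prefix>" + prefix + "</Prefix></Filter>",
                "    <Status>Enabled</Status>",
                "    <Transition>",
                "      <Days>" + days + "</Days>",
                "      <StorageClass>GLACIER</StorageClass>",
                "    </Transition>",
                "  </Rule>",
                "</LifecycleConfiguration>");
    }

    public static void main(String[] args) {
        // Hypothetical rule ID and prefix; 730 days stands in for two years.
        System.out.println(glacierTransitionRule("archive-old-images", "images/", 730));
    }
}
```

An expiration action would take the place of (or sit alongside) the Transition element in the same rule.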

Restoring Glacier archives

S3 Batch Operations can restore objects from Glacier. The object itself remains archived in Glacier; the restore creates a temporary copy of each archived object that stays available for a specified number of days, after which Amazon S3 deletes the temporary copy.

Initiate Restore Object

A restore job can be created using the AWS Java SDK, where you can specify:

The Glacier job tier (Expedited, Standard, or Bulk), which determines the time taken for retrieval, ranging from a few minutes to 12 hours.
The location of a manifest file.
The expiration for restored objects.

CreateRestoreJob.java

A few useful points about restore requests:

The request returns a job ID, which can be used to track the job status.
The job is only responsible for triggering the actual restoration process.
The only way to track the restoration status of individual objects is through SQS notifications. We will discuss them in detail below.
A Glacier restore of an object cannot be canceled once it is in progress.

Manifest file contents

A manifest file is a CSV file stored in S3 that lists the objects to be restored from Glacier, one per line, as the bucket name followed by the object key. The content of a manifest.csv file used to restore the Glacier object “images/IMG_0001.jpg” from the “restore-bucket” bucket would be the single line restore-bucket,images/IMG_0001.jpg.

You can specify additional files on subsequent lines of this file. According to the documentation, manifest files can contain billions of objects.
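For more than a handful of objects, the manifest is easier to generate than to write by hand. The following is a minimal sketch, assuming the two-column bucket,key layout and plain keys that need no escaping; the second key, "images/IMG_0002.jpg", is made up for illustration. Keys containing special characters need URL encoding, which this sketch does not attempt.

```java
import java.util.List;
import java.util.stream.Collectors;

class ManifestSketch {

    /** Returns manifest content with one "bucket,key" line per object key. */
    public static String manifestFor(String bucket, List<String> keys) {
        return keys.stream()
                .map(key -> bucket + "," + key)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // The second key is hypothetical and only shows a multi-line manifest.
        String manifest = manifestFor("restore-bucket",
                List.of("images/IMG_0001.jpg", "images/IMG_0002.jpg"));
        System.out.println(manifest);
    }
}
```

The resulting text would then be uploaded to S3 as manifest.csv and referenced when the restore job is created.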

JobStatus response

We can use the AWS SDK to query the status of the job using the job ID. An example of some of the content of a job status response is shown below:

Retrieval Status of Individual Objects

S3 can be configured to send notifications for different types of events to SQS, SNS, or Lambda. The following two events provide useful information for tracking objects being restored from Glacier.

s3:ObjectRestore:Post: This event is published when the restoration starts for a single Glacier object. It is published for every object specified in the manifest file. It contains the following details:

event time
bucket name for the Glacier object
key for the Glacier object

s3:ObjectRestore:Completed: This event is published when the restoration is complete for a single Glacier object. It is published for every object specified in the manifest file. It contains the following details:

event time
bucket name for the Glacier object
key for the Glacier object
lifecycleRestorationExpiryTime: specifies when the restored object will no longer be available. The object must be downloaded before this expiry time.

Example of an s3:ObjectRestore:Post event:

Example of an s3:ObjectRestore:Completed event:

Time Taken to Restore Objects: Findings

S3 specifies a duration of 5–12 hours for a bulk-tier restoration. When we tested this by running a restore job for a single object, restoration took around 5 hours, as expected.

We then ran a restoration job with 160 images to determine whether having more objects would extend the time taken to complete the job. We recorded the start time and restoration time for every object (the end time was available in the S3 restore completed event that was sent to SQS). In this run, the restoration time for every object was between 5 and 6 hours.

This run indicates that individual objects should take 5–12 hours to complete, regardless of the total number of objects specified in the restore job. However, this may not be true for a larger number of objects (more than a million).
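The per-object timing described above (start time from the Post event, end time from the Completed event) could be tracked with something like the sketch below. It assumes the event name, bucket, key, and event time have already been extracted from each SQS message; the class name, method names, and sample timestamps are hypothetical and not taken from our actual implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RestoreTrackerSketch {
    private final Map<String, Instant> started = new HashMap<>();
    private final Map<String, Instant> completed = new HashMap<>();

    /** Records one notification; events other than Post and Completed are ignored. */
    public void onEvent(String eventName, String bucket, String key, Instant eventTime) {
        String objectId = bucket + "/" + key;
        // endsWith tolerates the name with or without the "s3:" prefix.
        if (eventName.endsWith("ObjectRestore:Post")) {
            started.put(objectId, eventTime);
        } else if (eventName.endsWith("ObjectRestore:Completed")) {
            completed.put(objectId, eventTime);
        }
    }

    /** Returns the restore duration once both events have been seen for the object. */
    public Optional<Duration> restoreDuration(String bucket, String key) {
        String objectId = bucket + "/" + key;
        Instant start = started.get(objectId);
        Instant end = completed.get(objectId);
        if (start == null || end == null) {
            return Optional.empty();
        }
        return Optional.of(Duration.between(start, end));
    }

    public static void main(String[] args) {
        RestoreTrackerSketch tracker = new RestoreTrackerSketch();
        // Timestamps are invented for illustration only.
        tracker.onEvent("ObjectRestore:Post", "restore-bucket", "images/IMG_0001.jpg",
                Instant.parse("2022-01-01T00:00:00Z"));
        tracker.onEvent("ObjectRestore:Completed", "restore-bucket", "images/IMG_0001.jpg",
                Instant.parse("2022-01-01T05:10:00Z"));
        tracker.restoreDuration("restore-bucket", "images/IMG_0001.jpg")
                .ifPresent(d -> System.out.println("Restored in " + d.toMinutes() + " minutes"));
    }
}
```

Keeping the two timestamps in separate maps means a Completed event that arrives before its Post event is still recorded, which matters because queue delivery order is not guaranteed.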

We also restored 110,000 records by initiating 11 parallel restore jobs of 10,000 records each. The findings are as follows:

Observed average restoration time per record: 5 hours
Observed maximum restoration time per record: 8 hours
Observed minimum restoration time per record: 5 hours

AWS Pricing

The price of S3 Standard storage is $0.021 per GB per month.
The price of S3 Glacier Deep Archive storage is $0.00099 per GB per month.
The price of bulk retrieval is $0.0025 per GB.
The price of bulk retrieval requests is $0.025 per 1,000 requests.
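To put these prices in perspective, the sketch below works through the arithmetic using only the figures listed above. The data volume (1,000 GB) and the retrieval size (100 GB across 10,000 objects) are hypothetical inputs chosen for illustration, and the calculation ignores other charges such as transition requests or minimum storage durations.

```java
class StorageCostSketch {
    // Prices quoted in this article, in USD.
    public static final double STANDARD_PER_GB_MONTH = 0.021;
    public static final double DEEP_ARCHIVE_PER_GB_MONTH = 0.00099;
    public static final double BULK_RETRIEVAL_PER_GB = 0.0025;
    public static final double BULK_REQUESTS_PER_1000 = 0.025;

    /** Monthly storage cost for the given volume at the given per-GB price. */
    public static double monthlyStorageCost(double gigabytes, double pricePerGbMonth) {
        return gigabytes * pricePerGbMonth;
    }

    /** Fraction of the storage bill saved by archiving instead of using S3 Standard. */
    public static double savingsFraction() {
        return 1.0 - DEEP_ARCHIVE_PER_GB_MONTH / STANDARD_PER_GB_MONTH;
    }

    /** One-off cost of a bulk retrieval: per-GB charge plus per-request charge. */
    public static double bulkRetrievalCost(double gigabytes, long requests) {
        return gigabytes * BULK_RETRIEVAL_PER_GB
                + (requests / 1000.0) * BULK_REQUESTS_PER_1000;
    }

    public static void main(String[] args) {
        double storedGb = 1000; // hypothetical data volume
        System.out.printf("S3 Standard:  $%.2f per month%n",
                monthlyStorageCost(storedGb, STANDARD_PER_GB_MONTH));
        System.out.printf("Deep Archive: $%.2f per month%n",
                monthlyStorageCost(storedGb, DEEP_ARCHIVE_PER_GB_MONTH));
        System.out.printf("Savings:      %.1f%%%n", savingsFraction() * 100);
        // Hypothetical retrieval: 100 GB spread over 10,000 objects.
        System.out.printf("Bulk restore: $%.2f%n", bulkRetrievalCost(100, 10_000));
    }
}
```

At these prices the archived copy costs roughly a twentieth of the S3 Standard copy, which is where the saving of more than 90% comes from.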

Summary

In this article, we have focused on two storage classes available in S3. However, S3 provides a range of storage classes that can fine-tune multi-temperature data management. You can find details for the same here.

We also discussed restoring files from Glacier using the bulk job tier. However, S3 Glacier also provides an expedited job tier that can retrieve data in 1–5 minutes, and a standard job tier, the default option, that can retrieve data within 3–5 hours. Here are the details for the same.

For all your cloud-based service requirements, reach out to us at Clairvoyant.
