An introduction to the Storage Classes in S3
Purpose
This blog draws on Clairvoyant’s extensive experience with Amazon S3 to explore how its storage classes can reduce your storage costs by more than 90%.
One of our customers in the Retail industry wanted to allow consumers to upload an unlimited number of images. Previously, we had used Amazon S3 with the S3 Standard storage class to store all user images. However, given the client’s business, this is a perfect case for multi-temperature data management, as most images are never accessed again. So, we decided to move images that were unlikely to be needed to the Glacier storage class. Read this blog if you want to know more about monitoring measures for S3 storage security.
Identifying which images would be required was a massive task in itself. We applied data science to this challenge, as it involved predicting consumer behavior. While that application deserves a detailed blog of its own, here we will focus on S3 and Glacier.
If a user requests those images back, we retrieve them using an S3 Batch Operations request.
Multi-temperature data management
In multi-temperature data management, frequently accessed data (hot and warm data) is stored on fast storage devices, while rarely accessed data (cold data) is stored on slower storage devices. Archived data is considered dormant and is stored on cold storage devices.
What is Amazon S3?
Amazon Simple Storage Service (Amazon S3) is storage for the Internet. Any amount of data can be stored and retrieved at any time, from anywhere on the web, using the Amazon S3 web services interface.
Head to this blog to learn about the best security practices of AWS S3 data.
Storage classes in Amazon S3
Amazon S3 offers a range of storage classes for the objects you store. You choose a storage class depending on your use case and performance access requirements.
S3 Standard vs. S3 Glacier storage classes
S3 Standard is the default storage class, best suited for performance-sensitive use cases requiring millisecond access times. The S3 Glacier and S3 Glacier Deep Archive storage classes are designed for low-cost data archival, with the same durability and resiliency as S3 Standard storage. These are the two extremes, and in between lie several other storage classes. For more details, follow the link.
Amazon S3 Object
Amazon S3 is an object store that uses unique key-value pairs. The key is the name that you assign to the object, and the value is the content you store.
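As an illustration, here is a minimal sketch of storing a key-value pair with the AWS SDK for Java (v1); the bucket name, key, and file are hypothetical, and the call assumes AWS credentials are configured:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class PutObjectExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Key: "images/IMG_0001.jpg"; value: the file's content.
        // Bucket name and file path are placeholders for illustration.
        s3.putObject("example-bucket", "images/IMG_0001.jpg",
                new File("IMG_0001.jpg"));
    }
}
```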
Lifecycle Management
S3 Lifecycle management lets you define a set of rules. These rules define actions that Amazon S3 applies to a group of objects, and we can leverage them to manage storage cost-effectively.
There are two types of actions: transition actions, which define when an S3 object transitions to another storage class, and expiration actions, which define when an S3 object expires.
The code below shows one way to add a lifecycle rule to an S3 bucket that moves objects to Glacier after two years, using the AWS SDK for Java.
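A minimal sketch with the AWS SDK for Java (v1); the bucket name and prefix are illustrative, and the rule transitions objects 730 days after creation:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
import com.amazonaws.services.s3.model.StorageClass;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;
import java.util.Arrays;

public class AddGlacierLifecycleRule {
    public static void main(String[] args) {
        // Move objects under the "images/" prefix to Glacier
        // 730 days (two years) after creation.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("MoveImagesToGlacierAfterTwoYears")
                .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("images/")))
                .addTransition(new Transition()
                        .withDays(730)
                        .withStorageClass(StorageClass.Glacier))
                .withStatus(BucketLifecycleConfiguration.ENABLED);

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // "example-bucket" is a placeholder bucket name.
        s3.setBucketLifecycleConfiguration("example-bucket",
                new BucketLifecycleConfiguration(Arrays.asList(rule)));
    }
}
```

Note that setting a lifecycle configuration replaces any existing one on the bucket, so existing rules should be fetched and merged first if you have them.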
Restoring Glacier archives
An S3 Batch Operations job can retrieve objects from Glacier; however, those objects remain archived in Glacier. The restore creates a temporary copy of each archived object for a specified number of days, after which Amazon S3 deletes the temporary copy.
Initiate Restore Object
This can be issued using the AWS Java SDK, where you can specify:
- The Glacier job tier (Expedited, Standard, or Bulk), which determines the retrieval time, ranging from 5 minutes to 12 hours
- The location of a manifest file
- The expiration period for restored objects
CreateRestoreJob.java
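A sketch of such a job using the AWS SDK for Java (v1) S3 Control client; the account ID, ARNs, ETag, and role below are placeholders you would replace with your own:

```java
import com.amazonaws.services.s3control.AWSS3Control;
import com.amazonaws.services.s3control.AWSS3ControlClientBuilder;
import com.amazonaws.services.s3control.model.*;
import java.util.UUID;

public class CreateRestoreJob {
    public static void main(String[] args) {
        // Placeholder account ID for illustration
        String accountId = "111111111111";

        AWSS3Control s3Control = AWSS3ControlClientBuilder.defaultClient();

        CreateJobResult result = s3Control.createJob(new CreateJobRequest()
                .withAccountId(accountId)
                .withOperation(new JobOperation()
                        .withS3InitiateRestoreObject(new S3InitiateRestoreObjectOperation()
                                .withExpirationInDays(7)       // how long restored copies stay available
                                .withGlacierJobTier("BULK"))) // retrieval tier, e.g. "BULK"
                .withManifest(new JobManifest()
                        .withSpec(new JobManifestSpec()
                                .withFormat("S3BatchOperations_CSV_20180820")
                                .withFields("Bucket", "Key"))
                        .withLocation(new JobManifestLocation()
                                // Placeholder ARN and ETag of the manifest object
                                .withObjectArn("arn:aws:s3:::restore-bucket/manifest.csv")
                                .withETag("manifest-object-etag")))
                .withReport(new JobReport().withEnabled(false))
                .withPriority(10)
                // Placeholder IAM role that grants Batch Operations access
                .withRoleArn("arn:aws:iam::111111111111:role/batch-operations-role")
                .withClientRequestToken(UUID.randomUUID().toString())
                .withConfirmationRequired(false));

        // The returned job ID can be used to track the job status
        System.out.println("Job ID: " + result.getJobId());
    }
}
```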
A few valuable points related to restore requests:
- The request returns a job ID, which can be used to track the job status.
- The job is only responsible for triggering the actual restoration process.
- The only way to track the restoration status of individual objects is through SQS notifications, which we discuss in detail below.
- A Glacier restore of an object cannot be canceled once it is in progress.
Manifest file contents
A manifest file is a CSV file stored in S3 that lists the objects to be restored from Glacier. The content of a manifest.csv file used to restore a Glacier object, “images/IMG_0001.jpg”, from the “restore-bucket” bucket, would be:
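For the example above, the manifest contains a single line in `Bucket,Key` format:

```csv
restore-bucket,images/IMG_0001.jpg
```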
You can specify additional files on subsequent lines of this file. According to the documentation, manifest files can contain billions of objects.
JobStatus response
We can use the AWS SDK to query the status of the job using the job ID. An example of some of the content of a Job Status response is shown below:
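An illustrative excerpt of a DescribeJob response (all values are placeholders):

```json
{
  "Job": {
    "JobId": "e1234567-89ab-cdef-0123-456789abcdef",
    "Status": "Active",
    "Priority": 10,
    "ProgressSummary": {
      "TotalNumberOfTasks": 160,
      "NumberOfTasksSucceeded": 42,
      "NumberOfTasksFailed": 0
    }
  }
}
```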
Retrieval Status of Individual Objects
S3 can be configured to send notifications for different types of events, e.g., to SQS, SNS, or Lambda. The following two events provide useful information for tracking objects being restored from Glacier.
s3:ObjectRestore:Post: this event is published when restoration starts for a single Glacier object. It is published for every object specified in the manifest file and contains the following details:
- event time
- bucket name of the Glacier object
- key of the Glacier object
s3:ObjectRestore:Completed: this event is published when restoration completes for a single Glacier object. It is published for every object specified in the manifest file and contains the following details:
- event time
- bucket name of the Glacier object
- key of the Glacier object
- lifecycleRestorationExpiryTime: specifies when the restored object will no longer be available; the object must be downloaded before this expiry time
Example of an s3:ObjectRestore:Post event:
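An illustrative event payload (timestamps and names are placeholders; note that the eventName field omits the "s3:" prefix):

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "eventTime": "2021-01-01T10:00:00.000Z",
      "eventName": "ObjectRestore:Post",
      "s3": {
        "bucket": { "name": "restore-bucket" },
        "object": { "key": "images/IMG_0001.jpg" }
      }
    }
  ]
}
```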
Example of an s3:ObjectRestore:Completed event:
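An illustrative completed-event payload (timestamps and names are placeholders); the expiry time arrives under glacierEventData:

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "eventTime": "2021-01-01T15:30:00.000Z",
      "eventName": "ObjectRestore:Completed",
      "s3": {
        "bucket": { "name": "restore-bucket" },
        "object": { "key": "images/IMG_0001.jpg" }
      },
      "glacierEventData": {
        "restoreEventData": {
          "lifecycleRestorationExpiryTime": "2021-01-08T00:00:00.000Z"
        }
      }
    }
  ]
}
```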
Time Taken to Restore Objects: Findings
S3 specifies a duration of 5–12 hours for a Bulk restoration operation. When we tested with a restore job for a single object, it took around 5 hours, in line with expectations.
We then started a restoration job with 160 images to determine whether having more objects would extend the time taken to complete the job. We recorded the start time and restoration time for every object (the end time was available in the S3 restore-completed event sent to SQS). In this run, the restoration time for every object was between 5 and 6 hours.
This run indicates that individual objects should take 5–12 hours to restore, regardless of the total number of objects specified in the restore job. However, this may not hold for a much larger number of objects (more than a million).
We then ran a restore of 110,000 records, initiated as 11 parallel restore jobs of 10,000 records each. The findings are as below:
- Average restoration time observed per record: 5 hours
- Maximum restoration time observed per record: 8 hours
- Minimum restoration time observed per record: 5 hours
AWS Pricing
- S3 Standard storage: $0.021 per GB per month
- S3 Glacier Deep Archive storage: $0.00099 per GB per month
- Bulk retrieval: $0.0025 per GB
- Bulk retrieval requests: $0.025 per 1,000 requests
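To make the savings concrete, a quick back-of-the-envelope calculation using the storage prices above (ignoring retrieval and request charges, and assuming a hypothetical 1 TiB of images):

```java
public class StorageCostComparison {
    // Storage prices from the list above (USD per GB-month)
    static final double S3_STANDARD = 0.021;
    static final double DEEP_ARCHIVE = 0.00099;

    // Monthly storage cost in USD for a given number of gigabytes
    static double monthlyCost(double gb, double pricePerGb) {
        return gb * pricePerGb;
    }

    public static void main(String[] args) {
        double gb = 1024; // 1 TiB of images (illustrative volume)
        double standard = monthlyCost(gb, S3_STANDARD);   // $21.50/month
        double archive  = monthlyCost(gb, DEEP_ARCHIVE);  // about $1.01/month
        double savings  = 100 * (1 - archive / standard); // roughly 95%
        System.out.printf("Standard: $%.2f, Deep Archive: $%.2f, savings: %.1f%%%n",
                standard, archive, savings);
    }
}
```

Moving cold images to Deep Archive cuts the pure storage cost by roughly 95%, which is where the "more than 90%" figure comes from.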
Summary
In this article, we focused on two of the storage classes available in S3. However, S3 provides a range of storage classes that can fine-tune multi-temperature data management. You can find details for the same here.
We also discussed restoring files from Glacier with the Bulk job tier. However, S3 Glacier also provides an Expedited job tier that can retrieve data in 1–5 minutes, and a Standard job tier, the default option, that can retrieve data within 3–5 hours. Here are the details for the same.
For all your cloud-based services requirements, reach out to us at Clairvoyant.