
AWS Cloud Solution: DynamoDB Tables Backup in S3 (Parquet)

By Prakshal Jain - November 24, 2020

A quick and easy approach to backing up DynamoDB data to S3 in Amazon Web Services

AWS S3 can serve as a low-cost solution for backing up DynamoDB tables and querying them later via Athena. To query the data through Athena, we must first register the S3 dataset with the Glue Data Catalog.

The end-to-end process uses S3, Glue, DynamoDB, and Athena, and follows these steps:

  1. Crawl the DynamoDB table with Glue to register the table’s metadata (schema) with the Glue Data Catalog; the Glue job in the next step will read from this catalog entry.

  2. Create a Glue job that copies the table contents into S3 in Parquet format.

  3. Crawl the S3 bucket with Glue to register the bucket with the Glue Data Catalog and query it with Athena to verify the accuracy of the data copy.

Pre-Requisites

Before getting started, we must first create an IAM role to use throughout the process; it needs permission to read/write the S3 bucket and to scan the DynamoDB table.

  1. Navigate to IAM in AWS Management Console

  2. Click on Policies under Access Management on the left menu.

  3. Click “Create Policy”

  4. Click on the JSON tab.

  5. Copy a JSON policy that grants access to the S3 bucket and the DynamoDB table into the editor; a sample policy sketch follows this list. ⚠️ Update the placeholder bucket name and table ARN with your own values before saving. ⚠️

  6. Save the policy as dynamodb-s3-parquet-policy

  7. Click on Roles under Access Management on the left menu.

  8. Click “Create Role”

  9. Select Glue from the list of services. Click “Next: Permissions”

  10. Add the following policies: AWSGlueServiceRole and dynamodb-s3-parquet-policy

  11. Click “Next: Tags” and add tags as necessary. Click “Next: Review”

  12. Provide a name for the role, such as glue-dynamodb-s3-role

  13. Click “Create Role”
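
A minimal sketch of what the policy from step 5 could look like. The bucket name my-backup-bucket, the table name my-table, the region, and the account ID are all placeholders to replace with your own values:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "S3ReadWrite",
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::my-backup-bucket",
            "arn:aws:s3:::my-backup-bucket/*"
          ]
        },
        {
          "Sid": "DynamoDBRead",
          "Effect": "Allow",
          "Action": ["dynamodb:DescribeTable", "dynamodb:Scan"],
          "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-table"
        }
      ]
    }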

Glue Crawler

We'll set up the Glue Crawler, which will crawl the DynamoDB table and extract the schema.

  1. Navigate to AWS Glue in AWS Management Console

  2. Click on Crawlers in the left menu under the Data Catalog heading.

  3. Click “Add Crawler”

  4. Enter a name for the crawler. Click Next

  5. For the Crawler source type, select “Data Stores.” Click Next

  6. Select DynamoDB from the Data Store dropdown

  7. Enter the name of the table

  8. Optional: ✅ Enable sampling if the table is large. It saves on scan costs because fewer rows are read to infer the table’s schema

  9. Click Next

  10. Select “No” for Add Another Data Store

  11. Select the IAM role created earlier (e.g. glue-dynamodb-s3-role)

  12. Select Run on Demand for frequency if it’s a one-time job. If not, select the appropriate time interval

  13. Click Next

  14. Select the Glue database to store the crawler results in. Click Next

  15. Review that everything is accurate. Click Finish to finalize

  16. Once the crawler is created, start it by clicking the checkbox next to the crawler name and clicking Run Crawler

  17. Once the crawler has finished running, navigate to Tables under Data Catalog to ensure the table was created and that the metadata, such as the DynamoDB table schema and row count, looks accurate
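
If you prefer scripting these console steps, the same crawler can be created with boto3. A rough sketch, assuming the hypothetical names ddb-backup-crawler, backup_db, and my-table, plus the glue-dynamodb-s3-role created in the pre-requisites:

    import boto3

    glue = boto3.client("glue")

    # Register a crawler that infers the DynamoDB table's schema into the Data Catalog
    glue.create_crawler(
        Name="ddb-backup-crawler",       # hypothetical crawler name
        Role="glue-dynamodb-s3-role",    # role created in the pre-requisites
        DatabaseName="backup_db",        # Glue database that stores the crawler results
        Targets={"DynamoDBTargets": [{"Path": "my-table"}]},  # DynamoDB table to crawl
    )

    # Run it on demand; the table should then appear under Data Catalog > Tables
    glue.start_crawler(Name="ddb-backup-crawler")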

Glue Job

Once the table has been crawled, it’s time to create the Glue job.

  1. Navigate to AWS Glue

  2. Click on Jobs on the left menu under the ETL heading

  3. Click “Add Job”

  4. Enter a name for the job

  5. Select an IAM role for the Glue job to use

  6. The role must be able to read from the DynamoDB table and write to the S3 bucket (e.g. the glue-dynamodb-s3-role created earlier)

  7. Select “Spark” for the Type

  8. Select “Spark 2.4, Python 3” for the Glue version

  9. Under This job runs, select “A proposed script generated by AWS Glue”

  10. Enter a name for the script

  11. Click Next

  12. For Choose a data source, select the table that was created by the crawler

  13. Ensure that the data location column shows the ARN of the DynamoDB table and that the classification column says DynamoDB

  14. Under Transform type, click “Change Schema”

  15. Under Data Target, select “Create tables in your Data Target”

  16. For data store, select Amazon S3

  17. Select Parquet for the format

  18. Enter the path of the S3 bucket where you’d like to store the data

  19. Click Next

  20. Ensure the data mapping is accurate and click Save Job and edit script

  21. Review the generated script and run the job from the top left

  22. Wait for the job to finish running. Monitor the logs and fix any access-related issues that arise
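
For reference, the script Glue proposes for this kind of job generally follows the pattern below. This is only a hand-written sketch, not the generated code: the database backup_db, table my_table, output path s3://my-backup-bucket/dynamodb/, and the single illustrative column mapping are all assumptions.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the DynamoDB table through the Data Catalog entry created by the crawler
    source = glueContext.create_dynamic_frame.from_catalog(
        database="backup_db", table_name="my_table"
    )

    # Map source columns to target columns/types (Glue generates one entry per column)
    mapped = ApplyMapping.apply(
        frame=source, mappings=[("id", "string", "id", "string")]
    )

    # Write the result to S3 as Parquet
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-backup-bucket/dynamodb/"},
        format="parquet",
    )

    job.commit()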

S3 Check

Let’s verify that the data was written to S3.

  1. Navigate to S3 in AWS Management Console

  2. Open the bucket where the DynamoDB data was written in Parquet format

  3. Ensure that the Parquet files are present
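
This check can also be scripted; a small sketch, assuming the hypothetical bucket my-backup-bucket and prefix dynamodb/:

    import boto3

    s3 = boto3.client("s3")

    # List the objects the Glue job wrote; expect .parquet files under the prefix
    resp = s3.list_objects_v2(Bucket="my-backup-bucket", Prefix="dynamodb/")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])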

Glue Data Catalog

Now we'll crawl the S3 bucket with Glue so the Parquet data is registered in the Glue Data Catalog and can be queried with Athena.

  1. Navigate to AWS Glue

  2. Click on Crawlers on the left menu under the Data Catalog heading

  3. Click “Add Crawler”

  4. Enter a name for the crawler. Click Next

  5. For Crawler source type, select Data Stores. Click Next

  6. Select S3 from the Data Store dropdown

  7. Enter the path of the S3 bucket where the data is being stored

  8. Click Next

  9. Select No for Add Another Data Store

  10. Choose an existing IAM role

  11. Select the role created earlier (e.g. glue-dynamodb-s3-role)

  12. Select Run on Demand for frequency if it’s a one-time job. If not, select the appropriate time interval

  13. Click Next

  14. Select the Glue database to store the crawler results in. Click Next

  15. Review that everything is accurate. Click Finish to finalize

  16. Once the crawler is created, start it by clicking the checkbox next to the crawler name and clicking Run Crawler

  17. Once the crawler has finished running, navigate to Tables under Data Catalog to ensure the table was created and that the metadata, such as the record count, looks accurate
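
Scripted, this crawler is nearly identical to the DynamoDB one sketched earlier; only the target type changes (names are again hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Crawl the Parquet output in S3 so Athena can query it via the Data Catalog
    glue.create_crawler(
        Name="s3-parquet-crawler",       # hypothetical crawler name
        Role="glue-dynamodb-s3-role",    # role created in the pre-requisites
        DatabaseName="backup_db",        # Glue database that stores the crawler results
        Targets={"S3Targets": [{"Path": "s3://my-backup-bucket/dynamodb/"}]},
    )
    glue.start_crawler(Name="s3-parquet-crawler")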

Querying via Athena

  1. Next, navigate to AWS Athena

  2. Select AWSDataCatalog for the Data Source on the left menu

  3. Choose the database in which the crawler stored its results.

  4. Use the query editor to run queries against the table created by the crawler

  5. Test functionality by running the following query to make sure the data was copied and the schema was extracted properly:

    SELECT * FROM <db>.<table> LIMIT 10;
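
The same check can be run outside the console with boto3; a sketch, assuming the hypothetical database backup_db, table my_table, and an S3 location for Athena’s query results:

    import boto3

    athena = boto3.client("athena")

    # Run the verification query; Athena writes result files to OutputLocation
    resp = athena.start_query_execution(
        QueryString="SELECT * FROM my_table LIMIT 10",
        QueryExecutionContext={"Database": "backup_db"},
        ResultConfiguration={"OutputLocation": "s3://my-backup-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])  # inspect results in the console or via get_query_results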

That’s it! The DynamoDB table data has been backed up to S3 in Parquet format and can be queried with Athena using SQL as needed! 🍻 For all your Cloud IT solutions, reach out to us at Clairvoyant.

Author
Prakshal Jain

Tags: Cloud Services
