
AWS Cloud Solution: DynamoDB Tables Backup in S3 (Parquet)

By Prakshal Jain - November 24, 2020

A quick and easy approach to backing up DynamoDB data to S3 in Amazon Web Services

AWS S3 can serve as a low-cost solution for backing up DynamoDB tables and querying them later via Athena. To query the data through Athena, we must first register the S3 dataset with the Glue Data Catalog.

The end-to-end process uses S3, Glue, DynamoDB, and Athena, and follows these steps:

  1. Crawl the DynamoDB table with Glue to register the table’s metadata (schema) with the Glue Data Catalog; the Glue job in the next step will read from this catalog entry.

  2. Create a Glue job that copies the table contents into S3 in Parquet format.

  3. Crawl the S3 bucket with Glue to register the bucket with the Glue Data Catalog and query it with Athena to verify the accuracy of the data copy.

Pre-Requisites

Before getting started, we must first create an IAM role to use throughout the process; it needs permission to read/write the S3 bucket and to scan the DynamoDB table.

  1. Navigate to IAM in AWS Management Console

  2. Click on Policies under Access Management on the left menu.

  3. Click “Create Policy”

  4. Click on the JSON tab.

  5. Copy a JSON policy that grants access to the S3 bucket and the DynamoDB table into the editor; a sample policy sketch follows this list. ⚠️ Update the placeholder bucket name and table ARN with your own values before saving. ⚠️

  6. Save the policy as dynamodb-s3-parquet-policy

  7. Click on Roles under Access Management on the left menu.

  8. Click “Create Role”

  9. Select Glue from the list of services. Click “Next: Permissions”

  10. Add the following policies: AWSGlueServiceRole and dynamodb-s3-parquet-policy

  11. Click “Next: Tags” and add tags as necessary. Click “Next: Review”

  12. Provide a name for the role, such as glue-dynamodb-s3-role

  13. Click “Create Role”
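
A minimal sketch of what the policy from step 5 could look like. The bucket name my-backup-bucket, the table name my-table, the region, and the account ID are all placeholders to replace with your own values:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "S3ReadWrite",
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::my-backup-bucket",
            "arn:aws:s3:::my-backup-bucket/*"
          ]
        },
        {
          "Sid": "DynamoDBRead",
          "Effect": "Allow",
          "Action": ["dynamodb:DescribeTable", "dynamodb:Scan"],
          "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-table"
        }
      ]
    }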

Glue Crawler

We'll set up the Glue Crawler, which will crawl the DynamoDB table and extract the schema.

  1. Navigate to AWS Glue in AWS Management Console

  2. Click on Crawlers in the left menu under the Data Catalog heading.

  3. Click “Add Crawler”

  4. Enter a name for the crawler. Click Next

  5. For the Crawler source type, select “Data Stores.” Click Next

  6. Select DynamoDB from the Data Store dropdown

  7. Enter the name of the table

  8. Optional: ✅ Enable sampling if the table is large. It saves on scan costs because fewer rows are read to infer the table’s schema

  9. Click Next

  10. Select “No” for Add Another Data Store

  11. Select the IAM role created earlier (e.g. glue-dynamodb-s3-role)

  12. Select Run on Demand for frequency if it’s a one-time job. If not, select the appropriate time interval

  13. Click Next

  14. Select the Glue database to store the crawler results in. Click Next

  15. Review that everything is accurate. Click Finish to finalize

  16. Once the crawler is created, start it by clicking the checkbox next to the crawler name and clicking Run Crawler

  17. Once the crawler has finished running, navigate to Tables under Data Catalog to ensure the table was created and that the metadata, such as the DynamoDB table schema and row count, looks accurate
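
If you prefer scripting these console steps, the same crawler can be created with boto3. A rough sketch, assuming the hypothetical names ddb-backup-crawler, backup_db, and my-table, plus the glue-dynamodb-s3-role created in the pre-requisites:

    import boto3

    glue = boto3.client("glue")

    # Register a crawler that infers the DynamoDB table's schema into the Data Catalog
    glue.create_crawler(
        Name="ddb-backup-crawler",       # hypothetical crawler name
        Role="glue-dynamodb-s3-role",    # role created in the pre-requisites
        DatabaseName="backup_db",        # Glue database that stores the crawler results
        Targets={"DynamoDBTargets": [{"Path": "my-table"}]},  # DynamoDB table to crawl
    )

    # Run it on demand; the table should then appear under Data Catalog > Tables
    glue.start_crawler(Name="ddb-backup-crawler")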

Glue Job

Once the table has been crawled, it’s time to create the Glue job.

  1. Navigate to AWS Glue

  2. Click on Jobs on the left menu under the ETL heading

  3. Click “Add Job”

  4. Enter a name for the job

  5. Select an IAM role for the Glue job to use

  6. The role must be able to read from the DynamoDB table and write to the S3 bucket (e.g. the glue-dynamodb-s3-role created earlier)

  7. Select “Spark” for the Type

  8. Select “Spark 2.4, Python 3” for the Glue version

  9. Under This job runs, select “A proposed script generated by AWS Glue”

  10. Enter a name for the script

  11. Click Next

  12. For Choose a data source, select the table that was created by the crawler

  13. Ensure that the data location column shows the ARN of the DynamoDB table and that the classification column says DynamoDB

  14. Under Transform type, click “Change Schema”

  15. Under Data Target, select “Create tables in your Data Target”

  16. For data store, select Amazon S3

  17. Select Parquet for the format

  18. Enter the path of the S3 bucket where you’d like to store the data

  19. Click Next

  20. Ensure the data mapping is accurate and click Save Job and edit script

  21. Review the generated script and run the job from the top left

  22. Wait for the job to finish running. Monitor the logs and fix any access-related issues that arise
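
For reference, the script Glue proposes for this kind of job generally follows the pattern below. This is only a hand-written sketch, not the generated code: the database backup_db, table my_table, output path s3://my-backup-bucket/dynamodb/, and the single illustrative column mapping are all assumptions.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the DynamoDB table through the Data Catalog entry created by the crawler
    source = glueContext.create_dynamic_frame.from_catalog(
        database="backup_db", table_name="my_table"
    )

    # Map source columns to target columns/types (Glue generates one entry per column)
    mapped = ApplyMapping.apply(
        frame=source, mappings=[("id", "string", "id", "string")]
    )

    # Write the result to S3 as Parquet
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-backup-bucket/dynamodb/"},
        format="parquet",
    )

    job.commit()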

S3 Check

Let’s verify that the data was written to S3.

  1. Navigate to S3 in AWS Management Console

  2. Open the bucket where the DynamoDB data was written in Parquet format

  3. Ensure that the Parquet files are present
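
This check can also be scripted; a small sketch, assuming the hypothetical bucket my-backup-bucket and prefix dynamodb/:

    import boto3

    s3 = boto3.client("s3")

    # List the objects the Glue job wrote; expect .parquet files under the prefix
    resp = s3.list_objects_v2(Bucket="my-backup-bucket", Prefix="dynamodb/")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])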

Glue Data Catalog

Now we'll crawl the S3 bucket with Glue so the Parquet data is registered in the Glue Data Catalog and can be queried with Athena.

  1. Navigate to AWS Glue

  2. Click on Crawlers on the left menu under the Data Catalog heading

  3. Click “Add Crawler”

  4. Enter a name for the crawler. Click Next

  5. For Crawler source type, select Data Stores. Click Next

  6. Select S3 from the Data Store dropdown

  7. Enter the path of the S3 bucket where the data is being stored

  8. Click Next

  9. Select No for Add Another Data Store

  10. Choose an existing IAM role

  11. Select the role created earlier (e.g. glue-dynamodb-s3-role)

  12. Select Run on Demand for frequency if it’s a one-time job. If not, select the appropriate time interval

  13. Click Next

  14. Select the Glue database to store the crawler results in. Click Next

  15. Review that everything is accurate. Click Finish to finalize

  16. Once the crawler is created, start it by clicking the checkbox next to the crawler name and clicking Run Crawler

  17. Once the crawler has finished running, navigate to Tables under Data Catalog to ensure the table was created and that the metadata, such as the record count, looks accurate
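
Scripted, this crawler is nearly identical to the DynamoDB one sketched earlier; only the target type changes (names are again hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Crawl the Parquet output in S3 so Athena can query it via the Data Catalog
    glue.create_crawler(
        Name="s3-parquet-crawler",       # hypothetical crawler name
        Role="glue-dynamodb-s3-role",    # role created in the pre-requisites
        DatabaseName="backup_db",        # Glue database that stores the crawler results
        Targets={"S3Targets": [{"Path": "s3://my-backup-bucket/dynamodb/"}]},
    )
    glue.start_crawler(Name="s3-parquet-crawler")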

Querying via Athena

  1. Next, navigate to AWS Athena

  2. Select AWSDataCatalog for the Data Source on the left menu

  3. Choose the database in which the crawler stored its results.

  4. Use the query editor to run queries against the table created by the crawler

  5. Test functionality by running the following query to make sure the data was copied and the schema was extracted properly:

    SELECT * FROM <db>.<table> LIMIT 10;
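
The same check can be run outside the console with boto3; a sketch, assuming the hypothetical database backup_db, table my_table, and an S3 location for Athena’s query results:

    import boto3

    athena = boto3.client("athena")

    # Run the verification query; Athena writes result files to OutputLocation
    resp = athena.start_query_execution(
        QueryString="SELECT * FROM my_table LIMIT 10",
        QueryExecutionContext={"Database": "backup_db"},
        ResultConfiguration={"OutputLocation": "s3://my-backup-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])  # inspect results in the console or via get_query_results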

That’s it! The DynamoDB table data has been backed up to S3 in Parquet format and can be queried with Athena using SQL as needed! 🍻 For all your Cloud IT solutions, reach out to us at Clairvoyant.

Author
Prakshal Jain

Tags: Cloud Services
