Was: $90 | Today: $50
Was: $108 | Today: $60
Was: $126 | Today: $70
Why Should You Prepare For Your AWS Certified Data Engineer - Associate (DEA-C01) With MyCertsHub?
At MyCertsHub, we go beyond standard study material. Our platform provides authentic Amazon Data-Engineer-Associate Exam Dumps, detailed exam guides, and reliable practice exams that mirror the actual AWS Certified Data Engineer - Associate (DEA-C01) test. Whether you’re targeting Amazon certifications or expanding your professional portfolio, MyCertsHub gives you the tools to succeed on your first attempt.
Verified Data-Engineer-Associate Exam Dumps
Every set of exam dumps is carefully reviewed by certified experts to ensure accuracy. For the Data-Engineer-Associate AWS Certified Data Engineer - Associate (DEA-C01) exam, you'll receive updated practice questions designed to reflect real-world exam conditions. This approach saves time, builds confidence, and focuses your preparation on the most important exam areas.
Realistic Test Prep For The Data-Engineer-Associate
You can instantly access downloadable PDFs of Data-Engineer-Associate practice exams with MyCertsHub. These include authentic practice questions paired with explanations, making our exam guide a complete preparation tool. By testing yourself before exam day, you’ll walk into the Amazon Exam with confidence.
Smart Learning With Exam Guides
Our structured Data-Engineer-Associate exam guide focuses on the AWS Certified Data Engineer - Associate (DEA-C01)'s core topics and question patterns. You will be able to concentrate on what really matters for passing the test rather than wasting time on irrelevant content.
Pass the Data-Engineer-Associate Exam – Guaranteed
We Offer A 100% Money-Back Guarantee On Our Products.
If you do not pass the AWS Certified Data Engineer - Associate (DEA-C01) exam after preparing with MyCertsHub's exam dumps, we will issue a full refund. That’s how confident we are in the effectiveness of our study resources.
Try Before You Buy – Free Demo
Still undecided? See for yourself how MyCertsHub has helped thousands of candidates achieve success by downloading a free demo of the Data-Engineer-Associate exam dumps.
MyCertsHub – Your Trusted Partner For Amazon Exams
Whether you’re preparing for AWS Certified Data Engineer - Associate (DEA-C01) or any other professional credential, MyCertsHub provides everything you need: exam dumps, practice exams, practice questions, and exam guides. Passing your Data-Engineer-Associate exam has never been easier thanks to our tried-and-true resources.
Question # 1
A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?
A. Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
B. Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
C. Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
D. Write an AWS Glue Python shell job. Use pandas to transform the data.
Answer: D
Explanation: AWS Glue is a fully managed serverless ETL service that can handle various
data sources and formats, including .csv files in Amazon S3. AWS Glue provides two types
of jobs: PySpark and Python shell. PySpark jobs use Apache Spark to process large-scale
data in parallel, while Python shell jobs use Python scripts to process small-scale data in a single execution environment. For this requirement, a Python shell job is more suitable and
cost-effective, as the size of each S3 object is less than 100 MB, which does not require
distributed processing. A Python shell job can use pandas, a popular Python library for data
analysis, to transform the .csv data as needed. The other solutions are not optimal or
relevant for this requirement. Writing a custom Python application and hosting it on an
Amazon EKS cluster would require more effort and resources to set up and manage the
Kubernetes environment, as well as to handle the data ingestion and transformation logic.
Writing a PySpark ETL script and hosting it on an Amazon EMR cluster would also incur
more costs and complexity to provision and configure the EMR cluster, as well as to use
Apache Spark for processing small data files. Writing an AWS Glue PySpark job would also
be less efficient and economical than a Python shell job, as it would involve unnecessary
overhead and charges for using Apache Spark for small data files.
References:
AWS Glue
Working with Python Shell Jobs
pandas
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide]
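As a rough, hedged sketch of what option D describes, a Glue Python shell job could read a small .csv object with pandas and write the transformed result back to S3. The bucket names, key, and transformation below are assumptions for illustration only, not the exam's reference implementation.

import boto3
import pandas as pd

s3 = boto3.client("s3")
source_bucket = "daily-uploads-bucket"      # hypothetical bucket name
target_bucket = "curated-data-bucket"       # hypothetical bucket name
key = "incoming/2024-01-01/orders.csv"      # hypothetical object key

# Read the small (<100 MB) .csv object straight into a pandas DataFrame.
obj = s3.get_object(Bucket=source_bucket, Key=key)
df = pd.read_csv(obj["Body"])

# Example transformation: drop fully empty rows and normalize column names.
df = df.dropna(how="all")
df.columns = [c.strip().lower() for c in df.columns]

# Write the cleaned data back to S3.
s3.put_object(
    Bucket=target_bucket,
    Key="curated/orders_clean.csv",
    Body=df.to_csv(index=False).encode("utf-8"),
)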
Question # 2
A financial company wants to implement a data mesh. The data mesh must support centralized data governance, data analysis, and data access control. The company has decided to use AWS Glue for data catalogs and extract, transform, and load (ETL) operations.
Which combination of AWS services will implement a data mesh? (Choose two.)
A. Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
C. Use AWS Glue DataBrew for centralized data governance and access control.
D. Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.
Answer: B,E
Explanation: A data mesh is an architectural framework that organizes data into domains
and treats data as products that are owned and offered for consumption by different
teams1. A data mesh requires a centralized layer for data governance and access control,
as well as a distributed layer for data storage and analysis. AWS Glue can provide data
catalogs and ETL operations for the data mesh, but it cannot provide data governance and
access control by itself2. Therefore, the company needs to use another AWS service for
this purpose. AWS Lake Formation is a service that allows you to create, secure, and manage data lakes on AWS3. It integrates with AWS Glue and other AWS services to
provide centralized data governance and access control for the data mesh. Therefore,
option E is correct.
For data storage and analysis, the company can choose from different AWS services
depending on their needs and preferences. However, one of the benefits of a data mesh is
that it enables data to be stored and processed in a decoupled and scalable way1.
Therefore, using serverless or managed services that can handle large volumes and
varieties of data is preferable. Amazon S3 is a highly scalable, durable, and secure object
storage service that can store any type of data. Amazon Athena is a serverless interactive
query service that can analyze data in Amazon S3 using standard SQL. Therefore, option
B is a good choice for data storage and analysis in a data mesh. Options A, C, and D are
not optimal because they either use relational databases that are not suitable for storing
diverse and unstructured data, or they require more management and provisioning than
serverless services.
References:
1: What is a Data Mesh? - Data Mesh Architecture Explained - AWS
2: AWS Glue - Developer Guide
3: AWS Lake Formation - Features
[4]: Design a data mesh architecture using AWS Lake Formation and AWS Glue
[5]: Amazon S3 - Features
[6]: Amazon Athena - Features
Question # 3
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway.
Which solution will meet these requirements with the LEAST operational overhead?
A. Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
B. Create an AWS Lambda Python function with provisioned concurrency.
C. Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
D. Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.
Answer: B
Explanation: AWS Lambda is a serverless compute service that lets you run code without
provisioning or managing servers. You can use Lambda to create functions that perform
custom logic and integrate with other AWS services, such as API Gateway. Lambda
automatically scales your application by running code in response to each trigger. You pay
only for the compute time you consume1.
Amazon ECS is a fully managed container orchestration service that allows you to run and
scale containerized applications on AWS. You can use ECS to deploy, manage, and scale
Docker containers using either Amazon EC2 instances or AWS Fargate, a serverless
compute engine for containers2.
Amazon EKS is a fully managed Kubernetes service that allows you to run Kubernetes on AWS without installing and operating your own Kubernetes control plane. Both Amazon ECS and Amazon EKS require you to build, deploy, and manage container infrastructure for a script that only runs occasionally, and option D's scheduled mock invocations add ongoing configuration and cost. A Lambda function with provisioned concurrency (option B) stays initialized and returns results to API Gateway with no servers or warm-up plumbing to manage, so it has the least operational overhead.
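As a minimal sketch of the kind of handler option B implies, assuming an API Gateway Lambda proxy integration; the query-string parameter and response fields below are illustrative only.

import json

def lambda_handler(event, context):
    # Occasional invocation arriving from API Gateway (Lambda proxy integration).
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")   # hypothetical query-string parameter

    # API Gateway proxy integrations expect this response shape.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }

Provisioned concurrency is then configured on the function or its alias so the handler is already initialized when API Gateway calls it.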
Question # 4
A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views.
Which solution will meet this requirement with the LEAST effort?
A. Use Apache Airflow to refresh the materialized views.
B. Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views.
C. Use the query editor v2 in Amazon Redshift to refresh the materialized views.
D. Use an AWS Glue workflow to refresh the materialized views.
Answer: C
Explanation: The query editor v2 in Amazon Redshift is a web-based tool that allows
users to run SQL queries and scripts on Amazon Redshift clusters. The query editor v2
supports creating and managing materialized views, which are precomputed results of a
query that can improve the performance of subsequent queries. The query editor v2 also
supports scheduling queries to run at specified intervals, which can be used to refresh
materialized views automatically. This solution requires the least effort, as it does not
involve any additional services, coding, or configuration. The other solutions are more
complex and require more operational overhead. Apache Airflow is an open-source
platform for orchestrating workflows, which can be used to refresh materialized views, but it
requires setting up and managing an Airflow environment, creating DAGs (directed acyclic
graphs) to define the workflows, and integrating with Amazon Redshift. AWS Lambda is a
serverless compute service that can run code in response to events, which can be used to refresh materialized views, but it requires creating and deploying Lambda functions,
defining UDFs within Amazon Redshift, and triggering the functions using events or
schedules. AWS Glue is a fully managed ETL service that can run jobs to transform and
load data, which can be used to refresh materialized views, but it requires creating and
configuring Glue jobs, defining Glue workflows to orchestrate the jobs, and scheduling the
workflows using triggers.
References:
Query editor V2
Working with materialized views
Scheduling queries
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide]
Question # 5
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.
Which solution will meet these requirements with the LEAST operational overhead?
A. Establish WebSocket connections to Amazon Redshift.
B. Use the Amazon Redshift Data API.
C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.
Answer: B
Explanation: The Amazon Redshift Data API is a built-in feature that allows you to run
SQL queries on Amazon Redshift data with web services-based applications, such as AWS
Lambda, Amazon SageMaker notebooks, and AWS Cloud9. The Data API does not require
a persistent connection to your database, and it provides a secure HTTP endpoint and
integration with AWS SDKs. You can use the endpoint to run SQL statements without
managing connections. The Data API also supports both Amazon Redshift provisioned
clusters and Redshift Serverless workgroups. The Data API is the best solution for running
real-time queries on the financial data from within the trading application, as it has the least
operational overhead compared to the other options.
Option A is not the best solution, as establishing WebSocket connections to Amazon
Redshift would require more configuration and maintenance than using the Data API.
WebSocket connections are also not supported by Amazon Redshift clusters or serverless workgroups.
Option C is not the best solution, as setting up JDBC connections to Amazon Redshift
would also require more configuration and maintenance than using the Data API. JDBC
connections are also not supported by Redshift Serverless workgroups.
Option D is not the best solution, as storing frequently accessed data in Amazon S3 and
using Amazon S3 Select to run the queries would introduce additional latency and
complexity than using the Data API. Amazon S3 Select is also not optimized for real-time
queries, as it scans the entire object before returning the results.
References:
Using the Amazon Redshift Data API
Calling the Data API
Amazon Redshift Data API Reference
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
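A minimal sketch of calling the Data API from application code, assuming a Redshift Serverless workgroup; the workgroup name, database, and SQL below are hypothetical placeholders.

import time
import boto3

client = boto3.client("redshift-data")

response = client.execute_statement(
    WorkgroupName="trading-serverless",   # use ClusterIdentifier instead for a provisioned cluster
    Database="finance",
    Sql="SELECT symbol, price FROM trades ORDER BY trade_time DESC LIMIT 10;",
)
statement_id = response["Id"]

# Poll until the statement finishes, then fetch the results over the same HTTP API.
status = client.describe_statement(Id=statement_id)["Status"]
while status in ("SUBMITTED", "PICKED", "STARTED"):
    time.sleep(0.5)
    status = client.describe_statement(Id=statement_id)["Status"]

if status == "FINISHED":
    for row in client.get_statement_result(Id=statement_id)["Records"]:
        print(row)

No JDBC driver or persistent connection is involved; the application only needs IAM permissions for the redshift-data actions.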
Question # 6
A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A. Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
B. Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
C. Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
D. Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
E. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.
Answer: A,B
Explanation: Option A and B are the correct answers because they meet the requirements
most cost-effectively. Using an AWS Lambda function and the Athena Boto3 client
start_query_execution API call to invoke the Athena queries programmatically is a simple
and scalable way to orchestrate the queries. Creating an AWS Step Functions workflow
and adding two states to check the query status and invoke the next query is a reliable and
efficient way to handle the long-running queries.
Option C is incorrect because using an AWS Glue Python shell job to invoke the Athena
queries programmatically is more expensive than using a Lambda function, as it requires
provisioning and running a Glue job for each query.
Option D is incorrect because using an AWS Glue Python shell script to run a sleep timer
that checks every 5 minutes to determine whether the current Athena query has finished
running successfully is not a cost-effective or reliable way to orchestrate the queries, as it
wastes resources and time.
Option E is incorrect because using Amazon Managed Workflows for Apache Airflow
(Amazon MWAA) to orchestrate the Athena queries in AWS Batch is an overkill solution
that introduces unnecessary complexity and cost, as it requires setting up and managing an Airflow environment and an AWS Batch compute environment.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Getting Started, Tutorial: Create a Hello World Workflow, Pages 1-8
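A rough sketch of the two Lambda handlers that options A and B describe; the database, query, and output location are hypothetical.

import boto3

athena = boto3.client("athena")

def start_query(event, context):
    # Kick off one Athena query and hand its execution ID to the Step Functions workflow.
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM sales_db.daily_orders;",           # hypothetical query
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}

def check_query(event, context):
    # Called after each Wait state to poll whether the query has finished.
    execution_id = event["QueryExecutionId"]
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]   # QUEUED, RUNNING, SUCCEEDED, FAILED, or CANCELLED
    return {"QueryExecutionId": execution_id, "State": state}

The state machine loops through the Wait state until the returned State is SUCCEEDED and then starts the next query, so no single Lambda invocation has to outlast a 15-minute query.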
Question # 7
A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.
Which solution will meet this requirement with the LEAST operational effort?
A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.
Answer: C
Explanation: AWS Glue is a fully managed service that provides a serverless data
integration platform for data preparation, data cataloging, and data loading. AWS Glue
Studio is a graphical interface that allows you to easily author, run, and monitor AWS Glue
ETL jobs. AWS Glue Data Quality is a feature that enables you to validate, cleanse, and
enrich your data using predefined or custom rules. AWS Step Functions is a service that
allows you to coordinate multiple AWS services into serverless workflows.
Using the Detect PII transform in AWS Glue Studio, you can automatically identify and
label the PII in your dataset, such as names, addresses, phone numbers, email addresses,
etc. You can then create a rule in AWS Glue Data Quality to obfuscate the PII, such as
masking, hashing, or replacing the values with dummy data. You can also use other rules
to validate and cleanse your data, such as checking for null values, duplicates, outliers, etc.
You can then use an AWS Step Functions state machine to orchestrate a data pipeline to
ingest the data into the S3 data lake. You can use AWS Glue DataBrew to visually explore
and transform the data, AWS Glue crawlers to discover and catalog the data, and AWS
Glue jobs to load the data into the S3 data lake.
This solution will meet the requirement with the least operational effort, as it leverages the
serverless and managed capabilities of AWS Glue, AWS Glue Studio, AWS Glue Data
Quality, and AWS Step Functions. You do not need to write any code to identify or
obfuscate the PII, as you can use the built-in transforms and rules in AWS Glue Studio and
AWS Glue Data Quality. You also do not need to provision or manage any servers or
clusters, as AWS Glue and AWS Step Functions scale automatically based on the demand.
The other options are not as efficient as using the Detect PII transform in AWS Glue
Studio, creating a rule in AWS Glue Data Quality, and using an AWS Step Functions state
machine. Using an Amazon Kinesis Data Firehose delivery stream to process the dataset,
creating an AWS Lambda transform function to identify the PII, using an AWS SDK to
obfuscate the PII, and setting the S3 data lake as the target for the delivery stream will require more operational effort, as you will need to write and maintain code to identify and
obfuscate the PII, as well as manage the Lambda function and its resources. Using the
Detect PII transform in AWS Glue Studio to identify the PII, obfuscating the PII, and using
an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into
the S3 data lake will not be as effective as creating a rule in AWS Glue Data Quality to
obfuscate the PII, as you will need to manually obfuscate the PII after identifying it, which
can be error-prone and time-consuming. Ingesting the dataset into Amazon DynamoDB,
creating an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table
and to transform the data, and using the same Lambda function to ingest the data into the
S3 data lake will require more operational effort, as you will need to write and maintain
code to identify and obfuscate the PII, as well as manage the Lambda function and its
resources. You will also incur additional costs and complexity by using DynamoDB as an
intermediate data store, which may not be necessary for your use case.
References:
AWS Glue
AWS Glue Studio
AWS Glue Data Quality
[AWS Step Functions]
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide],
Chapter 6: Data Integration and Transformation, Section 6.1: AWS Glue
Question # 8
During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.
A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials.
Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
A. Store the credentials in the AWS Glue job parameters.
B. Store the credentials in a configuration file that is in an Amazon S3 bucket.
C. Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job.
D. Store the credentials in AWS Secrets Manager.
E. Grant the AWS Glue job IAM role access to the stored credentials.
Answer: D,E
Explanation: AWS Secrets Manager is a service that allows you to securely store and
manage secrets, such as database credentials, API keys, passwords, etc. You can use Secrets Manager to encrypt, rotate, and audit your secrets, as well as to control access to
them using fine-grained policies. AWS Glue is a fully managed service that provides a
serverless data integration platform for data preparation, data cataloging, and data loading.
AWS Glue jobs allow you to transform and load data from various sources into various
targets, using either a graphical interface (AWS Glue Studio) or a code-based interface
(AWS Glue console or AWS Glue API).
Storing the credentials in AWS Secrets Manager and granting the AWS Glue job IAM role
access to the stored credentials will meet the requirements, as it will remediate the security
vulnerability in the AWS Glue job and securely store the credentials. By using AWS Secrets
Manager, you can avoid hard coding the credentials in the job script, which is a bad
practice that exposes the credentials to unauthorized access or leakage. Instead, you can
store the credentials as a secret in Secrets Manager and reference the secret name or
ARN in the job script. You can also use Secrets Manager to encrypt the credentials using
AWS Key Management Service (AWS KMS), rotate the credentials automatically or on
demand, and monitor the access to the credentials using AWS CloudTrail. By granting the
AWS Glue job IAM role access to the stored credentials, you can use the principle of least
privilege to ensure that only the AWS Glue job can retrieve the credentials from Secrets
Manager. You can also use resource-based or tag-based policies to further restrict the
access to the credentials.
The other options are not as secure as storing the credentials in AWS Secrets Manager
and granting the AWS Glue job IAM role access to the stored credentials. Storing the
credentials in the AWS Glue job parameters will not remediate the security vulnerability, as
the job parameters are still visible in the AWS Glue console and API. Storing the
credentials in a configuration file that is in an Amazon S3 bucket and accessing the
credentials from the configuration file by using the AWS Glue job will not be as secure as
using Secrets Manager, as the configuration file may not be encrypted or rotated, and the
access to the file may not be audited or controlled.
References:
AWS Secrets Manager
AWS Glue
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 6: Data Integration and Transformation, Section 6.1: AWS Glue
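A minimal sketch of how the Glue job script could retrieve the credentials at run time instead of hard coding them; the secret name, JSON keys, and JDBC URL below are hypothetical.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Fetch the secret that replaces the hard-coded credentials.
secret_value = secrets.get_secret_value(SecretId="prod/redshift/etl-user")
credentials = json.loads(secret_value["SecretString"])

connection_options = {
    "url": "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
    "user": credentials["username"],
    "password": credentials["password"],
}
# The Glue job's IAM role must allow secretsmanager:GetSecretValue on this secret,
# which is exactly what option E grants.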
Question # 9
A company needs to partition the Amazon S3 storage that the company uses for a data lake. The partitioning will use a path of the S3 object keys in the following format: s3://bucket/prefix/year=2023/month=01/day=01.
A data engineer must ensure that the AWS Glue Data Catalog synchronizes with the S3 storage when the company adds new partitions to the bucket.
Which solution will meet these requirements with the LEAST latency?
A. Schedule an AWS Glue crawler to run every morning.
B. Manually run the AWS Glue CreatePartition API twice each day.
C. Use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create partition API call.
D. Run the MSCK REPAIR TABLE command from the AWS Glue console.
Answer: C
Explanation: The best solution to ensure that the AWS Glue Data Catalog synchronizes
with the S3 storage when the company adds new partitions to the bucket with the least
latency is to use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create
partition API call. This way, the Data Catalog is updated as soon as new data is written to
S3, and the partition information is immediately available for querying by other
services. The Boto3 AWS Glue create partition API call allows you to create a new partition
in the Data Catalog by specifying the table name, the database name, and the partition
values1. You can use this API call in your code that writes data to S3, such as a Python
script or an AWS Glue ETL job, to create a partition for each new S3 object key that
matches the partitioning scheme.
Option A is not the best solution, as scheduling an AWS Glue crawler to run every morning
would introduce a significant latency between the time new data is written to S3 and the
time the Data Catalog is updated. AWS Glue crawlers are processes that connect to a data
store, progress through a prioritized list of classifiers to determine the schema for your
data, and then create metadata tables in the Data Catalog2. Crawlers can be scheduled to
run periodically, such as daily or hourly, but they cannot run continuously or in real-time.
Therefore, using a crawler to synchronize the Data Catalog with the S3 storage would not
meet the requirement of the least latency.
Option B is not the best solution, as manually running the AWS Glue CreatePartition API twice each day would also introduce a significant latency between the time new data is
written to S3 and the time the Data Catalog is updated. Moreover, manually running the
API would require more operational overhead and human intervention than using code that
writes data to S3 to invoke the API automatically.
Option D is not the best solution, as running the MSCK REPAIR TABLE command from the
AWS Glue console would also introduce a significant latency between the time new data is
written to S3 and the time the Data Catalog is updated. The MSCK REPAIR TABLE
command is a SQL command that you can run in the AWS Glue console to add partitions
to the Data Catalog based on the S3 object keys that match the partitioning scheme3.
However, this command is not meant to be run frequently or in real-time, as it can take a
long time to scan the entire S3 bucket and add the partitions. Therefore, using this
command to synchronize the Data Catalog with the S3 storage would not meet the
requirement of the least latency.
References:
AWS Glue CreatePartition API
Populating the AWS Glue Data Catalog
MSCK REPAIR TABLE Command
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
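As an illustrative sketch of option C, the code that writes a new object to S3 could register the matching partition in the Data Catalog with a single Boto3 call. The database, table, and storage formats below are assumptions and should match however the table is actually defined.

import boto3

glue = boto3.client("glue")

year, month, day = "2023", "01", "01"   # derived from the S3 key that was just written

glue.create_partition(
    DatabaseName="datalake_db",          # hypothetical database
    TableName="events",                  # hypothetical table
    PartitionInput={
        "Values": [year, month, day],
        "StorageDescriptor": {
            "Location": f"s3://bucket/prefix/year={year}/month={month}/day={day}/",
            # Input/output formats and SerDe must match the table definition; Parquet is assumed here.
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)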
Question # 10
A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.
Which solution will meet these requirements in the MOST operationally efficient way?
A. Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
B. Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.
C. Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
D. Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.
Answer: A
Explanation: Option A is the most operationally efficient way to meet the requirements
because it minimizes the number of steps and services involved in the data export process.
AWS Glue is a fully managed service that can extract, transform, and load (ETL) data from
various sources to various destinations, including Amazon S3. AWS Glue can also convert
data to different formats, such as Parquet, which is a columnar storage format that is
optimized for analytics. By creating a view in the SQL Server databases that contains the
required data elements, the AWS Glue job can select the data directly from the view
without having to perform any joins or transformations on the source data. The AWS Glue
job can then transfer the data in Parquet format to an S3 bucket and run on a daily
schedule.
Option B is not operationally efficient because it involves multiple steps and services to
export the data. SQL Server Agent is a tool that can run scheduled tasks on SQL Server
databases, such as executing SQL queries. However, SQL Server Agent cannot directly
export data to S3, so the query output must be saved as .csv objects on the EC2 instance.
Then, an S3 event must be configured to trigger an AWS Lambda function that can
transform the .csv objects to Parquet format and upload them to S3. This option adds
complexity and latency to the data export process and requires additional resources and
configuration.
Option C is not operationally efficient because it introduces an unnecessary step of running
an AWS Glue crawler to read the view. An AWS Glue crawler is a service that can scan
data sources and create metadata tables in the AWS Glue Data Catalog. The Data Catalog
is a central repository that stores information about the data sources, such as schema,
format, and location. However, in this scenario, the schema and format of the data
elements are already known and fixed, so there is no need to run a crawler to discover
them. The AWS Glue job can directly select the data from the view without using the Data
Catalog. Running a crawler adds extra time and cost to the data export process.
Option D is not operationally efficient because it requires custom code and configuration to
query the databases and transform the data. An AWS Lambda function is a service that
can run code in response to events or triggers, such as Amazon EventBridge. Amazon
EventBridge is a service that can connect applications and services with event sources,
such as schedules, and route them to targets, such as Lambda functions. However, in this
scenario, using a Lambda function to query the databases and transform the data is not the
best option because it requires writing and maintaining code that uses JDBC to connect to
the SQL Server databases, retrieve the required data, convert the data to Parquet format,
and transfer the data to S3. This option also has limitations on the execution time, memory,
and concurrency of the Lambda function, which may affect the performance and reliability
of the data export process.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
AWS Glue Documentation
Working with Views in AWS Glue
Converting to Columnar Formats
Question # 11
A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?
A. Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.
B. Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
C. Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.
D. Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
Answer: A
Explanation: AWS Step Functions is a service that allows you to coordinate multiple AWS
services into serverless workflows. You can use Step Functions to create state machines
that define the sequence and logic of the tasks in your workflow. Step Functions supports
various types of tasks, such as Lambda functions, AWS Glue jobs, Amazon EMR clusters,
Amazon ECS tasks, etc. You can use Step Functions to monitor and troubleshoot your
workflows, as well as to handle errors and retries.
Using an AWS Step Functions workflow that includes a state machine to run the Lambda
function and then the AWS Glue job will meet the requirements with the least management
overhead, as it leverages the serverless and managed capabilities of Step Functions. You
do not need to write any code to orchestrate the tasks in your workflow, as you can use the
Step Functions console or the AWS Serverless Application Model (AWS SAM) to define and deploy your state machine. You also do not need to provision or manage any servers
or clusters, as Step Functions scales automatically based on the demand.
The other options are not as efficient as using an AWS Step Functions workflow. Using an
Apache Airflow workflow that is deployed on an Amazon EC2 instance or on Amazon
Elastic Kubernetes Service (Amazon EKS) will require more management overhead, as
you will need to provision, configure, and maintain the EC2 instance or the EKS cluster, as
well as the Airflow components. You will also need to write and maintain the Airflow DAGs
to orchestrate the tasks in your workflow. Using an AWS Glue workflow to run the Lambda
function and then the AWS Glue job will not work, as AWS Glue workflows only support
AWS Glue jobs and crawlers as tasks, not Lambda functions.
References:
AWS Step Functions
AWS Glue
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 6: Data Integration and Transformation, Section 6.3: AWS Step Functions
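A rough sketch of the kind of state machine option A describes, created here with Boto3; the ARNs, job name, and role are hypothetical placeholders.

import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunLambda",
    "States": {
        "RunLambda": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:prepare-data",
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            # The .sync service integration makes Step Functions wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-data-job"},
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="lambda-then-glue-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)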
Question # 12
A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company's analytics department will use the data catalog to index the data.
Which solution will meet these requirements MOST cost-effectively?
A. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
B. Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.
C. Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.
D. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
Answer: C
Explanation: The best solution to meet the requirements of creating a data catalog that
includes the IoT data, and allowing the analytics department to index the data, most cost-effectively,
is to create an Amazon Athena workgroup, explore the data that is in Amazon
S3 by using Apache Spark through Athena, and provide the Athena workgroup schema
and tables to the analytics department.
Amazon Athena is a serverless, interactive query service that makes it easy to analyze
data directly in Amazon S3 using standard SQL or Python1. Amazon Athena also supports
Apache Spark, an open-source distributed processing framework that can run large-scale
data analytics applications across clusters of servers2. You can use Athena to run Spark
code on data in Amazon S3 without having to set up, manage, or scale any
infrastructure. You can also use Athena to create and manage external tables that point to
your data in Amazon S3, and store them in an external data catalog, such as AWS Glue
Data Catalog, Amazon Athena Data Catalog, or your own Apache Hive metastore3. You
can create Athena workgroups to separate query execution and resource allocation based
on different criteria, such as users, teams, or applications4. You can share the schemas
and tables in your Athena workgroup with other users or applications, such as Amazon
QuickSight, for data visualization and analysis5. Using Athena and Spark to create a data catalog and explore the IoT data in Amazon S3 is
the most cost-effective solution, as you pay only for the queries you run or the compute you
use, and you pay nothing when the service is idle1. You also save on the operational
overhead and complexity of managing data warehouse infrastructure, as Athena and Spark
are serverless and scalable. You can also benefit from the flexibility and performance of
Athena and Spark, as they support various data formats, including JSON, and can handle
schema changes and complex queries efficiently.
Option A is not the best solution, as creating an AWS Glue Data Catalog, configuring an
AWS Glue Schema Registry, creating a new AWS Glue workload to orchestrate
the ingestion of the data that the analytics department will use into Amazon Redshift
Serverless, would incur more costs and complexity than using Athena and Spark. AWS
Glue Data Catalog is a persistent metadata store that contains table definitions, job
definitions, and other control information to help you manage your AWS Glue
components6. AWS Glue Schema Registry is a service that allows you to centrally store
and manage the schemas of your streaming data in AWS Glue Data Catalog7. AWS Glue
is a serverless data integration service that makes it easy to prepare, clean, enrich, and
move data between data stores8. Amazon Redshift Serverless is a feature of Amazon
Redshift, a fully managed data warehouse service, that allows you to run and scale
analytics without having to manage data warehouse infrastructure9. While these services
are powerful and useful for many data engineering scenarios, they are not necessary or
cost-effective for creating a data catalog and indexing the IoT data in Amazon S3. AWS
Glue Data Catalog and Schema Registry charge you based on the number of objects
stored and the number of requests made67. AWS Glue charges you based on the compute
time and the data processed by your ETL jobs8. Amazon Redshift Serverless charges you
based on the amount of data scanned by your queries and the compute time used by your
workloads9. These costs can add up quickly, especially if you have large volumes of IoT
data and frequent schema changes. Moreover, using AWS Glue and Amazon Redshift
Serverless would introduce additional latency and complexity, as you would have to ingest
the data from Amazon S3 to Amazon Redshift Serverless, and then query it from there,
instead of querying it directly from Amazon S3 using Athena and Spark.
Option B is not the best solution, as creating an Amazon Redshift provisioned cluster,
creating an Amazon Redshift Spectrum database for the analytics department to explore
the data that is in Amazon S3, and creating Redshift stored procedures to load the data
into Amazon Redshift, would incur more costs and complexity than using Athena and
Spark. Amazon Redshift provisioned clusters are clusters that you create and manage by
specifying the number and type of nodes, and the amount of storage and compute
capacity10. Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to
query and join data across your data warehouse and your data lake using standard
SQL11. Redshift stored procedures are SQL statements that you can define and store in
Amazon Redshift, and then call them by using the CALL command12. While these features are powerful and useful for many data warehousing scenarios, they are not necessary or
cost-effective for creating a data catalog and indexing the IoT data in Amazon S3. Amazon
Redshift provisioned clusters charge you based on the node type, the number of nodes,
and the duration of the cluster10. Amazon Redshift Spectrum charges you based on the
amount of data scanned by your queries11. These costs can add up quickly, especially if
you have large volumes of IoT data and frequent schema changes. Moreover, using
Amazon Redshift provisioned clusters and Spectrum would introduce additional latency
and complexity, as you would have to provision and manage the cluster, create an external
schema and database for the data in Amazon S3, and load the data into the cluster using
stored procedures, instead of querying it directly from Amazon S3 using Athena and Spark.
Option D is not the best solution, as creating an AWS Glue Data Catalog, configuring an
AWS Glue Schema Registry, creating AWS Lambda user defined functions (UDFs) by
using the Amazon Redshift Data API, and creating an AWS Step Functions job to
orchestrate the ingestion of the data that the analytics department will use into Amazon
Redshift Serverless, would incur more costs and complexity than using Athena and
Spark. AWS Lambda is a serverless compute service that lets you run code without
provisioning or managing servers13. AWS Lambda UDFs are Lambda functions that you
can invoke from within an Amazon Redshift query. Amazon Redshift Data API is a service
that allows you to run SQL statements on Amazon Redshift clusters using HTTP requests,
without needing a persistent connection. AWS Step Functions is a service that lets you
coordinate multiple AWS services into serverless workflows. While these services are
powerful and useful for many data engineering scenarios, they are not necessary or cost-effective
for creating a data catalog and indexing the IoT data in Amazon S3. AWS Glue
Data Catalog and Schema Registry charge you based on the number of objects stored and
the number of requests made67. AWS Lambda charges you based on the number of
requests and the duration of your functions13. Amazon Redshift Serverless charges you
based on the amount of data scanned by your queries and the compute time used by your
workloads9. AWS Step Functions charges you based on the number of state transitions in
your workflows. These costs can add up quickly, especially if you have large volumes of
IoT data and frequent schema changes. Moreover, using AWS Glue, AWS Lambda,
Amazon Redshift Data API, and AWS Step Functions would introduce additional latency
and complexity, as you would have to create and invoke Lambda functions to ingest the
data from Amazon S3 to Amazon Redshift Serverless using the Data API, and coordinate
the ingestion process using Step Functions, instead of querying it directly from Amazon S3
using Athena and Spark.
References:
What is Amazon Athena?
Apache Spark on Amazon Athena
Creating tables, updating the schema, and adding new partitions in the Data
Catalog from AWS Glue ETL jobs
Managing Athena workgroups
Using Amazon QuickSight to visualize data in Amazon Athena
AWS Glue Data Catalog
AWS Glue Schema Registry
What is AWS Glue?
Amazon Redshift Serverless
Amazon Redshift provisioned clusters
Querying external data using Amazon Redshift Spectrum
Using stored procedures in Amazon Redshift
What is AWS Lambda?
[Creating and using AWS Lambda UDFs]
[Using the Amazon Redshift Data API]
[What is AWS Step Functions?]
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 13
A company uses Amazon Athena to run SQL queries for extract, transform, and load (ETL) tasks by using Create Table As Select (CTAS). The company must use Apache Spark instead of SQL to generate analytics.
Which solution will give the company the ability to use Spark to access Athena?
A. Athena query settings
B. Athena workgroup
C. Athena data source
D. Athena query editor
Answer: C
Explanation: Athena data source is a solution that allows you to use Spark to access
Athena by using the Athena JDBC driver and the Spark SQL interface. You can use the
Athena data source to create Spark DataFrames from Athena tables, run SQL queries on
the DataFrames, and write the results back to Athena. The Athena data source supports
various data formats, such as CSV, JSON, ORC, and Parquet, and also supports
partitioned and bucketed tables. The Athena data source is a cost-effective and scalable
way to use Spark to access Athena, as it does not require any additional infrastructure or
services, and you only pay for the data scanned by Athena.
The other options are not solutions that give the company the ability to use Spark to access
Athena. Option A, Athena query settings, is a feature that allows you to configure various
parameters for your Athena queries, such as the output location, the encryption settings,
the query timeout, and the workgroup. Option B, Athena workgroup, is a feature that allows
you to isolate and manage your Athena queries and resources, such as the query history,
the query notifications, the query concurrency, and the query cost. Option D, Athena query
editor, is a feature that allows you to write and run SQL queries on Athena using the web
console or the API. None of these options enable you to use Spark instead of SQL to
generate analytics on Athena.
References:
Using Apache Spark in Amazon Athena
Athena JDBC Driver
Spark SQL
Athena query settings
[Athena workgroups]
[Athena query editor]
Question # 14
A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
B. Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
C. Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.
D. Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.
Answer: B
Explanation: This solution will meet the requirements with the least operational overhead
because it uses the AWS Glue Data Catalog as the central metadata repository for data
sources that run in the AWS Cloud. The AWS Glue Data Catalog is a fully managed
service that provides a unified view of your data assets across AWS and on-premises data sources. It stores the metadata of your data in tables, partitions, and columns, and enables
you to access and query your data using various AWS services, such as Amazon Athena,
Amazon EMR, and Amazon Redshift Spectrum. You can use AWS Glue crawlers to
connect to multiple data stores, such as Amazon RDS, Amazon Redshift, and Amazon S3,
and to update the Data Catalog with metadata changes. AWS Glue crawlers can
automatically discover the schema and partition structure of your data, and create or
update the corresponding tables in the Data Catalog. You can schedule the crawlers to run
periodically to update the metadata catalog, and configure them to detect changes to the
source metadata, such as new columns, tables, or partitions12.
The other options are not optimal for the following reasons:
A. Use Amazon Aurora as the data catalog. Create AWS Lambda functions that
will connect to the data catalog. Configure the Lambda functions to gather the
metadata information from multiple sources and to update the Aurora data catalog.
Schedule the Lambda functions to run periodically. This option is not
recommended, as it would require more operational overhead to create and
manage an Amazon Aurora database as the data catalog, and to write and
maintain AWS Lambda functions to gather and update the metadata information
from multiple sources. Moreover, this option would not leverage the benefits of the
AWS Glue Data Catalog, such as data cataloging, data transformation, and data
governance.
C. Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions
that will connect to the data catalog. Configure the Lambda functions to gather the
metadata information from multiple sources and to update the DynamoDB data
catalog. Schedule the Lambda functions to run periodically. This option is also not
recommended, as it would require more operational overhead to create and
manage an Amazon DynamoDB table as the data catalog, and to write and
maintain AWS Lambda functions to gather and update the metadata information
from multiple sources. Moreover, this option would not leverage the benefits of the
AWS Glue Data Catalog, such as data cataloging, data transformation, and data
governance.
D. Use the AWS Glue Data Catalog as the central metadata repository. Extract the
schema for Amazon RDS and Amazon Redshift sources, and build the Data
Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema
and to automatically update the Data Catalog. This option is not optimal, as it
would require more manual effort to extract the schema for Amazon RDS and
Amazon Redshift sources, and to build the Data Catalog. This option would not
take advantage of the AWS Glue crawlers’ ability to automatically discover the
schema and partition structure of your data from various data sources, and to
create or update the corresponding tables in the Data Catalog.
References:
1: AWS Glue Data Catalog
2: AWS Glue Crawlers
Amazon Aurora
AWS Lambda
Amazon DynamoDB
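As a hedged sketch of option B, a crawler that covers both the S3 data and a JDBC source can be created and scheduled with Boto3; the names, role, paths, and cron expression below are hypothetical.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="datalake-metadata-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="central_catalog",
    Targets={
        "S3Targets": [{"Path": "s3://datalake-bucket/raw/"}],
        "JdbcTargets": [{"ConnectionName": "rds-connection", "Path": "salesdb/%"}],
    },
    # Run every morning at 06:00 UTC so the Data Catalog picks up new or changed metadata.
    Schedule="cron(0 6 * * ? *)",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)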
Question # 15
A manufacturing company wants to collect data from sensors. A data engineer needs to implement a solution that ingests sensor data in near real time.
The solution must store the data to a persistent data store. The solution must store the data in nested JSON format. The company must have the ability to query from the data store with a latency of less than 10 milliseconds.
Which solution will meet these requirements with the LEAST operational overhead?
A. Use a self-hosted Apache Kafka cluster to capture the sensor data. Store the data in Amazon S3 for querying.
B. Use AWS Lambda to process the sensor data. Store the data in Amazon S3 for querying.
C. Use Amazon Kinesis Data Streams to capture the sensor data. Store the data in Amazon DynamoDB for querying.
D. Use Amazon Simple Queue Service (Amazon SQS) to buffer incoming sensor data. Use AWS Glue to store the data in Amazon RDS for querying.
Answer: C
Explanation: Amazon Kinesis Data Streams is a service that enables you to collect,
process, and analyze streaming data in real time. You can use Kinesis Data Streams to
capture sensor data from various sources, such as IoT devices, web applications, or mobile
apps. You can create data streams that can scale up to handle any amount of data from
thousands of producers. You can also use the Kinesis Client Library (KCL) or the Kinesis
Data Streams API to write applications that process and analyze the data in the streams1.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and
predictable performance with seamless scalability. You can use DynamoDB to store the
sensor data in nested JSON format, as DynamoDB supports document data types, such as
lists and maps. You can also use DynamoDB to query the data with a latency of less than
10 milliseconds, as DynamoDB offers single-digit millisecond performance for any scale of
data. You can use the DynamoDB API or the AWS SDKs to perform queries on the data,
such as using key-value lookups, scans, or queries2.
The solution that meets the requirements with the least operational overhead is to use
Amazon Kinesis Data Streams to capture the sensor data and store the data in Amazon
DynamoDB for querying. This solution has the following advantages:
It does not require you to provision, manage, or scale any servers, clusters, or
queues, as Kinesis Data Streams and DynamoDB are fully managed services that
handle all the infrastructure for you. This reduces the operational complexity and
cost of running your solution.
It allows you to ingest sensor data in near real time, as Kinesis Data Streams can
capture data records as they are produced and deliver them to your applications
within seconds. You can also use Kinesis Data Firehose to load the data from the
streams to DynamoDB automatically and continuously3.
It allows you to store the data in nested JSON format, as DynamoDB supports
document data types, such as lists and maps. You can also use DynamoDB Streams to capture changes in the data and trigger actions, such as sending
notifications or updating other databases.
It allows you to query the data with a latency of less than 10 milliseconds, as
DynamoDB offers single-digit millisecond performance for any scale of data. You
can also use DynamoDB Accelerator (DAX) to improve the read performance by
caching frequently accessed data.
Option A is incorrect because it suggests using a self-hosted Apache Kafka cluster to
capture the sensor data and store the data in Amazon S3 for querying. This solution has
the following disadvantages:
It requires you to provision, manage, and scale your own Kafka cluster, either on
EC2 instances or on-premises servers. This increases the operational complexity
and cost of running your solution.
It does not allow you to query the data with a latency of less than 10 milliseconds,
as Amazon S3 is an object storage service that is not optimized for low-latency
queries. You need to use another service, such as Amazon Athena or Amazon
Redshift Spectrum, to query the data in S3, which may incur additional costs and
latency.
Option B is incorrect because it suggests using AWS Lambda to process the sensor data
and store the data in Amazon S3 for querying. This solution has the following
disadvantages:
It does not allow you to ingest sensor data in near real time, as Lambda is a
serverless compute service that runs code in response to events. You need to use
another service, such as API Gateway or Kinesis Data Streams, to trigger Lambda
functions with sensor data, which may add extra latency and complexity to your
solution.
It does not allow you to query the data with a latency of less than 10 milliseconds,
as Amazon S3 is an object storage service that is not optimized for low-latency
queries. You need to use another service, such as Amazon Athena or Amazon
Redshift Spectrum, to query the data in S3, which may incur additional costs and
latency.
Option D is incorrect because it suggests using Amazon Simple Queue Service (Amazon
SQS) to buffer incoming sensor data and use AWS Glue to store the data in Amazon RDS
for querying. This solution has the following disadvantages:
It does not allow you to ingest sensor data in near real time, as Amazon SQS is a
message queue service that delivers messages in a best-effort manner. You need
to use another service, such as Lambda or EC2, to poll the messages from the
queue and process them, which may add extra latency and complexity to your
solution.
It does not allow you to store the data in nested JSON format, as Amazon RDS is
a relational database service that supports structured data types, such as tables
and columns. You need to use another service, such as AWS Glue, to transform
the data from JSON to relational format, which may add extra cost and overhead
to your solution.
References:
1: Amazon Kinesis Data Streams - Features
2: Amazon DynamoDB - Features
3: Using AWS Lambda with Amazon Kinesis - AWS Lambda
[4]: Capturing Table Activity with DynamoDB Streams - Amazon DynamoDB
[5]: Amazon DynamoDB Accelerator (DAX) - Features
[6]: Amazon S3 - Features
[7]: AWS Lambda - Features
[8]: Amazon Simple Queue Service - Features
[9]: Amazon Relational Database Service - Features
[10]: Working with JSON in Amazon RDS - Amazon Relational Database Service
[11]: AWS Glue - Features
Question # 16
A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that would cause data loss during the migration to the upgraded storage. Which solution will meet these requirements with the LEAST operational overhead?
A. Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances.
B. Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes.
C. Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.
D. Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes.
Answer: C
Explanation: Changing the volume type of the existing gp2 volumes to gp3 is the easiest
and fastest way to migrate to the new storage type without any downtime or data loss. You
can use the AWS Management Console, the AWS CLI, or the Amazon EC2 API to modify
the volume type, size, IOPS, and throughput of your gp2 volumes. The modification takes
effect immediately, and you can monitor the progress of the modification using
CloudWatch. The other options are either more complex or require additional steps, such
as creating snapshots, transferring data, or attaching new volumes, which can increase the
operational overhead and the risk of errors.
References:
Migrating Amazon EBS volumes from gp2 to gp3 and save up to 20% on
costs (Section: How to migrate from gp2 to gp3)
Switching from gp2 Volumes to gp3 Volumes to Lower AWS EBS Costs (Section:
How to Switch from GP2 Volumes to GP3 Volumes)
Modifying the volume type, IOPS, or size of an EBS volume - Amazon Elastic
Compute Cloud (Section: Modifying the volume type)
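To make the in-place modification concrete, here is a minimal boto3 sketch; the volume ID and the IOPS and throughput values are hypothetical examples, not values taken from the question.

import boto3

ec2 = boto3.client("ec2")

# Hypothetical volume ID; gp3 baselines are 3,000 IOPS and 125 MiB/s throughput.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",
    VolumeType="gp3",
    Iops=3000,
    Throughput=125,
)

# The modification runs in place while the volume stays attached; you can poll
# its progress instead of detaching volumes or creating snapshots.
state = ec2.describe_volumes_modifications(
    VolumeIds=["vol-0123456789abcdef0"]
)["VolumesModifications"][0]["ModificationState"]
print(state)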
Question # 17
A company created an extract, transform, and load (ETL) data pipeline in AWS Glue. A data engineer must crawl a table that is in Microsoft SQL Server. The data engineer needs to extract, transform, and load the output of the crawl to an Amazon S3 bucket. The data engineer also must orchestrate the data pipeline. Which AWS service or feature will meet these requirements MOST cost-effectively?
A. AWS Step Functions
B. AWS Glue workflows
C. AWS Glue Studio
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
Answer: B
Explanation: AWS Glue workflows are a cost-effective way to orchestrate complex ETL
jobs that involve multiple crawlers, jobs, and triggers. AWS Glue workflows allow you to
visually monitor the progress and dependencies of your ETL tasks, and automatically
handle errors and retries. AWS Glue workflows also integrate with other AWS services,
such as Amazon S3, Amazon Redshift, and AWS Lambda, among others, enabling you to
leverage these services for your data processing workflows. AWS Glue workflows are
serverless, meaning you only pay for the resources you use, and you don’t have to manage
any infrastructure.
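As a rough sketch of what such an orchestration could look like with boto3, the snippet below creates a workflow, an on-demand trigger that starts the SQL Server crawler, and a conditional trigger that runs the ETL job once the crawl succeeds. The workflow, crawler, and job names are hypothetical.

import boto3

glue = boto3.client("glue")

# Hypothetical names for the crawler (SQL Server source) and the ETL job.
WORKFLOW = "sqlserver-to-s3"
CRAWLER = "sqlserver-crawler"
JOB = "sqlserver-to-s3-etl"

glue.create_workflow(Name=WORKFLOW)

# Start the crawler when the workflow is run on demand.
glue.create_trigger(
    Name="start-crawl",
    WorkflowName=WORKFLOW,
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": CRAWLER}],
)

# Run the ETL job only after the crawler finishes successfully.
glue.create_trigger(
    Name="run-etl",
    WorkflowName=WORKFLOW,
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": CRAWLER,
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": JOB}],
)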
AWS Step Functions, AWS Glue Studio, and Amazon MWAA are also possible options for
orchestrating ETL pipelines, but they have some drawbacks compared to AWS Glue
workflows. AWS Step Functions is a serverless function orchestrator that can handle
different types of data processing, such as real-time, batch, and stream processing.
However, AWS Step Functions requires you to write code to define your state machines,
which can be complex and error-prone. AWS Step Functions also charges you for every
state transition, which can add up quickly for large-scale ETL pipelines.
AWS Glue Studio is a graphical interface that allows you to create and run AWS Glue ETL
jobs without writing code. AWS Glue Studio simplifies the process of building, debugging,
and monitoring your ETL jobs, and provides a range of pre-built transformations and
connectors. However, AWS Glue Studio does not support workflows, meaning you cannot
orchestrate multiple ETL jobs or crawlers with dependencies and triggers. AWS Glue
Studio also does not support streaming data sources or targets, which limits its use cases
for real-time data processing.
Amazon MWAA is a fully managed service that makes it easy to run open-source versions
of Apache Airflow on AWS and build workflows to run your ETL jobs and data pipelines.
Amazon MWAA provides a familiar and flexible environment for data engineers who are
familiar with Apache Airflow, and integrates with a range of AWS services such as Amazon
EMR, AWS Glue, and AWS Step Functions. However, Amazon MWAA is not serverless,
meaning you have to provision and pay for the resources you need, regardless of your
usage. Amazon MWAA also requires you to write code to define your DAGs, which can be
challenging and time-consuming for complex ETL pipelines.
References:
AWS Glue Workflows
AWS Step Functions
AWS Glue Studio
Amazon MWAA
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 18
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?
A. Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
B. Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
C. Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
D. Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
Answer: C
Explanation:
An open source data lake table format, such as Apache Iceberg, Apache Hudi, or Delta Lake, is
a cost-effective way to perform a change data capture (CDC) operation on semi-structured
data stored in Amazon S3. An open source data lake format allows you to query data
directly from S3 using standard SQL, without the need to move or copy data to another
service. An open source data lake format also supports schema evolution, meaning it can
handle changes in the data structure over time. An open source data lake format also
supports upserts, meaning it can insert new data and update existing data in the same
operation, using a merge command. This way, you can efficiently capture the changes from
the data source and apply them to the S3 data lake, without duplicating or losing any data.
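The sketch below illustrates the merge-based upsert with Delta Lake from PySpark; it assumes a Spark environment where Delta Lake is available (for example, an AWS Glue job configured for Delta), and the S3 paths, join key, and view names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load today's full snapshot (the daily JSON file from the data source).
spark.read.json("s3://example-bucket/snapshots/2024-01-01/") \
    .createOrReplaceTempView("snapshot")

# Upsert the snapshot into the Delta table that backs the S3 data lake:
# matching keys are updated, new keys are inserted, in a single operation.
spark.sql("""
    MERGE INTO delta.`s3://example-bucket/lake/orders/` AS target
    USING snapshot AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")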
The other options are not as cost-effective as using an open source data lake format, as
they involve additional steps or costs. Option A requires you to create and maintain an
AWS Lambda function, which can be complex and error-prone. AWS Lambda also has
some limits on the execution time, memory, and concurrency, which can affect the
performance and reliability of the CDC operation. Option B and D require you to ingest the
data into a relational database service, such as Amazon RDS or Amazon Aurora, which
can be expensive and unnecessary for semi-structured data. AWS Database Migration
Service (AWS DMS) can write the changed data to the data lake, but it also charges you for
the data replication and transfer. Additionally, AWS DMS does not support JSON as a
source data type, so you would need to convert the data to a supported format before using
AWS DMS.
References:
What is a data lake?
Choosing a data format for your data lake
Using the MERGE INTO command in Delta Lake
[AWS Lambda quotas]
[AWS Database Migration Service quotas]
Question # 19
A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant. Which solution will meet these requirements with the LEAST operational overhead?
A. Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
B. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
C. Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.
Answer: D
Explanation: This solution meets the requirements of managing the ingestion of real-time
streaming data into AWS and performing real-time analytics on the incoming streaming
data with the least operational overhead. Amazon Managed Service for Apache Flink is a
fully managed service that allows you to run Apache Flink applications without having to
manage any infrastructure or clusters. Apache Flink is a framework for stateful stream
processing that supports various types of aggregations, such as tumbling, sliding, and
session windows, over streaming data. By using Amazon Managed Service for Apache
Flink, you can easily connect to Amazon Kinesis Data Streams as the source and sink of
your streaming data, and perform time-based analytics over a window of up to 30 minutes.
This solution is also highly fault tolerant, as Amazon Managed Service for Apache Flink
automatically scales, monitors, and restarts your Flink applications in case of failures.
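A rough PyFlink sketch of a 30-minute tumbling-window aggregation is shown below. The stream name is hypothetical, and the exact Kinesis connector options depend on the Flink and connector versions bundled with the managed service, so treat the WITH clause as illustrative.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by a Kinesis data stream (hypothetical stream name).
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        sensor_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '1' MINUTE
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'sensor-stream',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Time-based aggregation over a 30-minute tumbling window.
result = t_env.sql_query("""
    SELECT
        sensor_id,
        TUMBLE_START(event_time, INTERVAL '30' MINUTE) AS window_start,
        AVG(temperature) AS avg_temperature
    FROM sensor_readings
    GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '30' MINUTE)
""")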
References:
Amazon Managed Service for Apache Flink
Apache Flink
Window Aggregations in Flink
Question # 20
A company's data engineer needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints. The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size. Which solution will meet these requirements?
A. Keep using the EVEN distribution style for all tables. Specify primary and foreign keys for all tables.
B. Use the ALL distribution style for large tables. Specify primary and foreign keys for all tables.
C. Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.
D. Specify a combination of distribution, sort, and partition keys for all tables.
Answer: C
Explanation: This solution meets the requirements of optimizing the performance of table
SQL queries without increasing the size of the cluster. By using the ALL distribution style
for rarely updated small tables, you can ensure that the entire table is copied to every node
in the cluster, which eliminates the need for data redistribution during joins. This can
improve query performance significantly, especially for frequently joined dimension tables.
However, using the ALL distribution style also increases the storage space and the load
time, so it is only suitable for small tables that are not updated frequently or extensively. By
specifying primary and foreign keys for all tables, you can help the query optimizer to
generate better query plans and avoid unnecessary scans or joins. You can also use the
AUTO distribution style to let Amazon Redshift choose the optimal distribution style based
on the table size and the query patterns.
References:
Choose the best distribution style
Distribution styles
Working with data distribution styles
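For example, a small, rarely updated dimension table could be created with the ALL distribution style as sketched below using the boto3 Redshift Data API; the cluster, database, user, and table names are hypothetical.

import boto3

redshift_data = boto3.client("redshift-data")

# DISTSTYLE ALL copies the small, rarely updated table to every node, so joins
# against it do not require data redistribution.
ddl = """
    CREATE TABLE dim_region (
        region_id INTEGER NOT NULL,
        region_name VARCHAR(64),
        PRIMARY KEY (region_id)
    )
    DISTSTYLE ALL
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)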
Question # 21
A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models. The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio. Which change should the engineer make to gain access to SageMaker Studio?
A. Add the AWSGlueServiceRole managed policy to the data engineer's IAM user.
B. Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.
C. Add the AmazonSageMakerFullAccess managed policy to the data engineer's IAM user.
D. Add a policy to the data engineer's IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.
Answer: B
Explanation: This solution meets the requirement of gaining access to SageMaker Studio
to use AWS Glue interactive sessions. AWS Glue interactive sessions provide on-demand,
serverless Apache Spark environments that you can use from notebooks, including SageMaker
Studio, to prepare data. To use AWS Glue interactive sessions, the data engineer’s IAM user needs permissions to
assume the AWS Glue service role and the SageMaker execution role. By adding a policy
to the data engineer’s IAM user that includes the sts:AssumeRole action for the AWS Glue
and SageMaker service principals in the trust policy, the data engineer can grant these
permissions and avoid the access denied error. The other options are not sufficient or
necessary to resolve the error.
References:
Get started with data integration from Amazon S3 to Amazon Redshift using AWS
Glue interactive sessions
Troubleshoot Errors - Amazon SageMaker
AccessDeniedException on sagemaker:CreateDomain in AWS SageMaker Studio,
despite having SageMakerFullAccess
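As a loose illustration of the trust-policy idea behind option B, the sketch below updates a shared execution role so that both the AWS Glue and SageMaker service principals can assume it; the role name is hypothetical, and the data engineer's IAM user would additionally need permission to pass or use this role.

import json
import boto3

iam = boto3.client("iam")

# Hypothetical execution role shared by SageMaker Studio and Glue interactive
# sessions. Its trust policy must allow both service principals to assume it.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["glue.amazonaws.com", "sagemaker.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.update_assume_role_policy(
    RoleName="SageMakerGlueExecutionRole",
    PolicyDocument=json.dumps(trust_policy),
)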
Question # 22
A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns. The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs. Which solution will meet these requirements with the LEAST operational overhead?
A. Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs.
B. Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data.
C. Use S3 Intelligent-Tiering. Activate the Deep Archive Access tier.
D. Use S3 Intelligent-Tiering. Use the default access tier.
Answer: D
Explanation: S3 Intelligent-Tiering is a storage class that automatically moves objects
between access tiers based on changing access patterns. With the default configuration,
objects move between the Frequent Access and Infrequent Access tiers. Objects in the Frequent
Access tier have the same performance and availability as S3 Standard, while objects in
the Infrequent Access tier have the same performance and availability as S3 Standard-IA.
S3 Intelligent-Tiering monitors the access patterns of each object and moves them
between the tiers accordingly, without any operational overhead or retrieval fees. This
solution can optimize S3 storage costs for data with unpredictable and variable access
patterns, while ensuring millisecond latency for data retrieval. The other solutions are not
optimal or relevant for this requirement. Using S3 Storage Lens standard metrics and activity metrics can provide insights into the storage usage and access patterns, but they
do not automate the data movement between storage classes. Creating S3 Lifecycle
policies for the S3 buckets can move objects to more cost-optimized storage classes, but
they require manual configuration and maintenance, and they may incur retrieval fees for
data that is accessed unexpectedly. Activating the Deep Archive Access tier for S3
Intelligent-Tiering can further reduce the storage costs for data that is rarely accessed, but
it also increases the retrieval time to 12 hours, which does not meet the requirement of
millisecond latency.
References:
S3 Intelligent-Tiering
S3 Storage Lens
S3 Lifecycle policies
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide]
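Two illustrative ways to adopt S3 Intelligent-Tiering are sketched below with boto3: writing new objects directly to the storage class, and transitioning existing objects with a lifecycle rule. The bucket name and key are hypothetical.

import boto3

s3 = boto3.client("s3")

# Write new objects directly to S3 Intelligent-Tiering.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="datasets/events/2024/01/01/part-0000.parquet",
    Body=b"...",
    StorageClass="INTELLIGENT_TIERING",
)

# Transition existing S3 Standard objects with a single lifecycle rule.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)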
Question # 23
A company uses AWS Step Functions to orchestrate a data pipeline. The pipeline consists of Amazon EMR jobs that ingest data from data sources and store the data in an Amazon S3 bucket. The pipeline also includes EMR jobs that load the data to Amazon Redshift. The company's cloud infrastructure team manually built a Step Functions state machine. The cloud infrastructure team launched an EMR cluster into a VPC to support the EMR jobs. However, the deployed Step Functions state machine is not able to run the EMR jobs. Which combination of steps should the company take to identify the reason the Step Functions state machine is not able to run the EMR jobs? (Choose two.)
A. Use AWS CloudFormation to automate the Step Functions state machine deployment. Create a step to pause the state machine during the EMR jobs that fail. Configure the step to wait for a human user to send approval through an email message. Include details of the EMR task in the email message for further analysis.
B. Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.
C. Check for entries in Amazon CloudWatch for the newly created EMR cluster. Change the AWS Step Functions state machine code to use Amazon EMR on EKS. Change the IAM access policies and the security group configuration for the Step Functions state machine code to reflect inclusion of Amazon Elastic Kubernetes Service (Amazon EKS).
D. Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.
E. Check the retry scenarios that the company configured for the EMR jobs. Increase the number of seconds in the interval between each EMR task. Validate that each fallback state has the appropriate catch for each decision state. Configure an Amazon Simple Notification Service (Amazon SNS) topic to store the error messages.
Answer: B,D
Explanation: To identify the reason why the Step Functions state machine is not able to
run the EMR jobs, the company should take the following steps:
Verify that the Step Functions state machine code has all IAM permissions that are
necessary to create and run the EMR jobs. The state machine code should have
an IAM role that allows it to invoke the EMR APIs, such as RunJobFlow,
AddJobFlowSteps, and DescribeStep. The state machine code should also have
IAM permissions to access the Amazon S3 buckets that the EMR jobs use as input
and output locations. The company can use Access Analyzer for S3 to check the
access policies and permissions of the S3 buckets12. Therefore, option B is
correct.
Query the flow logs for the VPC. The flow logs can provide information about the
network traffic to and from the EMR cluster that is launched in the VPC. The
company can use the flow logs to determine whether the traffic that originates from
the EMR cluster can successfully reach the data providers, such as Amazon RDS,
Amazon Redshift, or other external sources. The company can also determine
whether any security group that might be attached to the EMR cluster allows
connections to the data source servers on the informed ports. The company can
use Amazon VPC Flow Logs or Amazon CloudWatch Logs Insights to query the
flow logs3. Therefore, option D is correct (see the query sketch below).
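A sketch of such a flow-log check with CloudWatch Logs Insights is shown below; the log group name and the destination port (1433, for a SQL Server data source) are hypothetical, and the flow logs are assumed to be delivered to CloudWatch Logs.

import time
import boto3

logs = boto3.client("logs")

# Hypothetical log group that receives the VPC flow logs. The query surfaces
# rejected traffic from the EMR cluster toward the data source port.
query_id = logs.start_query(
    logGroupName="/vpc/flow-logs",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, srcAddr, dstAddr, dstPort, action "
        "| filter action = 'REJECT' and dstPort = 1433 "
        "| sort @timestamp desc | limit 50"
    ),
)["queryId"]

time.sleep(5)  # Logs Insights queries run asynchronously.
results = logs.get_query_results(queryId=query_id)
print(results["status"], results["results"])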
Option A is incorrect because it suggests using AWS CloudFormation to automate the Step
Functions state machine deployment. While this is a good practice to ensure consistency
and repeatability of the deployment, it does not help to identify the reason why the state
machine is not able to run the EMR jobs. Moreover, creating a step to pause the state
machine during the EMR jobs that fail and wait for a human user to send approval through
an email message is not a reliable way to troubleshoot the issue. The company should use
the Step Functions console or API to monitor the execution history and status of the state
machine, and use Amazon CloudWatch to view the logs and metrics of the EMR jobs.
Option C is incorrect because it suggests changing the AWS Step Functions state machine
code to use Amazon EMR on EKS. Amazon EMR on EKS is a service that allows you to
run EMR jobs on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. While this
service has some benefits, such as lower cost and faster execution time, it does not
support all the features and integrations that EMR on EC2 does, such as EMR Notebooks,
EMR Studio, and EMRFS. Therefore, changing the state machine code to use EMR on
EKS may not be compatible with the existing data pipeline and may introduce new issues.
Option E is incorrect because it suggests checking the retry scenarios that the company
configured for the EMR jobs. While this is a good practice to handle transient failures and
errors, it does not help to identify the root cause of why the state machine is not able to run
the EMR jobs. Moreover, increasing the number of seconds in the interval between each
EMR task may not improve the success rate of the jobs, and may increase the execution
time and cost of the state machine. Configuring an Amazon SNS topic to store the error
messages may help to notify the company of any failures, but it does not provide enough
information to troubleshoot the issue.
References:
1: Manage an Amazon EMR Job - AWS Step Functions
2: Access Analyzer for S3 - Amazon Simple Storage Service
3: Working with Amazon EMR and VPC Flow Logs - Amazon EMR
Question # 24
A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks. The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster. The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster. Which solution will meet these requirements?
A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.
Answer: A
Explanation: Redshift data sharing is a feature that enables you to share live data across
different Redshift clusters without the need to copy or move data. Data sharing provides
secure and governed access to data, while preserving the performance and concurrency
benefits of Redshift. By setting up the sales team BI cluster as a consumer of the ETL
cluster, the company can share the ETL cluster data with the sales team without
interrupting the critical analysis tasks. The solution also minimizes the usage of the
computing resources of the ETL cluster, as the data sharing does not consume any storage space or compute resources from the producer cluster. The other options are either not
feasible or not efficient. Creating materialized views or database views would require the
sales team to have direct access to the ETL cluster, which could interfere with the critical
analysis tasks. Unloading a copy of the data from the ETL cluster to an Amazon S3 bucket
every week would introduce additional latency and cost, as well as create data
inconsistency issues.
References:
Sharing data across Amazon Redshift clusters
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
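The data sharing setup itself is a handful of SQL statements; the sketch below runs them through the boto3 Redshift Data API. The cluster identifiers, database names, share name, and namespace placeholders are hypothetical.

import boto3

redshift_data = boto3.client("redshift-data")

# On the producer (ETL) cluster: create the datashare and grant it to the
# consumer namespace. The namespace values below are placeholders.
producer_sql = [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA public",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<bi-cluster-namespace>'",
]
for sql in producer_sql:
    redshift_data.execute_statement(
        ClusterIdentifier="etl-cluster", Database="etl", DbUser="admin", Sql=sql
    )

# On the consumer (BI) cluster: expose the shared data as a local database so
# the sales team can join it with their own tables, using their own compute.
redshift_data.execute_statement(
    ClusterIdentifier="bi-cluster",
    Database="sales",
    DbUser="admin",
    Sql="CREATE DATABASE etl_shared FROM DATASHARE sales_share "
        "OF NAMESPACE '<etl-cluster-namespace>'",
)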
Question # 25
A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions. The data engineer requires a less manual way to update the Lambda functions. Which solution will meet this requirement?
A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.
Answer: B
Explanation: Lambda layers are a way to share code and dependencies across multiple
Lambda functions. By packaging the custom Python scripts into Lambda layers, the data
engineer can update the scripts in one place and have them automatically applied to all the
Lambda functions that use the layer. This reduces the manual effort and ensures
consistency across the Lambda functions. The other options are either not feasible or not
efficient. Storing a pointer to the custom Python scripts in the execution context object or in
environment variables would require the Lambda functions to download the scripts from Amazon S3 every time they are invoked, which would increase latency and cost. Assigning
the same alias to each Lambda function would not help with updating the Python scripts, as
the alias only points to a specific version of the Lambda function code.
References:
AWS Lambda layers
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 3: Data Ingestion and Transformation, Section 3.4: AWS Lambda
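A minimal boto3 sketch of the layer workflow is shown below; the layer name, artifact location, runtime, and function names are hypothetical. The shared scripts are assumed to be zipped under a top-level python/ directory so Lambda adds them to the import path.

import boto3

lambda_client = boto3.client("lambda")

# Publish the shared formatting code as a new layer version.
layer = lambda_client.publish_layer_version(
    LayerName="data-formatting",
    Content={
        "S3Bucket": "example-artifacts-bucket",
        "S3Key": "layers/data-formatting.zip",
    },
    CompatibleRuntimes=["python3.12"],
)

# Point each function at the new layer version; when the scripts change again,
# only the layer is republished and the functions are re-pointed.
for function_name in ["format-orders", "format-clickstream"]:
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer["LayerVersionArn"]],
    )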
Feedback That Matters: Reviews of Our Amazon Data-Engineer-Associate Dumps
Natalia Reyes
May 16, 2026
Passed my Data Engineer Associate exam with an 89%! The practice questions from MyCertsHub were close to the real ones. Helped me focus on key areas without wasting time.
Veronica Kelly
May 15, 2026
Very practical content. The labs and case studies made things clear for me. I was able to connect everything to my daily work. Solid prep course.
Demi Davis
May 15, 2026
Scored 92% on the exam. MyCertsHub made a big difference. I liked how the material was direct and to the point, no fluff. The mock exams helped me find my weak spots early.
Matilda Thomas
May 14, 2026
Good value for the price. The support team answered my questions quickly, and the explanations after each quiz helped me understand better. Satisfied with the outcome.
Ella Hill
May 14, 2026
I didn’t have much time to prepare, but the course structure helped me stay focused. Ended up scoring 85%. Would recommend for anyone looking to get certified efficiently.