Databricks Certified Data Engineer Professional Exam
661 Reviews
Exam Code: Databricks-Certified-Professional-Data-Engineer
Exam Name: Databricks Certified Data Engineer Professional Exam
Questions: 202 Questions & Answers With Explanations
Update Date: May 13, 2026
Price:
Was $81, Today $45
Was $99, Today $55
Was $117, Today $65
Why Should You Prepare For Your Databricks Certified Data Engineer Professional Exam With MyCertsHub?
At MyCertsHub, we go beyond standard study material. Our platform provides authentic Databricks Databricks-Certified-Professional-Data-Engineer Exam Dumps, detailed exam guides, and reliable practice exams that mirror the actual Databricks Certified Data Engineer Professional Exam. Whether you're targeting Databricks certifications or expanding your professional portfolio, MyCertsHub gives you the tools to succeed on your first attempt.
Every set of exam dumps is carefully reviewed by certified experts to ensure accuracy. For the Databricks-Certified-Professional-Data-Engineer (Databricks Certified Data Engineer Professional) exam, you'll receive updated practice questions designed to reflect real-world exam conditions. This approach saves time, builds confidence, and focuses your preparation on the most important exam areas.
Realistic Test Prep For The Databricks-Certified-Professional-Data-Engineer
You can instantly access downloadable PDFs of Databricks-Certified-Professional-Data-Engineer practice exams with MyCertsHub. These include authentic practice questions paired with explanations, making our exam guide a complete preparation tool. By testing yourself before exam day, you’ll walk into the Databricks Exam with confidence.
Smart Learning With Exam Guides
Our structured Databricks-Certified-Professional-Data-Engineer exam guide focuses on the Databricks Certified Data Engineer Professional Exam's core topics and question patterns. You will be able to concentrate on what really matters for passing the test rather than wasting time on irrelevant content.
Pass The Databricks-Certified-Professional-Data-Engineer Exam – Guaranteed
We Offer A 100% Money-Back Guarantee On Our Products.
If you don't pass the Databricks Certified Data Engineer Professional Exam after preparing with MyCertsHub's exam dumps, we will issue a full refund. That's how confident we are in the effectiveness of our study resources.
Try Before You Buy – Free Demo
Still undecided? See for yourself how MyCertsHub has helped thousands of candidates achieve success by downloading a free demo of the Databricks-Certified-Professional-Data-Engineer exam dumps.
MyCertsHub – Your Trusted Partner For Databricks Exams
Whether you’re preparing for Databricks Certified Data Engineer Professional Exam or any other professional credential, MyCertsHub provides everything you need: exam dumps, practice exams, practice questions, and exam guides. Passing your Databricks-Certified-Professional-Data-Engineer exam has never been easier thanks to our tried-and-true resources.
Question # 1
A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed some latency issues during certain times of the day.
A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?
A. New fields will not be computed for historic records.
B. Updating the table schema will invalidate the Delta transaction log metadata.
C. Updating the table schema requires a default value provided for each file added.
D. Spark cannot capture the topic and partition fields from the Kafka source.
Answer: A
Explanation: When adding new fields to a Delta table's schema, these fields will not be retrospectively applied to historical records that were ingested before the schema change. Consequently, while the team can use the new metadata fields to investigate transient processing delays moving forward, they will be unable to apply this diagnostic approach to past data that lacks these fields.
References: Databricks documentation on Delta Lake schema management: https://docs.databricks.com/delta/delta-batch.html#schema-management
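Below is a minimal PySpark sketch of what the updated ingestion logic might look like; the broker address, topic name, and checkpoint path are placeholders, and mergeSchema is one way to allow the new columns on the existing bronze table. Historic records will simply hold null for the added fields, which is exactly the limitation described above.

from pyspark.sql import functions as F

# Hypothetical bronze ingestion after the schema update: Kafka's source schema
# already exposes topic and partition columns, and current_timestamp() records
# the Spark-side processing time.
bronze_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
    .select(
        "timestamp", "key", "value",                    # original bronze fields
        "topic", "partition",                           # new metadata fields
        F.current_timestamp().alias("processing_time")  # Spark-recorded timestamp
    )
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze")  # placeholder path
    .option("mergeSchema", "true")   # add the new columns to the existing table
    .toTable("bronze"))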
Question # 2
The data architect has decided that once data has been ingested from external sources
into the
Databricks Lakehouse, table access controls will be leveraged to manage permissions for
all production tables and views.
The following logic was executed to grant privileges for interactive queries on a production
database to the core engineering group.
GRANT USAGE ON DATABASE prod TO eng;
GRANT SELECT ON DATABASE prod TO eng;
Assuming these are the only privileges that have been granted to the eng group and that
these users are not workspace administrators, which statement describes their privileges?
A. Group members have full permissions on the prod database and can also assign permissions to other users or groups.
B. Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.
C. Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.
D. Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.
E. Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.
Answer: D
Explanation: The GRANT USAGE ON DATABASE prod TO eng command grants the eng group the permission to use the prod database, which means they can list and access the tables and views in the database. The GRANT SELECT ON DATABASE prod TO eng command grants the eng group the permission to select data from the tables and views in the prod database, which means they can query the data using SQL or the DataFrame API. However, these commands do not grant the eng group any other permissions, such as creating, modifying, or deleting tables and views, or defining custom functions. Therefore, the eng group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.
References: Grant privileges on a database: https://docs.databricks.com/en/security/auth-authz/table-acls/grant-privileges-database.html Privileges you can grant on Hive metastore objects: https://docs.databricks.com/en/security/auth-authz/table-acls/privileges.html
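A small sketch of these privileges in action, assuming an administrator has run the grants on a table-ACL-enabled cluster and prod.orders is a hypothetical table in the database:

# Run by an administrator:
spark.sql("GRANT USAGE ON DATABASE prod TO eng")
spark.sql("GRANT SELECT ON DATABASE prod TO eng")

# Run as an eng group member: querying succeeds ...
spark.sql("SELECT * FROM prod.orders LIMIT 10").show()   # hypothetical table

# ... but any DDL or write is denied, since only USAGE and SELECT were granted.
try:
    spark.sql("CREATE TABLE prod.scratch (id INT)")
except Exception as e:
    print("Permission denied as expected:", e)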
Question # 3
An upstream system is emitting change data capture (CDC) logs that are being written to a
cloud object storage directory. Each record in the log indicates the change type (insert,
update, or delete) and the values for each field after the change. The source table has a
primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all
values that have ever been valid in the source system. For analytical purposes, only the
most recent value for each record needs to be recorded. The Databricks job to ingest these
records occurs once per hour, but each individual record may have changed multiple times
over the course of an hour.
Which solution meets these requirements?
A. Create a separate history table for each pk_id; resolve the current state of the table by running a union all, filtering the history tables for the most recent state.
B. Use merge into to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
E. Ingest all log information into a bronze table; use merge into to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
Answer: E
Explanation: This is the correct answer because it meets the requirements of maintaining a full record of all values that have ever been valid in the source system and recreating the current table state with only the most recent value for each record. The code ingests all log information into a bronze table, which preserves the raw CDC data as it is. Then, it uses merge into to perform an upsert operation on a silver table, which means it will insert new records or update or delete existing records based on the change type and the pk_id columns. This way, the silver table will always reflect the current state of the source table, while the bronze table will keep the history of all changes. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Upsert into a table using merge" section
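A sketch of this bronze-to-silver pattern under assumed names: bronze_cdc holds the raw log with pk_id, change_type, a sequence_num ordering column, and data columns field_a and field_b (everything except pk_id and the change types is a hypothetical stand-in):

from pyspark.sql import functions as F, Window

# Keep only the most recent change per primary key within this hourly batch.
latest = Window.partitionBy("pk_id").orderBy(F.col("sequence_num").desc())
changes = (
    spark.table("bronze_cdc")
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .drop("rn")
)
changes.createOrReplaceTempView("latest_changes")

# Upsert the net change per key into the silver table.
spark.sql("""
  MERGE INTO silver AS t
  USING latest_changes AS s
  ON t.pk_id = s.pk_id
  WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.field_a = s.field_a, t.field_b = s.field_b
  WHEN NOT MATCHED AND s.change_type != 'delete' THEN
    INSERT (pk_id, field_a, field_b) VALUES (s.pk_id, s.field_a, s.field_b)
""")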
Question # 4
What is the first line of a Databricks Python notebook when viewed in a text editor?
A. %python
B. # Databricks notebook source
C. -- Databricks notebook source
D. // Databricks notebook source
Answer: B
Explanation: When a Databricks Python notebook is viewed in a text editor, the first line indicates the format and source type of the notebook. The correct option is # Databricks notebook source, a special comment header that marks the start of a Databricks notebook source file.
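For illustration, a two-cell Python notebook exported as a source file looks like this (cells are separated by # COMMAND ---------- markers):

# Databricks notebook source
print("hello from cell 1")

# COMMAND ----------

print("hello from cell 2")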
Question # 5
Which statement regarding Spark configuration on the Databricks platform is true?
A. Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
B. When the same Spark configuration property is set for an interactive to the same interactive cluster.
C. Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.
D. The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.
Answer: A
Explanation: When Spark configuration properties are set for an interactive cluster using the Clusters UI in Databricks, those configurations are applied at the cluster level. This means that all notebooks attached to that cluster will inherit and be affected by these configurations. This approach ensures consistency across all executions within that cluster, as the Spark configuration properties dictate aspects such as memory allocation, number of executors, and other vital execution parameters. This centralized configuration management helps maintain standardized execution environments across different notebooks, aiding in debugging and performance optimization.
References: Databricks documentation on configuring clusters: https://docs.databricks.com/clusters/configure.html
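A quick way to see the distinction from a notebook, sketched below: a property set in the Clusters UI is readable from any attached notebook, while a conf set inside one notebook stays local to that notebook's SparkSession.

# Visible in every notebook attached to the cluster when set via the Clusters UI:
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Session-scoped override: affects only this notebook's SparkSession.
spark.conf.set("spark.sql.shuffle.partitions", "64")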
Question # 6
The data architect has mandated that all tables in the Lakehouse should be configured as
external (also known as "unmanaged") Delta Lake tables.
Which approach will ensure that this requirement is met?
A. When a database is being created, make sure that the LOCATION keyword is used.
B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
C. When data is saved to a table, make sure that a full file path is specified alongside the Delta format.
D. When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
E. When the workspace is being configured, make sure that external cloud object storage has been mounted.
Answer: D
Explanation: To create an external or unmanaged Delta Lake table, you need to use the EXTERNAL keyword in the CREATE TABLE statement. This indicates that the table is not managed by the catalog and the data files are not deleted when the table is dropped. You also need to provide a LOCATION clause to specify the path where the data files are stored. For example:
CREATE EXTERNAL TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA
LOCATION '/mnt/delta/events';
This creates an external Delta Lake table named events that references the data files in the '/mnt/delta/events' path. If you drop this table, the data files will remain intact and you can recreate the table with the same statement.
References: https://docs.databricks.com/delta/delta-batch.html#create-a-table https://docs.databricks.com/delta/delta-batch.html#drop-a-table
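The DataFrame API route hinted at in option C behaves similarly: writing with an explicit path also produces an unmanaged table. A sketch with a toy DataFrame:

# Saving with an explicit path registers the table but leaves the data
# unmanaged; dropping the table later will not delete these files.
df = spark.range(5)   # toy data for illustration
(df.write.format("delta")
   .option("path", "/mnt/delta/events")
   .saveAsTable("events"))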
Question # 7
The DevOps team has configured a production workload as a collection of notebooks
scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the
team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without
allowing accidental changes to production code or data?
A. Can Manage
B. Can Edit
C. Can Run
D. Can Read
Answer: D
Explanation: Granting a user "Can Read" permissions on a notebook within Databricks allows them to view the notebook's content without the ability to execute or edit it. This level of permission ensures that the new team member can review the production logic for learning or auditing purposes without the risk of altering the notebook's code or affecting production data and workflows. This approach aligns with best practices for maintaining security and integrity in production environments, where strict access controls are essential to prevent unintended modifications.
References: Databricks documentation on access control and permissions for notebooks within the workspace (https://docs.databricks.com/security/access-control/workspace-acl.html).
Question # 8
The marketing team is looking to share data in an aggregate table with the sales
organization, but the field names used by the teams do not match, and a number of
marketing-specific fields have not been approved for the sales org.
Which of the following solutions addresses the situation while emphasizing simplicity?
A. Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.
B. Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.
C. Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
D. Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
Answer: A
Explanation: Creating a view is a straightforward solution that can address the need for
field name standardization and selective field sharing between departments. A view allows
for presenting a transformed version of the underlying data without duplicating it. In this
scenario, the view would only include the approved fields for the sales team and rename
any fields as per their naming conventions.
References:
Databricks documentation on using SQL views in Delta Lake:
https://docs.databricks.com/delta/quick-start.html#sql-views
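A sketch of such a view, with all schema, table, and field names hypothetical:

spark.sql("""
  CREATE VIEW sales.campaign_summary AS      -- hypothetical target name
  SELECT
    campaign_id AS promo_id,                 -- renamed to sales conventions
    spend_usd   AS budget_usd,
    conversions                              -- approved fields only
  FROM marketing.aggregate_stats             -- hypothetical source table
""")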
Question # 9
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with DBFS for use with a production job?
A. configure
B. fs
C. jobs
D. libraries
E. workspace
Answer: B
Explanation: The fs command group allows you to interact with the Databricks File System (DBFS). The databricks fs cp command copies a local file, such as a custom Python Wheel, to object storage mounted with DBFS. For example: databricks fs cp mylib-0.1-py3-none-any.whl dbfs:/mnt/libraries/mylib-0.1-py3-none-any.whl Once uploaded, the wheel can be referenced as a dependent library when configuring a production job. The libraries command group, by contrast, installs libraries on an existing cluster but does not upload files to DBFS-mounted storage.
References: Install or update the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/install.html
Question # 10
A DLT pipeline includes the following streaming tables:
raw_iot, which ingests raw device measurement data from a heart rate tracking device.
bpm_stats, which incrementally computes user statistics based on BPM measurements from raw_iot.
How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?
A. Set the skipChangeCommits flag to true on bpm_stats
B. Set the skipChangeCommits flag to true on raw_iot
C. Set the pipelines.reset.allowed property to false on bpm_stats
D. Set the pipelines.reset.allowed property to false on raw_iot
Answer: D
Explanation: In Databricks Lakehouse, to retain manually deleted or updated records in the raw_iot table while recomputing downstream tables when a pipeline update is run, the property pipelines.reset.allowed should be set to false. This property prevents the system from resetting the state of the table, which includes the removal of the history of changes, during a pipeline update. By keeping this property as false, any changes to the raw_iot table, including manual deletes or updates, are retained, and recomputation of downstream tables, such as bpm_stats, can occur with the full history of data changes intact.
References: Databricks documentation on DLT pipelines: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-overview.html
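A sketch of this setting in DLT Python (the landing path and the stats logic are assumptions, and this code only runs inside a DLT pipeline):

import dlt

@dlt.table(
    name="raw_iot",
    table_properties={"pipelines.reset.allowed": "false"}  # keep data on full refresh
)
def raw_iot():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/iot/landing")   # placeholder landing path
    )

@dlt.table(name="bpm_stats")
def bpm_stats():
    # simplified placeholder for the real incremental statistics
    return dlt.read("raw_iot").groupBy("user_id").count()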
Question # 11
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
A. "Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
B. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
C. "Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
D. "Manage" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
Answer: C
Explanation: In Databricks, the Secrets module allows for secure management of sensitive information such as database credentials. Granting "Read" permission on a secret scope containing only the credentials for a specific team ensures that only members of that team can access those credentials. This approach aligns with the principle of least privilege, granting users the minimum level of access required to perform their jobs, thus enhancing security.
References: Databricks Documentation on Secret Management: Secrets
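Once "Read" on the team's scope has been granted, consuming a credential looks like this (the scope, key names, and connection details are all assumptions):

user = dbutils.secrets.get(scope="team-analytics", key="external-db-user")
pwd = dbutils.secrets.get(scope="team-analytics", key="external-db-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/sales")  # placeholder
      .option("dbtable", "public.orders")                            # placeholder
      .option("user", user)
      .option("password", pwd)
      .load())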
Question # 12
The data engineer is using Spark's MEMORY_ONLY storage level.
Which indicators should the data engineer look for in the Spark UI's Storage tab to signal
that a cached table is not performing optimally?
A. Size on Disk is > 0
B. The number of Cached Partitions > the number of Spark Partitions
C. The RDD Block Name includes the _disk annotation, signaling failure to cache
D. On Heap Memory Usage is within 75% of Off Heap Memory Usage
Answer: C
Explanation: In the Spark UI's Storage tab, an indicator that a cached table is not performing optimally would be the presence of the _disk annotation in the RDD Block Name. This annotation indicates that some partitions of the cached data have been spilled to disk because there wasn't enough memory to hold them. This is suboptimal because accessing data from disk is much slower than from memory. The goal of caching is to keep data in memory for fast access, and a spill to disk means that this goal is not fully achieved.
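To reproduce the scenario, cache a table with MEMORY_ONLY and materialize it, then inspect the Storage tab; a sketch with an assumed table name:

from pyspark import StorageLevel

df = spark.table("activity_details")   # hypothetical table
df.persist(StorageLevel.MEMORY_ONLY)
df.count()   # an action is needed so the cache shows up in the Storage tab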
Question # 13
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, while the max duration is roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
A. Task queueing resulting from improper thread pool assignment.
B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the source data.
D. Skew caused by more data being assigned to a subset of spark-partitions.
E. Credential validation errors while pulling data from an external system.
Answer: D
Explanation: This is the correct answer because skew is a common situation that causes increased duration of the overall job. Skew occurs when some partitions have more data than others, resulting in uneven distribution of work among tasks and executors. Skew can be caused by various factors, such as skewed data distribution, improper partitioning strategy, or join operations with skewed keys. Skew can lead to performance issues such as long-running tasks, wasted resources, or even task failures due to memory or disk spills. Verified References: [Databricks Certified Data Engineer Professional], under "Performance Tuning" section; Databricks Documentation, under "Skew" section.
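Two common mitigations for this kind of skew, sketched below; df and join_key are assumptions standing in for the skewed DataFrame and its hot key:

from pyspark.sql import functions as F

# 1) Let Adaptive Query Execution split oversized join partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# 2) Salt the hot key so its rows spread across several partitions.
salted = df.withColumn("salt", (F.rand() * 16).cast("int"))
repartitioned = salted.repartition("join_key", "salt")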
Question # 14
Spill occurs as a result of executing various wide transformations. However, diagnosing
spill requires one to proactively look for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?
A. Stage's detail screen and Executor's files
B. Stage's detail screen and Query's detail screen
C. Driver's and Executor's log files
D. Executor's detail screen and Executor's log files
Answer: B
Explanation: In Apache Spark's UI, indicators of data spilling to disk during the execution of wide transformations can be found in the Stage's detail screen and the Query's detail screen. These screens provide detailed metrics about each stage of a Spark job, including information about memory usage and spill data. If a task is spilling data to disk, it indicates that the data being processed exceeds the available memory, causing Spark to spill data to disk to free up memory. This is an important performance metric as excessive spill can significantly slow down the processing.
References: Apache Spark Monitoring and Instrumentation: Spark Monitoring Guide Spark UI Explained: Spark UI Documentation
Question # 15
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.
One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
What approach would allow them to do this?
A. Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter.
B. Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
C. Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
D. Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file references.
Answer: A
Explanation: Maintaining data quality rules in a centralized Delta table allows for the reuse of these rules across multiple DLT (Delta Live Tables) pipelines. By storing these rules outside the pipeline's target schema and referencing the schema name as a pipeline parameter, the team can apply the same set of data quality checks to different tables within the pipeline. This approach ensures consistency in data quality validations and reduces redundancy in code by not having to replicate the same rules in each DLT notebook or file.
References: Databricks Documentation on Delta Live Tables: Delta Live Tables Guide
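A sketch of this pattern in DLT Python; the rules table, its name and constraint columns, the rules_schema pipeline parameter, and the orders tables are all assumptions:

import dlt

rules_schema = spark.conf.get("rules_schema")   # passed as a pipeline parameter
rules = {
    r["name"]: r["constraint"]
    for r in spark.table(f"{rules_schema}.data_quality_rules").collect()
}

@dlt.table
@dlt.expect_all(rules)   # the same rule set can decorate every table
def orders_clean():
    return dlt.read("orders_raw")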
Question # 16
A data engineer, User A, has promoted a new pipeline to production by using the REST
API to programmatically create several jobs. A DevOps engineer, User B, has configured
an external orchestration tool to trigger job runs through the REST API. Both users
authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these
events?
A. Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
B. Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.
C. Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.
D. Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.
E. Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
Answer: C
Explanation: The events are that a data engineer, User A, has promoted a new pipeline to
production by using the REST API to programmatically create several jobs, and a DevOps
engineer, User B, has configured an external orchestration tool to trigger job runs through
the REST API. Both users authorized the REST API calls using their personal access
tokens. The workspace audit logs are logs that record user activities in a Databricks
workspace, such as creating, updating, or deleting objects like clusters, jobs, notebooks, or
tables. The workspace audit logs also capture the identity of the user who performed each
activity, as well as the time and details of the activity. Because these events are managed
separately, User A will have their identity associated with the job creation events and User
B will have their identity associated with the job run events in the workspace audit logs.
Verified References: [Databricks Certified Data Engineer Professional], under “Databricks
Workspace” section; Databricks Documentation, under “Workspace audit logs” section
Question # 17
A junior developer complains that the code in their notebook isn't producing the correct
results in the development environment. A shared screenshot reveals that while they're
using a notebook versioned with Databricks Repos, they're using a personal branch that
contains old logic. The desired branch named dev-2.3.9 is not available from the branch
selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
D. Merge all changes back to the main branch in the remote Git repository and clone the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository
Answer: B
Explanation: This is the correct answer because it will allow the developer to update their local repository with the latest changes from the remote repository and switch to the desired branch. Pulling changes will not affect the current branch or create any conflicts, as it will only fetch the changes and not merge them. Selecting the dev-2.3.9 branch from the dropdown will checkout that branch and display its contents in the notebook. Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Tooling" section; Databricks Documentation, under "Pull changes from a remote repository" section.
Question # 18
A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.
How can the data engineer run unit tests against functions that work with data in production?
A. Run unit tests against non-production data that closely mirrors production
B. Define and unit test functions using Files in Repos
C. Define unit tests and functions within the same notebook
D. Define and import unit test functions from a separate Databricks notebook
Answer: A
Explanation: The best practice for running unit tests on functions that interact with data is to use a dataset that closely mirrors the production data. This approach allows data engineers to validate the logic of their functions without the risk of affecting the actual production data. It's important to have a representative sample of production data to catch edge cases and ensure the functions will work correctly when used in a production environment.
References: Databricks Documentation on Testing: Testing and Validation of Data and Notebooks
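A minimal pytest sketch of this approach; the transforms module and normalize_amounts function are hypothetical stand-ins for production code:

from pyspark.sql import SparkSession
from transforms import normalize_amounts   # hypothetical production function

def test_normalize_amounts():
    spark = SparkSession.builder.getOrCreate()
    # small sample shaped like production data, including an edge case
    sample = spark.createDataFrame(
        [(1, "19.99"), (2, None)], ["order_id", "amount"]
    )
    result = normalize_amounts(sample)
    assert result.filter("amount IS NULL").count() == 1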
Question # 19
The Databricks workspace administrator has configured interactive clusters for each of the
data engineering groups. To control costs, clusters are set to terminate after 30 minutes of
inactivity. Each user should be able to execute workloads against their assigned clusters at
any time of the day. Assuming users have been added to a workspace but not granted any permissions, which
of the following describes the minimal permissions a user would need to start and attach to
an already configured cluster?
A. "Can Manage" privileges on the required cluster
B. Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the
required cluster
C. Cluster creation allowed. "Can Attach To" privileges on the required cluster
D. "Can Restart" privileges on the required cluster
E. Cluster creation allowed. "Can Restart" privileges on the required cluster
Answer: D
Explanation: "Can Restart" allows a user to start, restart, and attach to an already configured cluster without being able to modify its configuration, which is the minimum needed here. See: https://learn.microsoft.com/en-us/azure/databricks/security/auth-authz/access-control/cluster-acl
https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html
A. "Can Manage" privileges on the required cluster B. Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster C. Cluster creation allowed. "Can Attach To" privileges on the required cluster D. "Can Restart" privileges on the required cluster E. Cluster creation allowed. "Can Restart" privileges on the required cluster
The data engineering team maintains a table of aggregate statistics through batch nightly
updates. This includes total sales for the previous day alongside totals and averages for a
variety of time periods including the 7 previous days, year-to-date, and quarter-to-date.
This table is named store_sales_summary and the schema is as follows:
The table daily_store_sales contains all the information needed to update
store_sales_summary. The schema for this table is:
store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might
be adjusted after manual data auditing, which approach is the safest to generate accurate
reports in the store_sales_summary table?
A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
C. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
Answer: E
Explanation: The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is: store_id INT, sales_date DATE, total_sales FLOAT. The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.
The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.
The change data feed is a feature of Delta Lake that provides structured streaming sources that can subscribe to changes made to a Delta Lake table. The change data feed captures both data changes and schema changes as ordered events that can be processed by downstream applications or services. The change data feed can be configured with different options, such as starting from a specific version or timestamp, filtering by operation type or partition values, or excluding no-op changes.
By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data.
Verified References: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "Structured Streaming" section; Databricks Documentation, under "Delta Change Data Feed" section.
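A sketch of this change-data-feed subscription; the checkpoint path is a placeholder and the per-batch aggregation logic is elided:

# Enable the change data feed on the source table (a one-time setting).
spark.sql("""
  ALTER TABLE daily_store_sales
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

def upsert_aggregates(microbatch_df, batch_id):
    # Recompute aggregates for the affected stores and MERGE them into
    # store_sales_summary; the actual logic is elided in this sketch.
    microbatch_df.createOrReplaceTempView("changes")

(spark.readStream
   .format("delta")
   .option("readChangeFeed", "true")
   .table("daily_store_sales")
   .writeStream
   .foreachBatch(upsert_aggregates)
   .option("checkpointLocation", "/mnt/checkpoints/sales_summary")  # placeholder
   .start())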
Question # 21
A production workload incrementally applies updates from an external Change Data
Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was
initially migrated for this table, OPTIMIZE was executed and most data files were resized to
1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming
production job. Recent review of data files shows that most data files are under 64 MB,
although each partition in the table contains at least 1 GB of data and the total table size is
over 10 TB.
Which of the following likely explains these smaller file sizes?
A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
B. Z-order indices calculated on the table are preventing file compaction
C. Bloom filter indices calculated on the table are preventing file compaction
D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Answer: A
Explanation: This is the correct answer because Databricks has a feature called Auto Optimize, which automatically optimizes the layout of Delta Lake tables by coalescing small files into larger ones and sorting data within each file by a specified column. However, Auto Optimize also considers the trade-off between file size and merge performance, and may choose a smaller target file size to reduce the duration of merge operations, especially for streaming workloads that frequently update existing records. Therefore, it is possible that Auto Optimize has autotuned to a smaller target file size based on the characteristics of the streaming production job.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Auto Optimize" section. https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table "Autotune file size based on workload"
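Related table-level knobs, sketched here with a placeholder table name and illustrative values:

# Opt the table into write-optimized (smaller) file sizing for MERGE-heavy loads:
spark.sql("""
  ALTER TABLE cdc_target
  SET TBLPROPERTIES (delta.tuneFileSizesForRewrites = true)
""")

# Or pin an explicit target size to override autotuning:
spark.sql("""
  ALTER TABLE cdc_target
  SET TBLPROPERTIES (delta.targetFileSize = '256mb')
""")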
Question # 22
Which statement regarding stream-static joins and static Delta tables is correct?
A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.
B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
C. The checkpoint directory will be used to track state information for the unique keys present in the join.
D. Stream-static joins cannot use static Delta tables because of consistency issues.
E. The checkpoint directory will be used to track updates to the static Delta table.
Answer: A
Explanation: This is the correct answer because stream-static joins are supported by Structured Streaming when one of the tables is a static Delta table. A static Delta table is a Delta table that is not updated by any concurrent writes, such as appends or merges, during the execution of a streaming query. In this case, each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch, which means it will reflect any changes made to the static Delta table before the start of each microbatch. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Stream and static joins" section.
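A minimal stream-static join sketch with assumed table and key names; the static side is re-read so each microbatch sees the latest committed version of customers:

orders_stream = spark.readStream.table("orders_bronze")   # streaming side
customers = spark.read.table("customers")                 # static Delta table

enriched = orders_stream.join(customers, on="customer_id", how="left")

(enriched.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/enriched")  # placeholder
   .toTable("orders_enriched"))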
Question # 23
A CHECK constraint has been successfully added to the Delta table named activity_details
using the following logic:
A batch job is attempting to insert new records to the table, including a record where
latitude = 45.50 and longitude = 212.67.
Which statement describes the outcome of this batch insert?
A. The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.
B. The write will fail completely because of the constraint violation and no records will be inserted into the target table.
C. The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.
D. The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.
E. The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.
Answer: B
Explanation: The CHECK constraint is used to ensure that the data inserted into the table meets the specified conditions. In this case, the CHECK constraint is used to ensure that the latitude and longitude values are within the specified range. If the data does not meet the specified conditions, the write operation will fail completely and no records will be inserted into the target table. This is because Delta Lake supports ACID transactions, which means that either all the data is written or none of it is written. Therefore, the batch insert will fail when it encounters a record that violates the constraint, and the target table will not be updated.
References: Constraints: https://docs.delta.io/latest/delta-constraints.html ACID Transactions: https://docs.delta.io/latest/delta-intro.html#acid-transactions
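The exact constraint expression is not reproduced above; a plausible reconstruction (bounds assumed) is sketched below. With it in place, longitude = 212.67 violates the check, so the whole transactional write fails:

spark.sql("""
  ALTER TABLE activity_details
  ADD CONSTRAINT valid_coordinates
  CHECK (latitude BETWEEN -90 AND 90 AND longitude BETWEEN -180 AND 180)
""")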
Question # 24
A distributed team of data analysts share computing resources on an interactive cluster
with autoscaling configured. In order to better manage costs and query throughput, the
workspace administrator is hoping to evaluate whether cluster upscaling is caused by many
concurrent users or resource-intensive queries.
In which location can one review the timeline for cluster resizing events?
A. Workspace audit logs
B. Driver's log file
C. Ganglia
D. Cluster Event Log
E. Executor's log file
Answer: D
Explanation: The Cluster Event Log records lifecycle events for a cluster, including resizing events from both autoscaling and manual resizes, along with timestamps and the triggering cause. Reviewing this timeline lets the administrator correlate upscaling with concurrent users or resource-intensive queries. Ganglia provides live resource metrics but not a timeline of resize events.
Question # 25
When scheduling Structured Streaming jobs for production, which configuration
automatically recovers from query failures and keeps costs low?
A. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
B. Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
E. Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1
Answer: D
Explanation: The configuration that automatically recovers from query failures and keeps costs low is to use a new job cluster, set retries to unlimited, and set maximum concurrent runs to 1. This configuration has the following advantages:
A new job cluster is a cluster that is created and terminated for each job run. This means that the cluster resources are only used when the job is running, and no idle costs are incurred. This also ensures that the cluster is always in a clean state and has the latest configuration and libraries for the job.
Setting retries to unlimited means that the job will automatically restart the query in case of any failure, such as network issues, node failures, or transient errors. This improves the reliability and availability of the streaming job, and avoids data loss or inconsistency.
Setting maximum concurrent runs to 1 means that only one instance of the job can run at a time. This prevents multiple queries from competing for the same resources or writing to the same output location, which can cause performance degradation or data corruption.
Therefore, this configuration is the best practice for scheduling Structured Streaming jobs for production, as it ensures that the job is resilient, efficient, and consistent.
References: Job clusters, Job retries, Maximum concurrent runs
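A sketch of this configuration expressed as a Jobs API 2.1 payload; the job name, notebook path, and cluster spec are illustrative:

import json

job_spec = {
    "name": "structured-streaming-prod",
    "max_concurrent_runs": 1,                 # one active run at a time
    "tasks": [{
        "task_key": "stream",
        "notebook_task": {"notebook_path": "/Prod/stream_job"},  # placeholder
        "new_cluster": {                      # fresh job cluster per run
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "max_retries": -1,                    # -1 retries indefinitely
    }],
}
print(json.dumps(job_spec, indent=2))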
Feedback That Matters: Reviews of Our Databricks Databricks-Certified-Professional-Data-Engineer Dumps
Vincent JohnstonMay 16, 2026
Certified as a Databricks Certified Professional Data Engineer! The practice questions from MyCertsHub were a lifesaver, especially for topics like advanced Delta Lake and optimization.
Delaney WilliamsMay 15, 2026
Real-world data engineering scenarios are tested on this exam. I received more than just theoretical examples from MyCertsHub. Their practice tests were almost identical in difficulty and structure to the real thing.
Amelia ThomasMay 15, 2026
Prepare with MyCertsHub if you're not 100% confident with Spark, Delta Lake, and performance tuning. Their resources helped me grasp concepts that came up frequently on the Databricks Professional exam.
Dorothy LewisMay 14, 2026
Passed the Databricks Data Engineer Pro exam with a score of 89%! The content on MyCertsHub was well-organized, and their explanations were superior to those on free dumps.
Prabhat TalwarMay 14, 2026
I was able to find new employment opportunities after becoming certified as a Databricks Professional Data Engineer. MyCertsHub played a significant role because of their scenario-driven practice questions, which assisted me in connecting my platform knowledge to the impact on the business.