Databricks Certified Associate Developer for Apache Spark 3.5 – Python
Questions: 136 questions and answers with explanations
Update Date: May 18, 2026
Why Should You Prepare For Your Databricks Certified Associate Developer for Apache Spark 3.5 – Python With MyCertsHub?
At MyCertsHub, we go beyond standard study material. Our platform provides authentic Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam dumps, detailed exam guides, and reliable practice exams that mirror the actual Databricks Certified Associate Developer for Apache Spark 3.5 – Python test. Whether you’re targeting Databricks certifications or expanding your professional portfolio, MyCertsHub gives you the tools to succeed on your first attempt.
Every set of exam dumps is carefully reviewed by certified experts to ensure accuracy. For the Databricks Certified Associate Developer for Apache Spark 3.5 – Python exam (Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5), you’ll receive updated practice questions designed to reflect real-world exam conditions. This approach saves time, builds confidence, and focuses your preparation on the most important exam areas.
Realistic Test Prep For The Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5
You can instantly access downloadable PDFs of Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exams with MyCertsHub. These include authentic practice questions paired with explanations, making our exam guide a complete preparation tool. By testing yourself before exam day, you’ll walk into the Databricks Exam with confidence.
Smart Learning With Exam Guides
Our structured Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam guide focuses on the core topics and question patterns of the Databricks Certified Associate Developer for Apache Spark 3.5 – Python exam. You will be able to concentrate on what really matters for passing the test rather than wasting time on irrelevant content.
Pass the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam – Guaranteed
We Offer A 100% Money-Back Guarantee On Our Products.
If you do not pass the Databricks Certified Associate Developer for Apache Spark 3.5 – Python exam after using MyCertsHub's exam dumps to prepare, we will issue a full refund. That’s how confident we are in the effectiveness of our study resources.
Try Before You Buy – Free Demo
Still undecided? See for yourself how MyCertsHub has helped thousands of candidates achieve success by downloading a free demo of the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam dumps.
MyCertsHub – Your Trusted Partner For Databricks Exams
Whether you’re preparing for Databricks Certified Associate Developer for Apache Spark 3.5 – Python or any other professional credential, MyCertsHub provides everything you need: exam dumps, practice exams, practice questions, and exam guides. Passing your Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam has never been easier thanks to our tried-and-true resources.
Question # 1
22 of 55. A Spark application needs to read multiple Parquet files from a directory where the files have differing but compatible schemas. The data engineer wants to create a DataFrame that includes all columns from all files. Which code should the data engineer use to read the Parquet files and include all columns using Apache Spark?
A. spark.read.parquet("/data/parquet/") B. spark.read.option("mergeSchema", True).parquet("/data/parquet/") C. spark.read.format("parquet").option("inferSchema", "true").load("/data/parquet/") D. spark.read.parquet("/data/parquet/").option("mergeAllCols", True)
Answer: B
Explanation:
When reading Parquet files, Spark does not merge schemas by default, so the resulting schema is
reliable only if all files share an identical structure.
If files have different but compatible schemas, you must enable schema merging by setting the
mergeSchema option to true, as in option B.
Question # 2
21 of 55. What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?
A. The number of days specified will be added to the start date. B. An error message of an invalid parameter will be returned. C. The same start date will be returned. D. The number of days specified will be removed from the start date.
Answer: A
Explanation:
In Spark SQL, the function date_sub(startDate, days) returns the date that is days before startDate.
If the days parameter is negative, Spark interprets it as subtracting a negative number, which
is equivalent to adding that many days to the start date (option A).
Why the other options are incorrect:
B: No error occurs; negative values are supported.
C: The start date changes if days ≠ 0.
D: Removing days would move the date backward; a negative value moves it forward.
Reference:
Spark SQL Functions: date_sub(startDate, days) and date_add(startDate, days) behavior.
Databricks Exam Guide (June 2025): Section "Using Spark SQL" (working with date and timestamp
functions).
Question # 3
20 of 55. What is the difference between df.cache() and df.persist() in Spark DataFrame?
A. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY. B. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and cache() - Can be used to set different storage levels. C. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_DESER). D. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.
Answer: D
Explanation:
Both cache() and persist() are Spark DataFrame storage operations that store computed results in
memory (and optionally on disk) to speed up subsequent actions on the same DataFrame.
Key difference:
cache() is shorthand for persist() with the default storage level, which for PySpark DataFrames is MEMORY_AND_DISK_DESER, as stated in option D.
persist() allows specifying different storage levels, such as MEMORY_ONLY or DISK_ONLY.
Applications: caching, persistence, and storage levels
Question # 4
19 of 55. A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function not available in the standard Spark functions library. The existing UDF code is:
import hashlib
from pyspark.sql.types import StringType
def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)
shake_256_udf = udf(shake_256, StringType())
The developer replaces this UDF with a Pandas UDF for better performance:
@pandas_udf(StringType())
def shake_256(raw: str) -> str:
    return hashlib.shake_256(raw.encode()).hexdigest(20)
However, the developer receives this error:
TypeError: Unsupported signature: (raw: str) -> str
What should the signature of the shake_256() function be changed to in order to fix this error?
A. def shake_256(raw: str) -> str: B. def shake_256(raw: [pd.Series]) -> pd.Series: C. def shake_256(raw: pd.Series) -> pd.Series: D. def shake_256(raw: [str]) -> [str]:
Answer: C
Explanation:
Pandas UDFs (vectorized UDFs) process entire Pandas Series objects, not scalar values. Each
invocation operates on a column (Series) rather than a single value.
This allows Spark to apply the function in a vectorized way, improving performance significantly over
traditional Python UDFs.
Why the other options are incorrect:
A/D: These define scalar functions, which are not compatible with Pandas UDFs.
B: Uses [pd.Series], a list literal rather than a type hint that pandas_udf supports.
Reference:
PySpark Pandas API: @pandas_udf decorator and function signatures
Question # 5
18 of 55. An engineer has two DataFrames: df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:
from pyspark.sql.functions import broadcast
df_result = df2.join(broadcast(df1), on="id", how="inner")
What is the purpose of using broadcast() in this scenario?
A. It increases the partition size for df1 and df2. B. It ensures that the join happens only when the id values are identical. C. It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes. D. It filters the id values before performing the join.
Answer: C
Explanation:
A broadcast join is a type of join where the smaller DataFrame is replicated (broadcast) to all worker
nodes in the cluster. This avoids shuffling the large DataFrame across the network.
Benefits:
Avoids shuffling the large dataset; the small one is broadcast instead.
Greatly improves performance when one side of the join is small enough to fit in memory.
Correct usage example:
df_result = df2.join(broadcast(df1), "id")
This is a map-side join, where each executor joins its local partition of the large dataset with the
broadcasted copy of the small one.
Why the other options are incorrect:
A: Broadcasting does not change partition sizes.
B: Joins always match on key equality; this is not specific to broadcast joins.
D: Broadcasting does not filter; it distributes data for faster joins.
Question # 6
17 of 55. A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications. Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled. Which operation should AQE be implementing to automatically improve the Spark application performance?
A. Dynamically switching join strategies B. Collecting persistent table statistics and storing them in the metastore for future use C. Improving the performance of single-stage Spark jobs D. Optimizing the layout of Delta files on disk
Answer: A
Explanation:
Adaptive Query Execution (AQE) in Spark 3.x automatically optimizes query plans at runtime based
on the actual data characteristics observed during job execution.
Key features of AQE include:
Dynamic switching of join strategies: Changes between sort-merge join and broadcast join based on
actual shuffle sizes.
Coalescing shuffle partitions: Reduces small tasks and improves parallelism efficiency.
Handling skew joins: Dynamically splits large partitions to avoid data skew.
Thus, the most accurate answer describing AQE's function is "dynamically switching join strategies."
Why the other options are incorrect:
B: Persistent table statistics are collected explicitly (for example with ANALYZE TABLE), not by AQE, which relies on runtime statistics.
C: AQE benefits multi-stage jobs involving shuffles, not single-stage jobs.
D: Delta file layout optimization is handled by Delta Lake and Databricks utilities, not AQE.
Question # 7
16 of 55. A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately. Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)
A. Transformations are executed immediately to build the lineage graph. B. The Spark engine optimizes the execution plan during the transformations, causing delays. C. Transformations are evaluated lazily. D. The Spark engine requires manual intervention to start executing transformations. E. Only actions trigger the execution of the transformation pipeline.
Answer: C, E
Explanation:
Apache Spark follows a lazy evaluation model, meaning transformations (like filter(), select(), map())
are not executed immediately. Instead, they build a logical plan (lineage graph) that represents the
sequence of operations to be applied.
Execution only begins when an action (e.g., count(), collect(), save(), show()) is called. At that point,
Spark's engine:
Optimizes the logical plan into a physical plan.
Divides it into stages and tasks.
Executes them across the cluster.
This design helps Spark optimize execution paths and avoid unnecessary computations.
Why the other options are incorrect:
A: Transformations do not execute immediately; they are deferred.
B: Optimization happens during job execution (after an action), not during transformations.
D: Execution starts automatically once an action is triggered, no manual intervention needed.
lazy evaluation, actions vs. transformations, and execution hierarchy.
Spark 3.5 Documentation: Lazy Evaluation model and DAG scheduling.
Question # 8
15 of 55. A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:
id  name        count  timestamp
1   Delhi       20     2024-09-19T10:11
1   Delhi       50     2024-09-19T10:12
2   London      50     2024-09-19T10:15
3   Paris       30     2024-09-19T10:18
3   Paris       20     2024-09-19T10:20
4   Washington  10     2024-09-19T10:22
Which operation is supported with streaming_df?
A. streaming_df.count() B. streaming_df.filter("count < 30") C. streaming_df.select(countDistinct("name")) D. streaming_df.show()
Answer: B
Explanation:
In Structured Streaming, only transformation operations are allowed on streaming DataFrames.
These include select(), filter(), where(), groupBy(), withColumn(), etc.
Example of supported transformation:
filtered_df = streaming_df.filter("count < 30")
However, actions such as count(), show(), and collect() are not supported directly on streaming
DataFrames because streaming queries are unbounded and never finish until stopped.
To perform aggregations, the query must be executed through writeStream and an output sink.
Why the other options are incorrect:
A: count() is an action, not allowed directly on streaming DataFrames.
C: countDistinct() is a distinct aggregation, which Structured Streaming does not support on streaming DataFrames.
D: show() is also an action, unsupported on streaming queries.
Reference:
PySpark Structured Streaming Programming Guide: supported transformations and actions.
streaming DataFrames and understanding supported transformations.
Question # 9
14 of 55. A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:
df.write.partitionBy("color", "taste").parquet("/path/to/output")
What is the result of this code?
A. It appends new partitions to an existing Parquet file. B. It throws an error if there are null values in either partition column. C. It creates separate directories for each unique combination of color and taste. D. It stores all data in a single Parquet file.
Answer: C
Explanation:
When writing a DataFrame using .partitionBy() in Spark, the data is physically organized into
directory structures corresponding to unique combinations of the partition columns.
Question # 10
13 of 55. A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
region_id  region_name
10         North
12         East
14         West
The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values. Which code fragment meets the requirements?
A. regions_dict = dict(regions.take(3)) B. regions_dict = regions.select("region_id", "region_name").take(3) C. regions_dict = dict(regions.select("region_id", "region_name").rdd.collect()) D. regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())
Answer: D
Explanation:
To create a Python dictionary from a Spark DataFrame, you can first collect the data to the driver
node and then convert it into a Python dictionary using dict().
Steps:
Select only relevant columns.
Order by region_id to get the smallest ones.
Limit to 3 rows.
Map each row into key-value pairs.
Collect results to the driver and convert to a dictionary.
Correct code:
regions_dict = dict(
    regions.orderBy("region_id")
    .limit(3)
    .rdd.map(lambda x: (x.region_id, x.region_name))
    .collect()
)
This produces a dictionary like:
{10: 'North', 12: 'East', 14: 'West'}
Why the other options are incorrect:
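A: take(3) returns the first 3 rows without any ordering, so the smallest region_id values are not guaranteed.
B: Returns a list of Row objects rather than a dictionary, and also applies no ordering.
C: Doesn't order or limit by smallest IDs, so the result may not be correct.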
Question # 11
12 of 55. A data scientist has been investigating user profile data to build features for their model. After some exploratory data analysis, the data scientist identified that some records in the user profiles contain NULL values in too many fields to be useful. The schema of the user profile table looks like this:
user_id STRING,
username STRING,
date_of_birth DATE,
country STRING,
created_at TIMESTAMP
The data scientist decided that if any record contains a NULL value in any field, they want to remove that record from the output before further processing. Which block of Spark code can be used to achieve these requirements?
A. filtered_users = raw_users.na.drop("any") B. filtered_users = raw_users.na.drop("all") C. filtered_users = raw_users.dropna(how="any") D. filtered_users = raw_users.dropna(how="all")
Answer: C
Explanation:
In Spark's DataFrame API, the dropna() (or equivalently, DataFrameNaFunctions.drop()) method
removes rows containing null values.
Behavior:
how="any" → drops rows where any column has a null value.
how="all" → drops rows where all columns are null.
Since the data scientist wants to drop records with any null field, the correct parameter is how="any".
Correct syntax:
filtered_users = raw_users.dropna(how="any")
This will remove all records that have at least one null value in any column.
Why the other options are incorrect:
A: raw_users.na.drop("any") is also valid PySpark and behaves the same as option C, because how is the first positional parameter of DataFrameNaFunctions.drop(); the answer key lists C, the explicit keyword form.
B/D: how="all" only removes rows where every value is null, so records with some null fields would remain.
Reference:
PySpark DataFrame API: DataFrameNaFunctions.drop() and DataFrame.dropna().
executor configuration, CPU cores, and parallel task execution
Question # 13
10 of 55. What is the benefit of using Pandas API on Spark for data transformations?
A. It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features. B. It is available only with Python, thereby reducing the learning curve. C. It runs on a single node only, utilizing memory efficiently. D. It computes results immediately using eager execution.
Answer: A
Explanation:
Pandas API on Spark provides a distributed implementation of the Pandas DataFrame API on top of
Apache Spark.
Advantages:
Executes transformations in parallel across all nodes and cores in the cluster.
Maintains Pandas-like syntax, making it easy for Python users to transition.
Enables scaling of existing Pandas code to datasets larger than a single machine's memory.
Therefore, it combines Pandas usability with Spark's distributed power, offering both speed and
scalability.
Why the other options are incorrect:
B: While it uses Python, that's not its main advantage.
C: It runs distributed across the cluster, not on a single node.
D: Pandas API on Spark uses lazy evaluation, not eager computation.
Reference:
PySpark Pandas API Overview: advantages of distributed execution.
Databricks Exam Guide (June 2025): Section "Using Pandas API on Apache Spark" (explains the
benefits of Pandas API integration for scalable transformations).
Question # 14
9 of 55. Given the code fragment: import pyspark.pandas as ps pdf = ps.DataFrame(data) Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?
A. pdf.to_pandas() B. pdf.to_spark() C. pdf.to_dataframe() D. pdf.spark()
Answer: B
Explanation:
In Pandas API on Spark (previously Koalas), the method .to_spark() converts a
pyspark.pandas.DataFrame into a PySpark DataFrame.
Correct usage:
spark_df = pdf.to_spark()
This enables interoperability between the Pandas API on Spark and the PySpark SQL API, allowing
developers to switch seamlessly between both for transformations or performance optimization.
Why the other options are incorrect:
A (to_pandas): Converts to a local Pandas DataFrame, not a PySpark DataFrame.
C (to_dataframe): Not a valid API method.
D (spark): Not an existing DataFrame method.
Reference:
PySpark Pandas API Reference: DataFrame.to_spark() method.
Databricks Exam Guide (June 2025): Section "Using Pandas API on Apache Spark" (covers
DataFrame conversions and interoperability).
Question # 15
8 of 55. A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations. Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity. Which feature of Apache Spark effectively addresses this challenge?
A. Ability to process small datasets efficiently B. In-memory computation and parallel processing capabilities C. Support for SQL queries on structured data D. Built-in machine learning libraries
Answer: B
Explanation:
Apache Spark was designed for big data and high-velocity workloads. Its core strength lies in its
in-memory computation and parallel distributed processing model.
These features allow Spark to:
Process large-scale datasets quickly across many nodes.
Support real-time and near-real-time analytics for tasks like fraud detection and recommendations.
Minimize disk I/O through caching and memory persistence.
Thus, the key advantage in this use case is Spark's ability to handle large data volumes efficiently
using distributed, in-memory computation.
Why the other options are incorrect:
A: Spark is optimized for large, not small, datasets.
C: SQL support is useful but doesn't solve the scalability issue.
D: MLlib supports machine learning but relies on Spark's parallel computation for speed.
identifies Spark's advantages: in-memory processing, distributed computation, and scalability.
Apache Spark 3.5 Overview: Key design goals and cluster computation model
Question # 16
7 of 55. A developer has been asked to debug an issue with a Spark application. The developer identified that the data being loaded from a CSV file is being read incorrectly into a DataFrame. The CSV file has been read using the following Spark SQL statement:
CREATE TABLE locations USING csv OPTIONS (path '/data/locations.csv')
The first lines of the command SELECT * FROM locations look like this:
| city | lat | long |
| ALTI Sydney | -33... | ... |
Which parameter can the developer add to the OPTIONS clause in the CREATE TABLE statement to read the CSV data correctly again?
A. 'header' 'true' B. 'header' 'false' C. 'sep' ',' D. 'sep' '|'
Answer: A
Explanation:
When reading CSV files using Spark SQL or the DataFrame API, Spark by default assumes that the first
line of the file is data, not headers. To interpret the first line as column names, the header option
must be set to true.
Correct syntax:
CREATE TABLE locations
USING csv
OPTIONS (
    path '/data/locations.csv',
    header 'true'
);
This tells Spark to read the first row as column headers and correctly map columns like city, lat, and
long.
Why the other options are incorrect:
B (header 'false'): Default behavior; would keep reading header as data.
C / D (sep): Used to specify the delimiter; not relevant unless the file uses a different separator (e.g.,
|).
Reference (Databricks Apache Spark 3.5 – Python / Study Guide):
PySpark SQL Data Sources: CSV options (header, inferSchema, sep).
Databricks Exam Guide (June 2025): Section "Using Spark SQL" (reading data from files with
different formats using Spark SQL and DataFrame APIs).
Question # 17
6 of 55. Which components of Apache Spark's architecture are responsible for carrying out tasks when assigned to them?
A. Driver Nodes B. Executors C. CPU Cores D. Worker Nodes
Answer: B
Explanation:
In Spark's distributed architecture:
The Driver Node coordinates the execution of a Spark application. It converts the logical plan into a
physical plan of stages and tasks.
The Executors, running on Worker Nodes, are responsible for executing tasks assigned by the driver
and storing data (in memory or disk) during execution.
Key point:
Executors are the active agents that perform the actual computations on data partitions. Each
executor runs multiple tasks in parallel using available CPU cores.
Why the other options are incorrect:
A (Driver Nodes): The driver schedules tasks; it doesn't execute them.
C (CPU Cores): CPU cores execute within executors, but they are hardware, not Spark architectural
components.
D (Worker Nodes): Worker nodes host executors but do not directly execute tasks; executors do.
Reference (Databricks Apache Spark 3.5 – Python / Study Guide):
describes the roles of driver and executor nodes in distributed processing
Question # 18
5 of 55. What is the relationship between jobs, stages, and tasks during execution in Apache Spark?
A. A job contains multiple tasks, and each task contains multiple stages. B. A stage contains multiple jobs, and each job contains multiple tasks. C. A stage contains multiple tasks, and each task contains multiple jobs. D. A job contains multiple stages, and each stage contains multiple tasks.
Answer: D
Explanation:
In Apache Spark's execution hierarchy, the relationships are structured as follows:
Job: Created when an action (e.g., count(), collect(), save()) is triggered on an RDD or DataFrame.
Stage: Each job is divided into one or more stages, separated by shuffle boundaries (e.g., after a
reduceByKey or join).
Task: Each stage consists of multiple tasks, one per partition, executed in parallel on executors.
Execution Hierarchy:
Job → Stage(s) → Task(s)
So, a job contains multiple stages, and each stage contains multiple tasks.
Why the other options are incorrect:
A: A job does not directly contain tasks without stages.
B: A stage cannot contain multiple jobs; it belongs to a single job.
C: Tasks do not contain jobs.
Reference (Databricks Apache Spark 3.5 – Python / Study Guide):
Spark Architecture Overview: Execution Hierarchy: Jobs, Stages, and Tasks.
describes execution hierarchy and lazy evaluation.
Question # 19
4 of 55. A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance. Which action should the developer take to improve cluster utilization?
A. Increase the value of spark.sql.shuffle.partitions B. Reduce the value of spark.sql.shuffle.partitions C. Enable dynamic resource allocation to scale resources as needed D. Increase the size of the dataset to create more partitions
Answer: A
Explanation:
In Spark SQL and DataFrame operations, the configuration parameter spark.sql.shuffle.partitions
defines the number of partitions created during shuffle operations such as join, groupBy, and
distinct.
The default value (in Spark 3.5) is 200.
If this number is too low, Spark creates fewer tasks, leading to idle executors and poor cluster
utilization.
Increasing this value allows Spark to create more tasks that can run in parallel across executors,
which improves cluster utilization.
API Applications: tuning strategies, partitioning, and optimizing cluster utilization.
Question # 20
3 of 55. A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes. To remove the duplicates, the engineer adds the code:
df = df.withWatermark("event_timestamp", "30 minutes")
What is the result?
A. It removes all duplicates regardless of when they arrive. B. It accepts watermarks in seconds and the code results in an error. C. It removes duplicates that arrive within the 30-minute window specified by the watermark. D. It is not able to handle deduplication in this scenario.
Answer: C
Explanation:
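In Structured Streaming, a watermark defines the maximum delay for event-time data to be
considered in stateful operations like deduplication or window aggregations.
When the watermark is combined with a deduplication step such as dropDuplicates(), Spark keeps state only for events inside the 30-minute threshold, so duplicates that arrive within that window are removed.
"Structured Streaming" – Topic: Streaming Deduplication with and without watermark usage.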
Question # 21
2 of 55. Which command overwrites an existing JSON file when writing a DataFrame?
B. df.write.mode("append").json("path/to/file") C. df.write.option("overwrite").json("path/to/file") D. df.write.mode("overwrite").json("path/to/file")
Answer: D
Explanation:
When writing DataFrames to files using the Spark DataFrameWriter API, Spark by default raises an
error if the target path already exists. To explicitly overwrite existing data, you must specify the write
mode as "overwrite".
Correct Syntax:
df.write.mode("overwrite").json("path/to/file")
This command removes the existing file or directory at the specified path and writes the new output
in JSON format.
Other supported save modes include:
"append": Adds new data to existing files.
"ignore": Skips writing if the path already exists.
"error" or "errorifexists": Fails the job if the output path exists (default).
Why other options are incorrect:
A: Defaults to "error" mode, which fails if the path exists.
B: "append" only adds data; it does not overwrite existing data.
C: .option("overwrite") is invalid; mode("overwrite") must be used instead.
Reference (Databricks Apache Spark 3.5 – Python / Study Guide):
PySpark API Reference: DataFrameWriter.mode(), which describes valid write modes including
"overwrite".
PySpark API Reference: DataFrameWriter.json(), the method to write DataFrames in JSON format.
Spark DataFrame APIs: Reading and writing DataFrames using save modes, schema management,
and partitioning.
Question # 22
1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from. The first attempt does read the text files, but each record contains a single line. This code is shown below:
txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)  # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path
Which code change can be implemented in a DataFrame that meets the data scientist's requirements?
A. Add the option wholetext to the text() function. B. Add the option lineSep to the text() function. C. Add the option wholetext=False to the text() function. D. Add the option lineSep=", " to the text() function.
Answer: A
Explanation:
By default, the spark.read.text() method reads a text file one line per record. This means that each
line in a text file becomes one row in the resulting DataFrame.
To read each file as a single record, Apache Spark provides the option wholetext, which, when set to
True, causes Spark to treat the entire file contents as one single string per row.
Spark DataFrame APIs: covers reading files and handling DataFrames
Question # 23
What is the benefit of Adaptive Query Execution (AQE)?
A. It allows Spark to optimize the query plan before execution but does not adapt during runtime. B. It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance. C. It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew. D. It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
Answer: B
Explanation:
Adaptive Query Execution (AQE) is a powerful optimization framework introduced in Apache Spark
3.0 and enabled by default since Spark 3.2. It dynamically adjusts query execution plans based on
runtime statistics, leading to significant performance improvements. The key benefits of AQE
include:
Dynamic Join Strategy Selection: AQE can switch join strategies at runtime. For instance, it can
convert a sort-merge join to a broadcast hash join if it detects that one side of the join is small
enough to be broadcasted, thus optimizing the join operation.
Handling Skewed Data: AQE detects skewed partitions during join operations and splits them into
smaller partitions. This approach balances the workload across tasks, preventing scenarios where
certain tasks take significantly longer due to data skew.
Coalescing Post-Shuffle Partitions: AQE dynamically coalesces small shuffle partitions into larger ones
based on the actual data size, reducing the overhead of managing numerous small tasks and
improving overall query performance.
These runtime optimizations allow Spark to adapt to the actual data characteristics during query
execution, leading to more efficient resource utilization and faster query processing times.
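The features listed above map onto Spark SQL configuration properties. A minimal sketch, assuming an existing SparkSession is passed in as `spark`; the `AQE_SETTINGS` dictionary and the `enable_aqe` helper are hypothetical names used only for illustration, while the property keys are the ones Spark documents:

```python
# Spark SQL configuration keys related to AQE; the values are illustrative.
AQE_SETTINGS = {
    # Master switch for Adaptive Query Execution.
    "spark.sql.adaptive.enabled": "true",
    # Merge small post-shuffle partitions into larger ones.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark):
    """Apply the AQE settings to a running session (hypothetical helper)."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    return spark
```

This sketch sets no separate flag for dynamic join selection; that behaviour comes with AQE being enabled and depends on the broadcast threshold in effect.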
Question # 24
Given this view definition:
df.createOrReplaceTempView("users_vw")
Which approach can be used to query the users_vw view after the session is terminated?
Options:
A. Query the users_vw using Spark
B. Persist the users_vw data as a table
C. Recreate the users_vw and query the data using Spark
D. Save the users_vw definition and query using Spark
Answer: B
Explanation:
Temporary views created with createOrReplaceTempView are session-scoped.
They disappear once the Spark session ends.
To keep the data available across sessions, it must be persisted as a table:
df.write.saveAsTable("users_vw")
Thus, the view needs to be persisted as a table to survive session termination.
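A minimal sketch of that step, assuming the view still exists in the current session. The helper name, the target table name `users`, and the choice of overwrite mode are assumptions made for this example, not part of the question:

```python
def persist_view_as_table(spark, view_name: str, table_name: str) -> None:
    """Copy a session-scoped temp view into a table (hypothetical helper)."""
    # spark.table() resolves the temp view; saveAsTable() writes its rows
    # to the catalog so they outlive the current session.
    spark.table(view_name).write.mode("overwrite").saveAsTable(table_name)
```

A call such as `persist_view_as_table(spark, "users_vw", "users")` would then let a later session read the data with `spark.table("users")`.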
Reference: Databricks, Temp vs Global vs Permanent Views
Question # 25
A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path. Which line of code ensures the data is saved to a specific location?
Options:
A. users.write(path="/some/path").saveAsTable("default_table")
B. users.write.saveAsTable("default_table").option("path", "/some/path")
C. users.write.option("path", "/some/path").saveAsTable("default_table")
D. users.write.saveAsTable("default_table", path="/some/path")
Answer: C
Explanation:
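To persist a table and specify the save path, use:
users.write.option("path", "/some/path").saveAsTable("default_table")
The .option("path", ...) call must be applied before calling saveAsTable.
Option A uses invalid syntax: write is a property, not a method, so write(path=...) fails.
Option B applies .option() after .saveAsTable(), which is too late: saveAsTable() performs the write and returns None, so nothing can be chained after it.
Option D is not the form the question expects. saveAsTable has no dedicated path parameter, although PySpark forwards extra keyword arguments to the writer as options, so this call may also work in practice.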
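The ordering rule can be sketched as a small helper. The function name and its parameters are hypothetical; only `write`, `option("path", ...)`, and `saveAsTable` come from the question itself:

```python
def save_table_at(df, table_name: str, path: str) -> None:
    """Save df as a table whose files live at `path` (hypothetical helper)."""
    # Configure the writer first: saveAsTable() performs the write and
    # returns None, so options cannot be chained after it.
    df.write.option("path", path).saveAsTable(table_name)
```

Calling `save_table_at(users, "default_table", "/some/path")` corresponds to the answer's one-liner.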
Reference: Spark SQL - Save as Table
Feedback That Matters: Reviews of Our Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Dumps
Douglas Marshall, May 21, 2026
The most surprising thing about the MyCertsHub practice material was how closely it matched the actual difficulty of the Spark 3.5 exam. Even subtle topics like broadcast joins and partitioning strategies were well covered.
Brandon Richardson, May 20, 2026
As someone who switched from traditional SQL to Spark, I was concerned about API usage and performance optimization. The structured practice I followed made things much clearer and more approachable.
Donald Baker, May 20, 2026
I hadn't anticipated PySpark questions that went so deep into memory management and job stages, but the focused preparation material got me ready for them. I scored 91 percent without any guesswork.
Adam Lee, May 19, 2026
I would honestly like to thank MyCertsHub for helping me through the most difficult sections of the Spark 3.5 exam. Their breakdown of lazy evaluation and execution plans had a significant impact.
Christian Baker, May 19, 2026
With so many APIs and edge cases, the Databricks Spark 3.5 exam can be overwhelming. I learned to confidently answer questions about narrow versus wide transformations with the right preparation.
Caleb Wright, May 18, 2026
MyCertsHub felt more like a mentor than any of the other sites with copied dumps. Studying was significantly more enjoyable and effective thanks to their interactive practice and feedback.
Hans Haas, May 18, 2026
A big thank you to the team that made the resources I used! I finally grasped structured streaming and tuning operations in Spark 3.5, and I got a score of 89%.
Mahmood Aggarwal, May 17, 2026
After failing once, I switched to MyCertsHub and the difference was huge. I now know how to use DataFrame performance tricks, caching strategies, and DAGs. I passed this time with 92 percent. I am so grateful!