Databricks Certified Associate Developer for Apache Spark 3.0 Exam
180 Questions and Answers With Explanations
Update Date: 06/30/2026
Question # 1
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
A. Caching is not supported in Spark, data are always recomputed.
B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
C. The storage level is inappropriate for fault-tolerant storage.
D. The code block uses the wrong operator for caching.
E. The DataFrameWriter needs to be invoked.
Answer: C
Explanation:
The storage level is inappropriate for fault-tolerant storage.
Correct. For fault tolerance, you would typically want to store redundant copies of the dataset. This can be achieved with a replicated storage level such as StorageLevel.MEMORY_AND_DISK_2.
The code block uses the wrong operator for caching.
Wrong. DataFrame.persist() is the right operator here, since it accepts a storage level. DataFrame.cache() does not.
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important component of Spark, since it can accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is accessed through either DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter is accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, keywords such as "cache" and "executor memory" point away from external data stores: the aim is to keep data in memory to accelerate reads, since reading from disk is comparatively slow. The DataFrameWriter does not write to memory, so it cannot be used here.
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
Question # 2
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Answer: D
Explanation:
The challenge in this question comes from the array column in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is returned by spark.read.
The correct way to define an array of strings in a schema is ArrayType(StringType()). A schema can be passed to the DataFrameReader by appending schema(structType) to spark.read. Alternatively, you can define a schema as a string. For the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.
Keep in mind that in schema definitions you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.
Another concern with schemas is whether columns should be nullable, that is, allowed to hold null values. This is not a concern here, since the question just asks for a "valid" schema. Both non-nullable and nullable column schemas would be valid, since no null value appears in the DataFrame sample.
More info: Learning Spark, 2nd Edition, Chapter 3
Question # 3
The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.
Code block:
transactionsDf.coalesce(14, ("storeId", "transactionDate"))
A. The parentheses around the column names need to be removed and .select() needs to be appended to the code block.
B. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block. (Correct)
C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D. Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.
E. Operator coalesce needs to be replaced by repartition.
Explanation:
Since we do not know how many partitions DataFrame transactionsDf has, we cannot safely use coalesce: it would not make any change if the current number of partitions is smaller than 14. In addition, coalesce() accepts only a number of partitions, not columns. So, we need to use repartition.
In the Spark documentation, the call structure for repartition is shown like this: DataFrame.repartition(numPartitions, *cols). The * operator means that any argument after numPartitions will be interpreted as a column. Therefore, the parentheses around the column names need to be removed.
Finally, the question specifies that after the execution the DataFrame should be divided. So, indirectly, this asks us to append an action to the code block. Since .select() is a transformation, the only possible choice here is .count().
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.1 documentation
Question # 4
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:
A. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
B. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
C. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
Question # 5
Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?
A. union(todayTransactionsDf, yesterdayTransactionsDf)
B. todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
C. todayTransactionsDf.unionByName(yesterdayTransactionsDf)
D. todayTransactionsDf.concat(yesterdayTransactionsDf)
E. todayTransactionsDf.union(yesterdayTransactionsDf)
Answer: E
Explanation:
todayTransactionsDf.union(yesterdayTransactionsDf)
Correct. The union command appends the rows of yesterdayTransactionsDf to the rows of todayTransactionsDf by column position, ignoring that both DataFrames have different column names. The resulting DataFrame will have the column names of DataFrame todayTransactionsDf.
todayTransactionsDf.unionByName(yesterdayTransactionsDf)
No. unionByName matches columns in the two DataFrames by name and only appends values in columns with identical names across the two DataFrames. In the form shown, the command is a great fit for combining DataFrames that have exactly the same columns, but in a different order. In this case, though, the command will fail because the two DataFrames have different columns.
todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
No. With the allowMissingColumns argument set to True, it is no longer an issue that the two DataFrames have different column names: any column without a match in the other DataFrame is filled with null where there is no value. However, the resulting DataFrame will then have 7 or more columns, so this command is not the right answer.
union(todayTransactionsDf, yesterdayTransactionsDf)
No, there is no union method in pyspark.sql.functions.
todayTransactionsDf.concat(yesterdayTransactionsDf)
Wrong, the DataFrame class does not have a concat method.
More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.unionByName - PySpark 3.1.2 documentation
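The difference between appending by position and appending by name can be sketched in plain Python. This is only a model of the behaviour described above, not Spark code: the helper names `union_by_position` and `union_by_name`, the two-column sample data, and the `allow_missing` flag are all hypothetical stand-ins.

```python
def union_by_position(left_cols, left_rows, right_rows):
    """Append rows by position; the result keeps the left-hand column names."""
    return list(left_cols), list(left_rows) + list(right_rows)


def union_by_name(left_cols, left_rows, right_cols, right_rows, allow_missing=False):
    """Append rows by column name; unmatched columns fail unless allow_missing is set."""
    unmatched = set(left_cols) ^ set(right_cols)
    if unmatched and not allow_missing:
        raise ValueError(f"columns do not match: {sorted(unmatched)}")
    # The widened result holds every column seen on either side.
    all_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def widen(cols, row):
        values = dict(zip(cols, row))
        return tuple(values.get(c) for c in all_cols)  # None where a column is missing

    rows = [widen(left_cols, r) for r in left_rows]
    rows += [widen(right_cols, r) for r in right_rows]
    return all_cols, rows


# Hypothetical sample data with two columns per side instead of six.
today_cols, today_rows = ["transactionId", "value"], [(1, 10.0)]
yesterday_cols, yesterday_rows = ["id", "amount"], [(2, 7.5)]

cols, rows = union_by_position(today_cols, today_rows, yesterday_rows)
wide_cols, wide_rows = union_by_name(
    today_cols, today_rows, yesterday_cols, yesterday_rows, allow_missing=True
)
print(cols, rows)
print(wide_cols, wide_rows)
```

In this model the positional union keeps two columns, while the by-name union with missing columns allowed widens the result to four, which mirrors why the 6-column requirement rules out option B.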
Question # 6
The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.
Code block:
transactionsDf.withColumn("storeNumber", "storeId")
A. Instead of withColumn, the withColumnRenamed method should be used.
B. Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.
C. Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.
D. The withColumn operator should be replaced with the copyDataFrame operator.
E. Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
Question # 7
Which of the elements in the labeled panels represent the operation performed for broadcast variables?
[Figure: five labeled panels showing communication patterns between the driver and executors]
A. 2, 5
B. 3
C. 2, 3
D. 1, 2
E. 1, 3, 4
Answer: C
Explanation:
2, 3
Correct! Both panels 2 and 3 represent the operation performed for broadcast variables. While a broadcast operation may look like panel 3, with the driver being the bottleneck, it most probably looks like panel 2. This is because the torrent protocol sits behind Spark's broadcast implementation. In the torrent protocol, each executor will try to fetch missing broadcast variables from the driver or other nodes, preventing the driver from being the bottleneck.
1, 2
Wrong. While panel 2 may represent broadcasting, panel 1 shows bi-directional communication, which does not occur in broadcast operations.
3
No. While broadcasting may materialize as shown in panel 3, its use of the torrent protocol also enables communication as shown in panel 2 (see first explanation).
1, 3, 4
No. While panel 3 may show broadcasting, panel 1 shows bi-directional communication, which is not a characteristic of broadcasting. Panel 4 shows uni-directional communication, but in the wrong direction. Panel 4 resembles an accumulator variable more than a broadcast variable.
2, 5
Incorrect. While panel 2 shows broadcasting, panel 5 includes bi-directional communication, which is not a characteristic of broadcasting.
More info: Broadcast Join with Spark - henning.kropponline.de
Question # 8
Which of the following is not a feature of Adaptive Query Execution?
A. Replace a sort merge join with a broadcast join, where appropriate.
B. Coalesce partitions to accelerate data processing.
C. Split skewed partitions into smaller partitions to avoid differences in partition processing time.
D. Reroute a query in case of an executor failure.
E. Collect runtime statistics during query execution.
Answer: D
Explanation:
Reroute a query in case of an executor failure.
Correct. Although this feature exists in Spark, it is not a feature of Adaptive Query Execution. The cluster manager keeps track of executors and will work together with the driver to launch an executor and assign the workload of the failed executor to it (see also link below).
Replace a sort merge join with a broadcast join, where appropriate.
No, this is a feature of Adaptive Query Execution.
Coalesce partitions to accelerate data processing.
Wrong, Adaptive Query Execution does this.
Collect runtime statistics during query execution.
Incorrect, Adaptive Query Execution (AQE) collects these statistics to adjust query plans. This feedback loop is an essential part of accelerating queries via AQE.
Split skewed partitions into smaller partitions to avoid differences in partition processing time.
No, this is indeed a feature of Adaptive Query Execution. Find more information in the Databricks blog post linked below.
More info: Learning Spark, 2nd Edition, Chapter 12; On which way does RDD of spark finish fault-tolerance? - Stack Overflow; How to Speed up SQL Queries with Adaptive Query Execution
Question # 9
The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
Excerpt of DataFrame transactionsDf:
Code block:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))
A. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator so they should be expressed like drop(col(predError), col(productId), col(value)).
B. The select operator should be replaced with the deselect operator.
C. The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).
D. The select operator should be replaced by the drop operator.
E. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value as strings. (Correct)
Question # 10
Which of the following statements about storage levels is incorrect?
A. The cache operator on DataFrames is evaluated like a transformation.
B. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
C. Caching can be undone using the DataFrame.unpersist() operator.
D. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
E. DISK_ONLY will not use the worker node's memory.
Answer: D
Explanation:
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. Spark prioritizes storage in memory, and will only store data on disk that does not fit into memory.
DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, but not in memory.
In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. In fact, Spark does not have a provision to cache DataFrames in the driver (which sits on the edge node in client mode). Spark caches DataFrames in the executors' memory.
Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, as achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using the DataFrame.unpersist() operator. This operator will remove all parts of the DataFrame from the executors' memory and disk.
The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation: through lazy evaluation. This means that after calling DataFrame.cache() the command will not have any effect until you call a subsequent action, like DataFrame.cache().count().
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation
Question # 11
Which of the following statements about reducing out-of-memory errors is incorrect?
A. Concatenating multiple string columns into a single column may guard against out-of-memory errors.
B. Reducing partition size can help against out-of-memory errors.
C. Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
D. Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
E. Decreasing the number of cores available to each executor can help against out-of-memory errors.
Answer: A
Explanation:
Concatenating multiple string columns into a single column may guard against out-of-memory errors.
Exactly, this statement is incorrect! Concatenating string columns does not reduce the size of the data, it just structures it in a different way. This does little to change how Spark processes the data and definitely does not reduce out-of-memory errors.
Reducing partition size can help against out-of-memory errors.
No, this is not incorrect. Reducing partition size is a viable way to guard against out-of-memory errors, since executors need to load partitions into memory before processing them. If the executor does not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful for preventing that.
Decreasing the number of cores available to each executor can help against out-of-memory errors.
No, this is not incorrect. To process a partition, the partition needs to be loaded into the memory of an executor. If every core in every executor processes a partition, potentially in parallel with other executors, memory on the machine hosting the executors fills up quite quickly. So, memory usage of executors is a concern, especially when multiple partitions are processed at the same time. To strike a balance between performance and memory usage, decreasing the number of cores may help against out-of-memory errors.
Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
No, this is not incorrect. When using commands like collect() that trigger the transmission of potentially large amounts of data from the cluster to the driver, the driver may experience out-of-memory errors. One strategy to avoid this is to be careful about using commands like collect() that send large amounts of data back to the driver. Another strategy is setting the parameter spark.driver.maxResultSize. If the data to be transmitted to the driver exceeds the threshold specified by the parameter, Spark will abort the job and therefore prevent an out-of-memory error.
Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
Wrong, this is not incorrect. As part of Spark's internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to executors. This broadcast happens from the driver, so all the broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter.
More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error - Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium
Question # 12
The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.
Code block:
transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")
A. The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.
B. Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
C. Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.
D. partitionOn("storeId") should be called before the write operation.
E. The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet")
More info: partition by - Reading files which are written using PartitionBy or BucketBy in Spark - Stack Overflow
Question # 13
Which of the following is a problem with using accumulators?
A. Only unnamed accumulators can be inspected in the Spark UI.
B. Only numeric values can be used in accumulators.
C. Accumulator values can only be read by the driver, but not by executors.
D. Accumulators do not obey lazy evaluation.
E. Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
Answer: C
Explanation:
Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The canonical use case of an accumulator is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.
Only numeric values can be used in accumulators.
No. While PySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
Accumulators do not obey lazy evaluation.
Incorrect. Accumulators do obey lazy evaluation. This has implications in practice: when an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact that under certain conditions they can run more than once for each task. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before the task has finished, and Spark launches the task on a different worker in response to the failure, the already executed accumulator increases will be repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator - PySpark 3.1.2 documentation, and pyspark.AccumulatorParam - PySpark 3.1.2 documentation
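The over-counting concern can be illustrated with a small plain-Python model. It does not use Spark and does not reproduce Spark's internals: `DriverCounter` and `count_negatives` are hypothetical names, and the sketch simply assumes that the updates of a re-executed task are applied a second time.

```python
class DriverCounter:
    """Stand-in for an accumulator: tasks only add to it, the driver reads the total."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def count_negatives(rows, counter):
    """A 'task' that reports how many negative values it has seen."""
    for row in rows:
        if row < 0:
            counter.add(1)


rows = [-1, 5, -3, 7]  # two negative values
counter = DriverCounter()

count_negatives(rows, counter)  # first execution of the task
count_negatives(rows, counter)  # same task executed again, e.g. after a failure

true_count = sum(1 for row in rows if row < 0)
print(counter.value, true_count)
```

Under this assumption the counter ends up at twice the true count, which is why accumulator values used for debugging should be read as approximate whenever tasks may be re-run.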
Question # 14
Which of the following describes a valid concern about partitioning?
A. A shuffle operation returns 200 partitions if not explicitly set.
B. Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
C. No data is exchanged between executors when coalesce() is run.
D. Short partition processing times are indicative of low skew.
E. The coalesce() method should be used to increase the number of partitions.
Answer: A
Explanation:
A shuffle operation returns 200 partitions if not explicitly set.
Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime if more executors are available than partitions. A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads. Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, then some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only be finished after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not more). So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly. Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short processing time is not per se indicative of low skew: it may simply be short because the partition is small. A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their partitions than others. But the answer does not make any comparison, so by itself it does not provide enough information to make any assessment about skew.
More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation
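The runtime argument for narrow transformations can be checked with a little arithmetic. The sketch below is a plain-Python model rather than Spark code: it assumes that each executor processes one partition at a time, that the work is spread evenly over the partitions, and that scheduling overhead is zero. The function name `narrow_stage_runtime` is hypothetical.

```python
import math


def narrow_stage_runtime(total_work_seconds, num_partitions, num_executors):
    """Modelled runtime of a narrow transformation under the assumptions above."""
    seconds_per_partition = total_work_seconds / num_partitions
    # Executors work in "waves": each wave processes at most num_executors partitions.
    waves = math.ceil(num_partitions / num_executors)
    return waves * seconds_per_partition


# 120 seconds of total work on 8 executors, with a varying number of partitions.
for partitions in (4, 8, 9):
    print(partitions, narrow_stage_runtime(120, partitions, 8))
```

With 8 executors, going from 4 to 8 partitions halves the modelled runtime because the idle executors are put to work, while a ninth partition forces a second wave and makes the stage slower again.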
Question # 15
Which of the following statements about executors is correct?
A. Executors are launched by the driver.
B. Executors stop upon application completion by default.
C. Each node hosts a single executor.
D. Executors store data in memory only.
E. An executor can serve multiple applications.
Answer: B
Explanation:
Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application. A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to the application. It is terminated when the application completes (for the exception, see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation, How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT, and Spark Jargon for Starters. This blog is to clear some of the... | by Mageswaran D | Medium
Question # 16
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(12)
B. DataFrame.coalesce(6).shuffle()
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.repartition(6)
Answer: E
Explanation:
DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (different from coalesce()).
DataFrame.repartition(12)
No, this would just leave the DataFrame with 12 partitions and not 6.
DataFrame.coalesce(6)
No. coalesce() does not perform a full shuffle of the data. Whenever you see "full shuffle", you know that
you are not dealing with coalesce(). While coalesce() can perform a partial shuffle when required,
it tries to minimize shuffle operations, that is, the amount of data sent between executors.
Here, 12 partitions can easily be reduced to 6 partitions simply by stitching every two
partitions into one.
DataFrame.coalesce(6, shuffle=True) and DataFrame.coalesce(6).shuffle()
These statements are not valid Spark API syntax.
More info: Spark Repartition & Coalesce - Explained and Repartition vs Coalesce in Apache Spark -
Rock the JVM Blog
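The difference between "stitching" partitions together and a full shuffle can be sketched with a plain-Python toy model. This is not Spark code and not how Spark implements either operation: partitions are modeled as lists, and the helper names toy_coalesce and toy_repartition are hypothetical. The sketch assumes the target number of partitions is no larger than the current number.

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge neighbouring partitions into n groups; no partition is split up."""
    size = len(partitions)
    merged: List[Partition] = []
    for i in range(n):
        start = i * size // n
        end = (i + 1) * size // n
        # Whole partitions are concatenated, so rows that were together stay together.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Reassign every row individually (round-robin), as a full shuffle would."""
    result: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for index, row in enumerate(rows):
        result[index % n].append(row)
    return result


# 12 partitions holding 2 rows each: [[0, 1], [2, 3], ..., [22, 23]]
twelve = [[2 * i, 2 * i + 1] for i in range(12)]
coalesced = toy_coalesce(twelve, 6)
repartitioned = toy_repartition(twelve, 6)

print(coalesced[0])      # rows of the first two original partitions, kept together
print(repartitioned[0])  # rows drawn from many different original partitions
```

In the toy model, every coalesced partition is simply two original partitions glued together, while every repartitioned partition contains rows from several original partitions. That movement of individual rows is what makes DataFrame.repartition(6) the "full shuffle" answer.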
Question # 17
Which of the following describes Spark actions?
A. Writing data to disk is the primary purpose of actions.
B. Actions are Spark's way of exchanging data between executors.
C. The driver receives data upon request by actions.
D. Stage boundaries are commonly established by actions.
E. Actions are Spark's way of modifying RDDs.
Answer: C
Explanation:
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable: they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
Question # 18
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is
sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId,
respectively?
A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
B. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
C. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")
D. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
E. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
Answer: D
Explanation:
The issue with all answers that have "broadcast" as the very last argument is that "broadcast" is not a
valid join type. While the answer with "right_outer" is a valid statement, it is not a broadcast join. The
answer in which broadcast() is wrapped around the equality condition is not valid Spark code.
broadcast() needs to be wrapped around the small DataFrame that should be broadcast.
More info: Learning Spark, 2nd Edition, Chapter 7
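What "sending the small DataFrame to all executors" achieves can be illustrated with a plain-Python toy model of a broadcast hash join. This is only a sketch, not Spark's implementation: rows are dictionaries, the partitions of the large side are lists, and the function name toy_broadcast_join and the sample data are made up for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def toy_broadcast_join(
    large_partitions: List[List[Row]],
    small_rows: List[Row],
    large_key: str,
    small_key: str,
) -> List[List[Row]]:
    """Inner-join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the small side. In a real cluster, a copy
    # of this table would be shipped to every executor.
    lookup: Dict[object, List[Row]] = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)

    joined: List[List[Row]] = []
    for partition in large_partitions:
        # Each partition is joined locally; rows of the large side never move.
        out: List[Row] = []
        for row in partition:
            for match in lookup.get(row[large_key], []):
                out.append({**row, **match})
        joined.append(out)
    return joined


# Hypothetical sample data: itemsDf is the large side, transactionsDf the small side.
items_partitions = [
    [{"itemId": 1, "itemName": "a"}],
    [{"itemId": 2, "itemName": "b"}, {"itemId": 3, "itemName": "c"}],
]
transactions = [{"storeId": 1, "value": 10}, {"storeId": 3, "value": 7}]

result = toy_broadcast_join(items_partitions, transactions, "itemId", "storeId")
print(result)
```

The point of the design is that only the small side is copied around, while the partitions of the large side are processed where they already are. This is why broadcast() has to wrap the small DataFrame rather than the join condition or the join type.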
Question # 19
Which of the following are valid execution modes?
A. Kubernetes, Local, Client
B. Client, Cluster, Local
C. Server, Standalone, Client
D. Cluster, Server, Local
E. Standalone, Client, Cluster
Answer: B
Explanation:
This is tricky to get right, since it is easy to confuse execution modes and deployment modes. Even in the literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the pieces of infrastructure are located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) on a single computer, which then also includes the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN, Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal
Question # 20
The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively. Find the error.
Code block:
transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")
A. The "outer" argument should be eliminated, since "outer" is the default join type.
B. The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.
C. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.
D. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").
E. The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.
Question # 21
Which of the following is a characteristic of the cluster manager?
A. Each cluster manager works on a single partition of data.
B. The cluster manager receives input from the driver through the SparkContext.
C. The cluster manager does not exist in standalone mode.
D. The cluster manager transforms jobs into DAGs.
E. In client mode, the cluster manager runs on the edge node.
Answer: B
Explanation:
The cluster manager receives input from the driver through the SparkContext.
Correct. In order for the driver to contact the cluster manager, the driver launches a SparkContext. The driver then asks the cluster manager for resources to launch executors.
In client mode, the cluster manager runs on the edge node.
No. In client mode, the cluster manager is independent of the edge node and runs in the cluster.
The cluster manager does not exist in standalone mode.
Wrong, the cluster manager exists even in standalone mode. Remember, standalone mode is an easy means to deploy Spark across a whole cluster, with some limitations. For example, in standalone mode, no other frameworks can run in parallel with Spark. The cluster manager is part of Spark in standalone deployments however and helps launch and maintain resources across the cluster.
The cluster manager transforms jobs into DAGs.
No, transforming jobs into DAGs is the task of the Spark driver.
Each cluster manager works on a single partition of data.
No. Cluster managers do not work on partitions directly. Their job is to coordinate cluster resources so that they can be requested by and allocated to Spark drivers.
More info: Introduction to Core Spark Concepts BigData
Question # 22
Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by
column predError, showing missing values last?
A. transactionsDf.sort(asc_nulls_last("predError"))
B. transactionsDf.orderBy("predError").desc_nulls_last()
C. transactionsDf.sort("predError", ascending=False)
D. transactionsDf.desc_nulls_last("predError")
E. transactionsDf.orderBy("predError").asc_nulls_last()
Answer: C
Explanation:
transactionsDf.sort("predError", ascending=False)
Correct! When using DataFrame.sort() and setting ascending=False, the DataFrame will be sorted by
the specified column in descending order, putting all missing values last. An alternative,
although not listed as an answer here, would be transactionsDf.sort(desc_nulls_last("predError")).
transactionsDf.sort(asc_nulls_last("predError"))
Incorrect. While this is valid syntax, the DataFrame will be sorted on column predError in ascending
order and not in descending order, putting missing values last.
transactionsDf.desc_nulls_last("predError")
Wrong, this is invalid syntax. There is no method DataFrame.desc_nulls_last() in the Spark API. There
is a Spark function desc_nulls_last() however (link see below).
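The two orderings discussed above can be made concrete with a plain-Python sketch. It does not call Spark; it only models the row order the explanation describes, with None standing in for a missing value. The function names and the sample values are hypothetical.

```python
from typing import List, Optional


def sort_desc_nulls_last(values: List[Optional[float]]) -> List[Optional[float]]:
    """Largest value first, missing values (None) at the end."""
    present = sorted((v for v in values if v is not None), reverse=True)
    missing = [v for v in values if v is None]
    return present + missing


def sort_asc_nulls_last(values: List[Optional[float]]) -> List[Optional[float]]:
    """Smallest value first, missing values (None) at the end."""
    present = sorted(v for v in values if v is not None)
    missing = [v for v in values if v is None]
    return present + missing


pred_error = [3.0, None, 6.0, 1.0]
print(sort_desc_nulls_last(pred_error))  # the order the question asks for
print(sort_asc_nulls_last(pred_error))   # nulls last, but ascending
```

Both orderings place the missing value at the end, so "nulls last" alone does not identify the right answer. The direction of the sort is what separates the descending answer from the asc_nulls_last answer.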
Question # 23
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier
has been renamed to manufacturer?
A. itemsDf.withColumn(["supplier", "manufacturer"])
B. itemsDf.withColumn("supplier").alias("manufacturer")
C. itemsDf.withColumnRenamed("supplier", "manufacturer")
D. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
E. itemsDf.withColumnsRenamed("supplier", "manufacturer")
Answer: C
Explanation:
itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming
column supplier to column manufacturer.
Note that the
Question # 24
The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.
Code block:
transactionsDf.agg("storeId").avg("value")
A. Instead of avg("value"), avg(col("value")) should be used.
B. The avg("value") should be specified as a second argument to agg() instead of being appended to it.
C. All column names should be wrapped in col() operators.
D. agg should be replaced by groupBy.
E. "storeId" and "value" should be swapped.
Question # 25
Which of the following statements about the differences between actions and transformations is
correct?
A. Actions are evaluated lazily, while transformations are not evaluated lazily.
B. Actions generate RDDs, while transformations do not.
C. Actions do not send results to the driver, while transformations do.
D. Actions can be queued for delayed execution, while transformations can only be processed immediately.
E. Actions can trigger Adaptive Query Execution, while transformation cannot.
Answer: E
Explanation:
Actions can trigger Adaptive Query Execution, while transformation cannot.
Correct. Adaptive Query Execution optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any runtime information to optimize the query until an action is called. If Adaptive Query Execution is enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating the query.
Actions can be queued for delayed execution, while transformations can only be processed immediately.
No, there is no such concept as "delayed execution" in Spark. Actions cannot be evaluated lazily, meaning that they are executed immediately.
Actions are evaluated lazily, while transformations are not evaluated lazily.
Incorrect, it is the other way around: transformations are evaluated lazily and actions trigger their evaluation.
Actions generate RDDs, while transformations do not.
No. Transformations change the data and, since RDDs are immutable, generate new RDDs along the way. Actions produce outputs in Python data types (integers, lists, text files, ...) based on the RDDs, but they do not generate RDDs. Here is a great tip on how to differentiate actions from transformations: if an operation returns a DataFrame, Dataset, or an RDD, it is a transformation. Otherwise, it is an action.
Actions do not send results to the driver, while transformations do.
No. Actions send results to the driver. Think about running DataFrame.count(). The result of this command will return a number to the driver. Transformations, however, do not send results back to the driver. They produce RDDs that remain on the worker nodes.
More info: What is the difference between a transformation and an action in Apache Spark? | Bartosz Mikulski, How to Speed up SQL Queries with Adaptive Query Execution
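The lazy-versus-eager distinction and the "what does it return?" tip can be sketched with a small plain-Python model. The class ToyDataset below is hypothetical and is not part of Spark: it only mimics the idea that transformations record a plan and return a new dataset, while actions run the plan and return ordinary Python values.

```python
from typing import Callable, List, Optional

Step = Callable[[List[int]], List[int]]


class ToyDataset:
    """A tiny stand-in for an RDD or DataFrame: it stores a plan, not results."""

    def __init__(self, rows: List[int], plan: Optional[List[Step]] = None):
        self._rows = rows
        self._plan: List[Step] = plan or []

    # Transformations: return a new ToyDataset and compute nothing yet.
    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(self._rows, self._plan + [lambda rows: [fn(r) for r in rows]])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(self._rows, self._plan + [lambda rows: [r for r in rows if pred(r)]])

    # Actions: run the recorded plan and return plain Python values.
    def collect(self) -> List[int]:
        rows = list(self._rows)
        for step in self._plan:
            rows = step(rows)
        return rows

    def count(self) -> int:
        return len(self.collect())


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record every time the function is really executed
    return 2 * x


ds = ToyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = ds.count()               # the action triggers the whole plan
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

In this model, map and filter return another ToyDataset, which marks them as transformations. count returns an integer, which marks it as an action. The function double is not called at all until count runs.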
Feedback That Matters: Reviews of Our Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Dumps
Jorge Gill, Jun 30, 2026
The coding-focused questions on MyCertsHub were exactly what I needed. They helped me identify logic errors that I had not realized I was making.
Declan Smith, Jun 29, 2026
I'm new to Spark, but MyCertsHub made RDDs and transformations easier to understand. Passed the cert last week!
Jameson Barrett, Jun 29, 2026
Straightforward and practical prep. I was able to save time and gain confidence before the test thanks to the mini-quizzes.
Preet Amble, Jun 28, 2026
Between classes, I prepared with MyCertsHub. The Spark syntax tips changed everything. Compact and clear.
Samuel Brown, Jun 28, 2026
Great resource. No fluff, just relevant practice that mirrors what you’ll actually see on the exam.