Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Actual Questions

The questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 were last updated on Jan 19, 2025

Question No. 1

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

Correct Answer: F

from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.

transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level('MEMORY_ONLY')

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide - Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist --- PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)


Question No. 2

The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__)

Correct Answer: C

Correct code block:

transactionsDf.select(['transactionId', 'predError', 'value', 'f'])

DataFrame.select() returns specific columns from the DataFrame and accepts a list as its only argument, so this is the correct choice here. The option using col(['transactionId', 'predError', 'value', 'f']) is invalid, since col() accepts only a single column name, not a list. Likewise, specifying all columns in a single string like 'transactionId, predError, value, f' is not valid syntax.

filter() and where() filter rows based on conditions; they do not control which columns are returned.



Question No. 3

The code block shown below should read all files with the file ending .png in directory path into Spark. Choose the answer that correctly fills the blanks in the code block to accomplish this.

spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)

Correct Answer: B

Correct code block:

spark.read.format('binaryFile').option('pathGlobFilter', '*.png').load(path)

Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator; the open operator shown in one of the answers does not exist.


Question No. 4

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

Correct Answer: A

Caching means storing a copy of a partition on an executor, so it can be accessed quicker by subsequent operations instead of having to be recalculated. cache() is a lazily evaluated method of the DataFrame. Since count() is an action (while filter() is not), it triggers the caching process.

More info: pyspark.sql.DataFrame.cache --- PySpark 3.1.2 documentation, Learning Spark, 2nd Edition, Chapter 7



Question No. 5

Which of the following DataFrame methods is classified as a transformation?

Correct Answer: C

DataFrame.select()

Correct, DataFrame.select() is a transformation. It is evaluated lazily and immediately returns a new DataFrame; the actual computation only runs once an action is triggered.

DataFrame.foreach()

Incorrect, DataFrame.foreach() is not a transformation, but an action. The intention of foreach() is to apply code to each element of a DataFrame, for example to update accumulator variables or write the elements to external storage. It does not return a new DataFrame - it is an action!

DataFrame.first()

Wrong. As an action, DataFrame.first() executes immediately and returns the first row of a DataFrame.

DataFrame.count()

Incorrect. DataFrame.count() is an action and returns the number of rows in a DataFrame.

DataFrame.show()

No, DataFrame.show() is an action and displays the DataFrame upon execution of the command.