Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Questions

The questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 were last updated on Dec 16, 2024.

Question No. 1

The code block displayed below contains an error. The code block should use the Python method find_most_freq_letter to find the letter that occurs most frequently in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.

Code block:

find_most_freq_letter_udf = udf(find_most_freq_letter)

itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))

Correct Answer: A

Correct code block:

find_most_freq_letter_udf = udf(find_most_freq_letter)

itemsDf.withColumn('most_frequent_letter', find_most_freq_letter_udf('itemName'))

Spark should use the previously registered find_most_freq_letter_udf method here, but the original code block does not: it calls the plain Python method instead of the UDF.

Note that we would typically have to specify a return type for udf(). In this case we do not need to, because the default return type of udf() is a string, which is what we expect here. If we wanted to return an integer instead, we would have to register the Python function as a UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
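For reference, a minimal end-to-end sketch of the corrected pattern, assuming an existing itemsDf with a string column itemName; the body of find_most_freq_letter below is hypothetical, since the question only names the method:

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Hypothetical helper: return the character occurring most often in a string.
def find_most_freq_letter(s):
    return Counter(s).most_common(1)[0][0] if s else None

# The default return type of udf() is a string, so no type argument is needed here.
find_most_freq_letter_udf = udf(find_most_freq_letter)

itemsDf = itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))

# An integer-returning function would need an explicit return type, e.g.:
# some_udf = udf(some_int_returning_fn, IntegerType())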

More info: pyspark.sql.functions.udf (PySpark 3.1.1 documentation)


Question No. 2

The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__)

Correct Answer: C

Correct code block:

transactionsDf.select(['transactionId', 'predError', 'value', 'f'])

DataFrame.select() returns specific columns from the DataFrame and accepts a list of column names as its argument, so it is the correct choice here. The option using col(['transactionId', 'predError', 'value', 'f']) is invalid, since col() accepts only a single column name, not a list. Likewise, specifying all columns in a single string such as 'transactionId, predError, value, f' is not valid syntax. filter and where filter rows based on conditions; they do not control which columns are returned.
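To make the distinction concrete, here are a few calls that are valid (and one that is not), assuming transactionsDf has these columns:

from pyspark.sql.functions import col

# Equivalent ways to select the four columns:
transactionsDf.select(['transactionId', 'predError', 'value', 'f'])
transactionsDf.select('transactionId', 'predError', 'value', 'f')
transactionsDf.select(col('transactionId'), col('predError'), col('value'), col('f'))

# Invalid: col() takes a single column name, not a list.
# transactionsDf.select(col(['transactionId', 'predError', 'value', 'f']))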



Question No. 3

Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?

Correct Answer: E

The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command calculates various statistics (see the documentation linked below), including the standard deviation and the minimum. Note that the answer that lists many statistics inside the summary() parentheses does not include the minimum, which the question asks for. Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
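A short sketch of both usages, assuming transactionsDf exists:

# With no arguments, summary() computes count, mean, stddev, min,
# approximate 25%/50%/75% percentiles, and max for each column.
transactionsDf.summary().show()

# Statistics can also be requested explicitly; note that leaving out
# "min" here would not satisfy the question.
transactionsDf.summary("stddev", "min").show()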

More info:

- pyspark.sql.DataFrame.summary (PySpark 3.1.2 documentation)

- pyspark.sql.DataFrame.agg (PySpark 3.1.2 documentation)



Question No. 4

Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors, where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?

Correct Answer: D

The issue with all answers that have 'broadcast' as the very last argument is that 'broadcast' is not a valid join type. The option using 'right_outer' is a valid statement, but it is not a broadcast join. The option that wraps broadcast() around the equality condition is not valid Spark code: broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
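A sketch of the pattern the explanation describes, using the DataFrames and columns from the question; the exact join condition is inferred from the storeId/itemId pairing:

from pyspark.sql.functions import broadcast

# broadcast() wraps the small DataFrame that should be sent to all
# executors, not the join condition and not the join-type argument.
itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)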

More info: Learning Spark, 2nd Edition, Chapter 7




Question No. 5

The code block displayed below contains an error. The code block should return the average of the values in column value, grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

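The answer explanation is cut off in this extract. The error, however, follows from the question itself: agg("storeId") does not group rows by storeId. A likely intended correction uses groupBy before averaging:

transactionsDf.groupBy("storeId").avg("value")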