Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Actual Questions

The questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 were last updated on Nov 6, 2024

Question No. 1

Which of the following describes properties of a shuffle?

Correct Answer: E

In a shuffle, Spark writes data to disk.

Correct! Spark's architecture dictates that intermediate results during a shuffle are written to disk.

A shuffle is one of many actions in Spark.

Incorrect. A shuffle is a transformation, but not an action.

Shuffles involve only single partitions.

No, shuffles involve multiple partitions. During a shuffle, Spark generates output partitions from multiple input partitions.

Operations involving shuffles are never evaluated lazily.

Wrong. A shuffle is a costly operation, but Spark still evaluates it as lazily as other transformations. That is, it only runs once a subsequent action triggers its evaluation, as the sketch at the end of this question's explanation illustrates.

Shuffles belong to a class known as 'full transformations'.

Not quite. Shuffles belong to a class known as 'wide transformations'. 'Full transformation' is not a relevant term in Spark.

More info: Spark -- The Definitive Guide, Chapter 2 and Spark: disk I/O on stage boundaries explanation - Stack Overflow
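
To illustrate the points above, here is a minimal sketch (the DataFrame and its column names are made up for this example): a wide transformation such as groupBy() requires a shuffle, yet nothing executes until an action is called.

Example code block (illustrative sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up example data; the column names are illustrative only.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# groupBy/sum is a wide transformation: it requires a shuffle across
# partitions, but it is still evaluated lazily; Spark only records the plan.
grouped = df.groupBy("key").sum("value")

# Only this action triggers execution; during the shuffle, Spark writes
# intermediate data to disk on the executors.
grouped.show()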


Question No. 2

Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?

Correct Answer: A

spark.udf.register('LIMIT_FCN', to_limit)

spark.sql('SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf')

Correct! First, you have to register to_limit as a UDF to be able to use it in a SQL statement. Then, you can call it under the name LIMIT_FCN, correctly aliasing the resulting column as result. A runnable sketch of this approach follows the answer explanations below.

spark.udf.register(to_limit, 'LIMIT_FCN')

spark.sql('SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf')

No. In this answer, the arguments to spark.udf.register are flipped.

spark.udf.register('LIMIT_FCN', to_limit)

spark.sql('SELECT transactionId, to_limit(predError) AS result FROM transactionsDf')

Wrong. This answer does not use the registered LIMIT_FCN in the SQL statement, but tries to call the Python function to_limit directly. This fails, since Spark SQL cannot access a Python function that has not been registered as a UDF.

spark.sql('SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf')

Incorrect. There is no udf function in Spark SQL.

spark.udf.register('LIMIT_FCN', to_limit)

spark.sql('SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result')

False. In this answer, the column that results from applying the UDF is not renamed to result: the AS result clause aliases the table transactionsDf rather than the output column.
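
For reference, the correct option can be exercised end to end with the sketch below. The body of to_limit and the contents of transactionsDf are assumptions made purely for illustration, since the question does not define them.

Example code block (illustrative sketch):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Assumed stand-in for table transactionsDf; the real schema may differ.
transactionsDf = spark.createDataFrame(
    [(1, 0.5), (2, 3.7), (3, 2.1)], ["transactionId", "predError"]
)
transactionsDf.createOrReplaceTempView("transactionsDf")

# Assumed implementation of to_limit, purely for illustration.
def to_limit(x):
    return min(x, 1.0)

# Register the Python function under the SQL name LIMIT_FCN (the optional
# third argument sets the return type; the default is string).
spark.udf.register("LIMIT_FCN", to_limit, DoubleType())

# Call the UDF by its registered name in a SQL statement.
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf").show()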



Question No. 3

The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

Correct Answer: C

Correct code block:

transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')

For solving this question, it is important that you know the lit() function (link to documentation below). This function enables you to add a column of a constant value to a DataFrame.
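
As a quick sketch of how the correct code block behaves (the sample data below is an assumption; the real transactionsDf has more columns):

Example code block (illustrative sketch):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Assumed stand-in for transactionsDf.
transactionsDf = spark.createDataFrame(
    [(1, 7.3, 100), (2, 2.1, 200)], ["transactionId", "value", "productId"]
)

# lit(5) produces a column holding the constant 5; withColumn() adds it,
# and drop() removes the two unwanted columns.
result = transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value")
result.show()  # remaining columns: transactionId, associateId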

More info: pyspark.sql.functions.lit --- PySpark 3.1.1 documentation



Question No. 4

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.

Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Instead of calling spark.createDataFrame, just DataFrame should be called.

Correct Answer: D

Correct code block:

spark.createDataFrame([('red',), ('blue',), ('green',)], ['color'])

The createDataFrame syntax is not exactly straightforward, but luckily the documentation (linked below) provides several examples of how to use it. It also shows an example very similar to the code block presented here, which should help you answer this question correctly.
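
A short sketch of the corrected call (a minimal example built from the code shown above):

Example code block (illustrative sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A list of one-element tuples plus a list of column names works as intended:
df = spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])
df.show()

# Passing the bare string "color" as the schema, as in the original code
# block, does not name the column: Spark treats a string schema as a DDL
# schema definition, which cannot be parsed here.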

More info: pyspark.sql.SparkSession.createDataFrame --- PySpark 3.1.2 documentation



Question No. 5

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

Correct Answer: B

itemsDf.sample(fraction=0.1, seed=87238)

Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact number of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed as the seed does not matter, as long as it is an integer. A short sketch contrasting sample() and sampleBy() follows the answer explanations below.

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.

Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). Now, if you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, meaning that the next time you take a ball from the bucket there would be a chance you could take the exact same ball again. If you took the balls without replacement, you would leave the ball outside the bucket and not put it back in as you take the next 999 balls.

itemsDf.sample(fraction=1000, seed=98263)

Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.

itemsDf.sampleBy('row', fractions={0: 0.1}, seed=82371)

No, DataFrame.sampleBy() is meant for stratified sampling. This means that, based on the values in a column of a DataFrame, you can draw a certain fraction of rows containing those values from the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should depend on.

itemsDf.sample(fraction=0.1)

Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to specify a seed.
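
The two sampling operators can be contrasted with the sketch below; itemsDf here is a made-up 10,000-row stand-in, and the bucket column and its fractions are illustrative only.

Example code block (illustrative sketch):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Assumed 10,000-row stand-in for itemsDf.
itemsDf = spark.range(10000).withColumnRenamed("id", "itemId")

# Roughly 1,000 rows, no duplicates (withReplacement defaults to False),
# and the fixed seed returns the same rows on every run.
sampled = itemsDf.sample(fraction=0.1, seed=87238)
print(sampled.count())  # approximately 1000

# sampleBy() instead performs stratified sampling: it draws a separate
# fraction of rows per value of the given column, here 10% of even
# itemIds and 50% of odd itemIds.
stratified = itemsDf.withColumn("bucket", col("itemId") % 2).sampleBy(
    "bucket", fractions={0: 0.1, 1: 0.5}, seed=82371
)
stratified.groupBy("bucket").count().show()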

More info:

- pyspark.sql.DataFrame.sample --- PySpark 3.1.2 documentation

- pyspark.sql.DataFrame.sampleBy --- PySpark 3.1.2 documentation

- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science