The code block displayed below contains an error. The code block should use the Python method find_most_freq_letter to find the letter that occurs most frequently in column itemName of DataFrame itemsDf and
return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))
Correct code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn('most_frequent_letter', find_most_freq_letter_udf('itemName'))
Spark should use the previously registered find_most_freq_letter_udf method here, but the original code block does not do that: it simply calls the plain, non-UDF version of the Python method.
Note that typically we would have to specify a return type for udf(). We do not need to in this case, since the default return type of udf() is a string, which is what we expect here. If we wanted to return
an integer instead, we would have to register the Python function as a UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
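For illustration, here is a minimal, runnable sketch of the corrected pattern. The body of find_most_freq_letter is an assumption (the question does not show it), and itemsDf is taken as already defined:

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def find_most_freq_letter(name):
    # Assumed implementation: return the character that occurs most often in the string.
    return Counter(name).most_common(1)[0][0] if name else None

find_most_freq_letter_udf = udf(find_most_freq_letter)  # default return type is StringType

itemsDf = itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))

# If the method returned an integer instead, the return type would have to be declared:
# find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType())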
More info: pyspark.sql.functions.udf --- PySpark 3.1.1 documentation
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the
code block to accomplish this.
transactionsDf.__1__(__2__)
Correct code block:
transactionsDf.select(['transactionId', 'predError', 'value', 'f'])
DataFrame.select() returns the specified columns from the DataFrame and accepts a list of column names as an argument, so this is the correct choice here. The option using col(['transactionId', 'predError',
'value', 'f']) is invalid, since col() accepts only a single column name, not a list. Likewise, specifying all columns in a single string like 'transactionId, predError, value, f' is not valid
syntax.
filter and where filter rows based on conditions; they do not control which columns are returned.
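As a quick illustration (assuming transactionsDf is already defined), select() accepts the columns either as a list, as separate name arguments, or as Column objects:

from pyspark.sql.functions import col

transactionsDf.select(['transactionId', 'predError', 'value', 'f'])
transactionsDf.select('transactionId', 'predError', 'value', 'f')
transactionsDf.select(col('transactionId'), col('predError'), col('value'), col('f'))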
Static notebook | Dynamic notebook: See test 2, Question: 49 (Databricks import instructions)
Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?
The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command
calculates various statistics (see the documentation linked below), including the standard deviation and the minimum. Note that the answer that lists many options in the summary() parentheses does not
include the minimum, which the question asks for.
Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
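As a brief sketch, assuming transactionsDf is already defined:

# Default statistics: count, mean, stddev, min, the 25%/50%/75% approximate percentiles, and max.
transactionsDf.summary().show()

# Specific statistics can also be requested explicitly:
transactionsDf.summary("stddev", "min").show()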
More info:
- pyspark.sql.DataFrame.summary --- PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.agg --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question: 46 (Databricks import instructions)
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId,
respectively?
The issue with all answers that have 'broadcast' as the very last argument is that 'broadcast' is not a valid join type. While the option using 'right_outer' is a valid statement, it does not perform a broadcast join. The
option where broadcast() is wrapped around the equality condition is not valid Spark code; broadcast() needs to be wrapped around the small DataFrame that should be broadcast.
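A hedged sketch of the join described in the question, assuming both DataFrames are already defined: transactionsDf is the small DataFrame here, so it is the one wrapped in broadcast(), and the join condition pairs storeId with itemId.

from pyspark.sql.functions import broadcast

itemsDf.join(broadcast(transactionsDf), transactionsDf.storeId == itemsDf.itemId)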
More info: Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 1, Question: 34 (Databricks import instructions)
The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.
Code block:
transactionsDf.agg("storeId").avg("value")