Which of the following describes properties of a shuffle?
In a shuffle, Spark writes data to disk.
Correct! Spark's architecture dictates that intermediate results during a shuffle are written to disk.
A shuffle is one of many actions in Spark.
Incorrect. A shuffle is a transformation, not an action.
Shuffles involve only single partitions.
No, shuffles involve multiple partitions. During a shuffle, Spark generates output partitions from multiple input partitions.
Operations involving shuffles are never evaluated lazily.
Wrong. A shuffle is a costly operation, but Spark evaluates it just as lazily as any other transformation. That is, it is only executed once a subsequent action triggers its evaluation.
Shuffles belong to a class known as 'full transformations'.
Not quite. Shuffles belong to a class known as 'wide transformations'. 'Full transformation' is not a relevant term in Spark.
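As an illustration, here is a minimal sketch (using a made-up DataFrame) of a wide transformation that requires a shuffle. The groupBy/count below is only recorded in the query plan when it is defined; the shuffle, including its disk writes at the stage boundary, only happens once the action show() runs:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame, used only for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b"), (1, "c")], ["key", "value"])

# groupBy().count() is a wide transformation: it needs a shuffle, but it is
# evaluated lazily -- Spark only adds it to the query plan at this point.
counts = df.groupBy("key").count()

# The action triggers the job; the shuffle's intermediate results are
# written to disk at the stage boundary.
counts.show()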
More info: Spark -- The Definitive Guide, Chapter 2 and Spark: disk I/O on stage boundaries explanation - Stack Overflow
Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?
spark.udf.register('LIMIT_FCN', to_limit)
spark.sql('SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf')
Correct! First, you have to register to_limit as a UDF to be able to use it in a SQL statement. Then, you can call it under the name LIMIT_FCN, correctly aliasing the resulting column as result.
spark.udf.register(to_limit, 'LIMIT_FCN')
spark.sql('SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf')
No. In this answer, the arguments to spark.udf.register are flipped: the name string has to come first, followed by the Python function.
spark.udf.register('LIMIT_FCN', to_limit)
spark.sql('SELECT transactionId, to_limit(predError) AS result FROM transactionsDf')
Wrong. This answer does not use the registered LIMIT_FCN in the SQL statement, but tries to call the Python function to_limit directly. This fails, since Spark SQL cannot resolve an unregistered Python function.
spark.sql('SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf')
Incorrect, there is no udf function available in Spark SQL.
spark.udf.register('LIMIT_FCN', to_limit)
spark.sql('SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result')
False. In this answer, the AS result alias is attached to the table transactionsDf rather than to the column, so the column produced by the UDF is not renamed to result.
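To make the full flow concrete, here is a minimal, self-contained sketch. The to_limit implementation and the sample data are made up, since the question does not define them; only the register-then-query pattern is the point:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for to_limit -- the question does not define it.
def to_limit(x):
    return min(x, 3.0) if x is not None else None

# Hypothetical data so the SQL statement below can run.
spark.createDataFrame(
    [(1, 5.0), (2, 2.0)], ["transactionId", "predError"]
).createOrReplaceTempView("transactionsDf")

# Register the Python function under the name LIMIT_FCN so Spark SQL can resolve it.
# (Without an explicit return type, the UDF's return type defaults to string.)
spark.udf.register("LIMIT_FCN", to_limit)

# Apply the UDF and alias the output column as result.
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf").show()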
Static notebook | Dynamic notebook: See test 3, Question 52 (Databricks import instructions)
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the
answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')
Correct code block:
transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')
To solve this question, it is important that you know the lit() function (link to documentation below). It enables you to add a column with a constant value to a DataFrame; drop() then removes the two unwanted columns value and productId.
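A minimal, self-contained sketch of the filled-in code block, with a made-up transactionsDf and the import that lit() requires:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical transactionsDf containing the columns mentioned in the question.
transactionsDf = spark.createDataFrame(
    [(1, 10.0, 3), (2, 20.0, 4)], ["transactionId", "value", "productId"]
)

# withColumn adds the constant column associateId; drop removes value and productId.
transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value").show()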
More info: pyspark.sql.functions.lit --- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, Question 57 (Databricks import instructions)
The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.
Find the error.
Code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], "color")
Instead of calling spark.createDataFrame, just DataFrame should be called.
Incorrect. spark.createDataFrame is the right method to call here. The actual error is the schema argument: the column name needs to be passed as a list, ["color"], rather than as the plain string "color", as shown in the correct code block below.
Correct code block:
spark.createDataFrame([('red',), ('blue',), ('green',)], ['color'])
The createDataFrame syntax is not exactly straightforward, but luckily the documentation (linked below) provides several examples of how to use it. It also shows an example very similar to the code block presented here, which should help you answer this question correctly.
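For reference, a minimal sketch of the corrected call; note that the column name is passed as a one-element list, and each one-element tuple becomes one row:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ['color'] names the single column; the tuples supply the three rows.
df = spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])
df.show()  # shows three rows: red, blue, green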
More info: pyspark.sql.SparkSession.createDataFrame --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question 23 (Databricks import instructions)
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code
block is run twice?
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact number of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed as the seed does not matter, as long as it is an integer.
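As a sketch (with a hypothetical 10,000-row stand-in for itemsDf), you can verify both the approximate sample size and the reproducibility that the fixed seed provides:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical 10,000-row DataFrame standing in for itemsDf.
itemsDf = spark.range(10000).withColumnRenamed("id", "itemId")

sample1 = itemsDf.sample(fraction=0.1, seed=87238)
sample2 = itemsDf.sample(fraction=0.1, seed=87238)

print(sample1.count())                     # roughly 1,000 rows
print(sample1.exceptAll(sample2).count())  # 0 -- the same seed returns the same rows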
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.
Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). If you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, meaning the next time you draw from the bucket there is a chance you take the exact same ball again. If you took the balls without replacement, you would leave each ball outside the bucket and not put it back in as you take the next 999 balls.
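A small sketch (again with a hypothetical stand-in for itemsDf) showing why withReplacement=True conflicts with the no-duplicates requirement:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical 10,000-row DataFrame standing in for itemsDf.
itemsDf = spark.range(10000).withColumnRenamed("id", "itemId")

with_dupes = itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

# With replacement, the same row can be drawn more than once, so this
# difference is typically greater than zero.
print(with_dupes.count() - with_dupes.dropDuplicates().count())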
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
itemsDf.sampleBy('row', fractions={0: 0.1}, seed=82371)
No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from
the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should
depend on.
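For contrast, here is a minimal sketch of what sampleBy is actually for, namely stratified sampling by the values of a column (the column name and fractions below are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a 'category' column to stratify on.
df = spark.createDataFrame(
    [(i, "a" if i % 2 == 0 else "b") for i in range(1000)], ["id", "category"]
)

# Draw about 10% of the rows with category 'a' and about 30% of those with category 'b'.
stratified = df.sampleBy("category", fractions={"a": 0.1, "b": 0.3}, seed=82371)
stratified.groupBy("category").count().show()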
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to
specify a seed.
More info:
- pyspark.sql.DataFrame.sample --- PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy --- PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science