Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.
transactionsDf.cache()
This is wrong because the default storage level of DataFrame.cache() is MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.
transactionsDf.persist()
This is wrong because the default storage level of DataFrame.persist() is MEMORY_AND_DISK.
transactionsDf.clear_persist()
Incorrect, since clear_persist() is not a method of DataFrame.
transactionsDf.storage_level('MEMORY_ONLY')
Wrong. storage_level is not a method of DataFrame.
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the
code block to accomplish this.
transactionsDf.__1__(__2__)
Correct code block:
transactionsDf.select(['transactionId', 'predError', 'value', 'f'])
DataFrame.select() returns specific columns from the DataFrame and accepts a list of column names as an argument. Thus, this is the correct choice here. The option using col(['transactionId', 'predError',
'value', 'f']) is invalid, since col() accepts only a single column name, not a list. Likewise, specifying all columns in a single string like 'transactionId, predError, value, f' is not valid
syntax.
filter and where filter rows based on conditions; they do not control which columns are returned.
Static notebook | Dynamic notebook: See test 2, Question: 49 (Databricks import instructions)
The code block shown below should read all files with the file ending .png in directory path into Spark. Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)
Correct code block:
spark.read.format('binaryFile').option('pathGlobFilter', '*.png').load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you
need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator; the open
operator shown in one of the answers does not exist.
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?
Caching means storing a copy of a partition on an executor, so it can be accessed more quickly by subsequent operations instead of having to be recalculated. cache() is a lazily evaluated method of the
DataFrame. Since count() is an action (while filter() is not), it triggers the caching process.
More info: pyspark.sql.DataFrame.cache --- PySpark 3.1.2 documentation, Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 2, Question: 20 (Databricks import instructions)
Which of the following DataFrame methods is classified as a transformation?
DataFrame.select()
Correct, DataFrame.select() is a transformation. It is evaluated lazily and only executed when triggered by an action; rather than producing data immediately, it returns a new DataFrame.
DataFrame.foreach()
Incorrect, DataFrame.foreach() is not a transformation, but an action. The intention of foreach() is to apply code to each element of a DataFrame, for example to update accumulator variables or write the
elements to external storage. The process does not return a new DataFrame; it is an action!
DataFrame.first()
Wrong. As an action, DataFrame.first() executes immediately and returns the first row of a DataFrame.
DataFrame.count()
Incorrect. DataFrame.count() is an action and returns the number of rows in a DataFrame.
DataFrame.show()
No, DataFrame.show() is an action and displays the DataFrame upon execution of the command.