At ValidExamDumps, we consistently monitor updates to the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam questions by Databricks. Whenever our team identifies changes in the exam questions, exam objectives, exam focus areas or exam requirements, we immediately update our exam questions for both the PDF and online practice exams. This commitment ensures our customers always have access to the most current and accurate questions. By preparing with these actual questions, our customers can successfully pass the Databricks Certified Associate Developer for Apache Spark 3.0 exam on their first attempt without needing additional materials or study guides.
Other certification materials providers often include outdated questions, or questions that Databricks has removed, in their Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam materials. These outdated questions lead to customers failing their Databricks Certified Associate Developer for Apache Spark 3.0 exam. In contrast, we ensure our question bank includes only precise and up-to-date questions, guaranteeing their presence in your actual exam. Our main priority is your success in the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam, not profiting from selling obsolete exam questions in PDF or online practice tests.
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?
distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.
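For illustration, here is a minimal sketch of that pattern. The correct option is not quoted above, so the exact chain below is an assumption, and the sample rows are made up; the point is only that distinct() is called on a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows with duplicate storeId values, just for demonstration
transactionsDf = spark.createDataFrame(
    [(1, 25), (2, 2), (3, 25)], ["transactionId", "storeId"]
)

# select() narrows the DataFrame to storeId; distinct() then drops duplicate rows
transactionsDf.select("storeId").distinct().show()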
More info: pyspark.sql.DataFrame.distinct --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, question 19 (Databricks import instructions)
The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__
Correct code block:
transactionsDf.select('storeId').printSchema()
The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.
A first pattern that you may recognize by now is that, in frequently wrong answers, column names are not wrapped in quotes. For this reason, the answer that includes the unquoted storeId should be eliminated.
By now, you may have understood that DataFrame.limit() is useful for returning a specified number of rows. It has nothing to do with specific columns. For this reason, the answer that resolves to limit('storeId') can be eliminated.
Given that we are interested in information about the data type, you should question whether the answer that resolves to limit(1).columns provides you with this information. While DataFrame.columns is a valid attribute, it only reports back column names, not column types. So, you can eliminate this option.
The two remaining options use either the printSchema() or the print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select('storeId') part just returns the storeId column of transactionsDf, which works here, since we are only interested in that column's type anyway.
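To make this concrete, the correct code block and its output look as follows; the output assumes storeId is an integer column, as in the transactionsDf schema shown further down this page:

transactionsDf.select('storeId').printSchema()
# root
#  |-- storeId: integer (nullable = true)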
More info: pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, question 57 (Databricks import instructions)
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?
transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions; see the short sketch after the answer explanations below.
transactionsDf.coalesce(10)
No, after this command transactionsDf will still have only 8 partitions. This is because coalesce() can only decrease the number of partitions, not increase it.
transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.
transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.
transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.
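For reference, a short sketch of how the correct option behaves next to coalesce(), assuming the 8 partitions stated in the question:

# Current number of partitions, read from the underlying RDD
print(transactionsDf.rdd.getNumPartitions())                # 8

# repartition() performs a full shuffle and can increase the partition count
repartitionedDf = transactionsDf.repartition(transactionsDf.rdd.getNumPartitions() + 2)
print(repartitionedDf.rdd.getNumPartitions())               # 10

# coalesce() avoids a full shuffle but can only reduce the partition count
print(transactionsDf.coalesce(10).rdd.getNumPartitions())   # still 8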
More info: pyspark.sql.DataFrame.repartition --- PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, question 23 (Databricks import instructions)
Which of the following code blocks returns a new DataFrame with only columns predError and values of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
Output of correct code block:
+---------+-----+
|predError|value|
+---------+-----+
|        6|    7|
|     null| null|
|        3|    2|
+---------+-----+
This is not an easy question to solve. You need to know that % stands for the modulo operator in Python, and that a filter based on % 2 picks out every second row. The statement using spark.sql gets it almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since modulo 2 is either 0 or 1.
Other answers are wrong since they are missing quotes around the column names and/or use filter or select incorrectly.
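The correct option itself is not quoted above, but a sketch of the pattern the explanation describes, assuming the filter keys on the transactionId column, would be:

from pyspark.sql.functions import col

# Keep every second row (even transactionId) and return only predError and value
transactionsDf.filter(col('transactionId') % 2 == 0).select('predError', 'value').show()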
If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.
Static notebook | Dynamic notebook: See test 1, question 53 (Databricks import instructions)
Which of the following code blocks produces the following output, given DataFrame transactionsDf?
Output:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (the available methods are listed in the RDD documentation linked below). The output of print(transactionsDf.schema) is this: StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructField(f,IntegerType,true))). It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
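To make the comparison concrete, a brief sketch of the three approaches discussed above:

# Correct: prints the nicely formatted schema tree shown in the output above
transactionsDf.printSchema()

# Prints the raw StructType representation instead of the formatted tree
print(transactionsDf.schema)

# Invalid: the schema attribute (a StructType) has no print() method
# transactionsDf.schema.print()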
More info:
- pyspark.RDD: pyspark.RDD --- PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, question 52 (Databricks import instructions)