You are using a Python notebook in an Apache Spark pool in Azure Synapse Analytics.
You need to present the data distribution statistics from a DataFrame in a tabular view.
Which method should you invoke on the DataFrame?
Use the describe method. It calculates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for multiple columns at the same time and returns them as a table.
Example:
titanic[['Age', 'Fare']].describe()
Out[6]:
              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200
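The example above relies on the titanic dataset, which is not included here. The same call can be tried with a minimal sketch on a small, made-up pandas DataFrame (the passengers name and its values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical sample data; NaN marks a missing age.
passengers = pd.DataFrame({
    "Age": [22.0, 38.0, float("nan"), 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# describe() returns a DataFrame: one row per statistic, one column per input column.
stats = passengers[["Age", "Fare"]].describe()
print(stats)
```

Missing values are excluded from the count, which is why Age and Fare can report different counts, as in the titanic output. A Spark DataFrame in a Synapse notebook also exposes a describe method; it returns another DataFrame, which would typically be displayed with show().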
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are using an Azure Synapse Analytics serverless SQL pool to query a collection of Apache Parquet files by using automatic schema inference. The files contain more than 40 million rows of UTF-8-encoded business names, survey names, and participant counts. The database is configured to use the default collation.
The queries use OPENROWSET and infer the schema shown in the following table.
You need to recommend changes to the queries to reduce I/O reads and tempdb usage.
Solution: You recommend defining a data source and view for the Parquet files. You recommend updating the query to use the view.
Does this meet the goal?
Solution: You recommend using OPENROWSET WITH to explicitly specify the maximum length for businessName and surveyName.
The inferred varchar(8000) columns are far larger than the data requires. Specify shorter maximum lengths explicitly.
A SELECT...FROM OPENROWSET(BULK...) statement queries the data in a file directly, without importing the data into a table. SELECT...FROM OPENROWSET(BULK...) statements can also list bulk-column aliases by using a format file to specify column names, and also data types.
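As a rough illustration of the WITH clause recommendation, the sketch below builds such a query as a Python string. The storage path, the participantCount column name, the varchar(200) lengths, and the explicit UTF-8 collation are all assumptions made for this example, not values taken from the question:

```python
# Hypothetical storage path and column definitions; adjust to the real data.
BULK_PATH = "https://<storage-account>.dfs.core.windows.net/<container>/surveys/*.parquet"
UTF8_COLLATION = "Latin1_General_100_BIN2_UTF8"

COLUMNS = [
    ("businessName", f"varchar(200) COLLATE {UTF8_COLLATION}"),
    ("surveyName", f"varchar(200) COLLATE {UTF8_COLLATION}"),
    ("participantCount", "int"),
]


def build_openrowset_query(bulk_path, columns):
    """Build a T-SQL query whose WITH clause declares each column's type."""
    select_list = ", ".join(name for name, _ in columns)
    with_clause = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"SELECT {select_list}\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{bulk_path}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") WITH (\n"
        f"{with_clause}\n"
        ") AS survey_rows;"
    )


query = build_openrowset_query(BULK_PATH, COLUMNS)
print(query)
```

Declaring the string columns with a realistic length replaces the inferred varchar(8000) type, which is the change the solution above relies on.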
The enterprise analytics team needs to resolve the DAX measure performance issues.
What should the team do first?
You have a Power BI workspace named workspace1 that contains three reports and two dataflows.
You have an Azure Data Lake Storage account named storage1.
You need to integrate workspace1 and storage1.
What should you do first?
You are using an Azure Synapse Analytics serverless SQL pool to query network traffic logs in the Apache Parquet format. A sample of the data is shown in the following table.
You need to create a Transact-SQL query that will return the source IP address.
Which function should you use in the select statement to retrieve the source IP address?