Problem Scenario 79 : You have been given MySQL DB with following details.
Columns of products table : (product_id | product categoryid | product_name | product_description | product_prtce | product_image )
Please accomplish following activities.
1. Copy "retaildb.products" table to hdfs in a directory p93_products
2. Filter out all the empty prices
3. Sort all the products based on price in both ascending as well as descending order.
4. Sort all the products based on price as well as product_id in descending order.
5. Use the below functions to do data ordering or ranking and fetch top 10 elements top()
A Solution :
Step 1 : Import Single table .
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db -username=retail_dba -password=cloudera -table=products -target-dir=p93_products -m 1
Note : Please check you dont have space between before or after '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs
Step 2 : Step 2 : Read the data from one of the partition, created using above command, hadoop fs -cat p93_products/part-m-00000
Step 3 : Load this directory as RDD using Spark and Python (Open pyspark terminal and do following). productsRDD = sc.textFile('p93_products')
Step 4 : Filter empty prices, if exists
#filter out empty prices lines
nonemptyjines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5 : Now sort data based on product_price in order. sortedPriceProducts=nonempty_lines.map(lambdaline:(float(line.split(',')[4]),line.split(',')[2])).sortByKey()
for line in sortedPriceProducts.collect(): print(line)
Step 6 : Now sort data based on product_price in descending order. sortedPriceProducts=nonempty_lines.map(lambda line: (float(line.split(',')[4]),line.split(',')[2])).sortByKey(False)
for line in sortedPriceProducts.collect(): print(line)
Step 7 : Get highest price products name. sortedPriceProducts=nonemptyJines.map(lambda line : (float(line.split(',')[4]),line-split(,,,,,)[2]))-sortByKey(False).take(1)
print(sortedPriceProducts)
Step 8 : Now sort data based on product_price as well as product_id in descending order.
#Dont forget to cast string #Tuple as key ((price,id),name)
sortedPriceProducts=nonemptyJines.map(lambda line : ((float(line print(sortedPriceProducts)
Step 9 : Now sort data based on product_price as well as product_id in descending order, using top() function.
#Dont forget to cast string
#Tuple as key ((price,id),name)
sortedPriceProducts=nonemptyJines.map(lambda line: ((float(line.s^^
print(sortedPriceProducts)
Step 10 : Now sort data based on product_price as ascending and product_id in ascending order, using takeOrdered{) function.
#Dont forget to cast string
#Tuple as key ((price,id),name) sortedPriceProducts=nonemptyJines.map(lambda line: ((float(line.split(','}[4]},int(line.split(','}[0]}},line.split(','}[2]}}.takeOrdered(10, lambda tuple : (tuple[0][0],tuple[0][1]))
Step 11 : Now sort data based on product_price as descending and product_id in ascending order, using takeOrdered() function.
#Dont forget to cast string
#Tuple as key ((price,id},name)
#Using minus(-) parameter can help you to make descending ordering , only for numeric value.
sortedPrlceProducts=nonemptylines.map(lambda line: ((float(line.split(','}[4]},int(line.split(','}[0]}},line.split(','}[2]}}.takeOrdered(10, lambda tuple : (-tuple[0][0],tuple[0][1]}}
B Solution :
Step 1 : Import Single table .
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db -username=retail_dba -password=cloudera -table=products -target-dir=p93_products -m 1
Note : Please check you dont have space between before or after '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs
Step 2 : Step 2 : Read the data from one of the partition, created using above command, hadoop fs -cat p93_products/part-m-00000
Step 3 : Load this directory as RDD using Spark and Python (Open pyspark terminal and do following). productsRDD = sc.textFile('p93_products')
Step 4 : Filter empty prices, if exists
#filter out empty prices lines
nonemptyjines = productsRDD.filter(lambda x: len(x.split(',')[4]) > 0)
Step 5 : Now sort data based on product_price in order. sortedPriceProducts=nonempty_lines.map(lambdaline:(float(line.split(',')[4]),line.split(',')[2])).sortByKey()
for line in sortedPriceProducts.collect(): print(line)
Step 6 : Now sort data based on product_price in descending order. sortedPriceProducts=nonempty_lines.map(lambda line: (float(line.split(',')[4]),line.split(',')[2])).sortByKey(False)
for line in sortedPriceProducts.collect(): print(line)
Step 7 : Get highest price products name. sortedPriceProducts=nonemptyJines.map(lambda line : (float(line.split(',')[4]),line-split(,,,,,)[2]))-sortByKey(False).take(1)
print(sortedPriceProducts)
Step 8 : Now sort data based on product_price as well as product_id in descending order.
#Dont forget to cast string #Tuple as key ((price,id),name)
sortedPriceProducts=nonemptyJines.map(lambda line : ((float(line print(sortedPriceProducts)
Step 9 : Now sort data based on product_price as well as product_id in descending order, using top() function.
#Dont forget to cast string
#Tuple as key ((price,id),name)
sortedPriceProducts=nonemptyJines.map(lambda line: ((float(line.s^^
print(sortedPriceProducts)
Step 10 : Now sort data based on product_price as descending and product_id in ascending order, using takeOrdered() function.
#Dont forget to cast string
#Tuple as key ((price,id},name)
#Using minus(-) parameter can help you to make descending ordering , only for numeric value.
sortedPrlceProducts=nonemptylines.map(lambda line: ((float(line.split(','}[4]},int(line.split(','}[0]}},line.split(','}[2]}}.takeOrdered(10, lambda tuple : (-tuple[0][0],tuple[0][1]}}