ICS 574 – HW #2: Descriptive Analytics
1. Solve the following questions on Google Colab or Databricks using Spark SQL:
a. [4 pts] Search the internet for a big dataset of at least 0.5 GB.
b. [4 pts] Create a DataFrame from the dataset.
c. Using the DataFrame, implement the following aggregation functions:
i. [4 pts] Aggregation with grouping
ii. [4 pts] Aggregation with pivoting
iii. [4 pts] Aggregation with rollups and cubes
d. Spark SQL supports the following window functions. Apply these functions to the DataFrame:
i. [10 pts] Ranking functions
1. rank
2. dense_rank
3. percent_rank
4. row_number
5. ntile
ii. [10 pts] Analytic functions
1. cume_dist
2. first_value
3. last_value
4. lag
5. lead
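For orientation, the window functions listed in part d can be sketched as follows. This is an illustration under stated assumptions, not a model answer: the sales table, its columns (region, rep, amount), and its five rows are made up, and Python's built-in sqlite3 module stands in for a Spark session so that the snippet runs without a cluster. The queries use standard SQL window syntax, so the same text should be accepted by spark.sql(...) against a temporary view, but that is an assumption to verify on your own Colab or Databricks setup.

```python
import sqlite3

# Hypothetical toy data; in the homework this would be the big dataset.
rows = [
    ("east", "alice", 300),
    ("east", "bob", 300),    # ties with alice on amount
    ("east", "carol", 100),
    ("west", "dave", 500),
    ("west", "erin", 200),
]

# sqlite3 is a stand-in for Spark here; window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Ranking functions. row_number and ntile add rep as a tie-breaker so that
# their output is deterministic; rank, dense_rank, and percent_rank keep the tie.
ranking_sql = """
SELECT region, rep, amount,
       rank()         OVER (PARTITION BY region ORDER BY amount DESC)      AS rnk,
       dense_rank()   OVER (PARTITION BY region ORDER BY amount DESC)      AS dense_rnk,
       percent_rank() OVER (PARTITION BY region ORDER BY amount DESC)      AS pct_rnk,
       row_number()   OVER (PARTITION BY region ORDER BY amount DESC, rep) AS row_num,
       ntile(2)       OVER (PARTITION BY region ORDER BY amount DESC, rep) AS bucket
FROM sales
ORDER BY region, amount DESC, rep
"""

# Analytic functions. last_value gets an explicit frame that spans the whole
# partition; with the default frame it would stop at the current row.
analytic_sql = """
SELECT region, rep, amount,
       cume_dist()      OVER (PARTITION BY region ORDER BY amount)           AS cume,
       first_value(rep) OVER (PARTITION BY region ORDER BY amount DESC, rep) AS top_rep,
       last_value(rep)  OVER (PARTITION BY region ORDER BY amount DESC, rep
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING)              AS bottom_rep,
       lag(amount)      OVER (PARTITION BY region ORDER BY amount DESC, rep) AS prev_amount,
       lead(amount)     OVER (PARTITION BY region ORDER BY amount DESC, rep) AS next_amount
FROM sales
ORDER BY region, amount DESC, rep
"""

ranking = conn.execute(ranking_sql).fetchall()
analytic = conn.execute(analytic_sql).fetchall()

for row in ranking:
    print(row)
for row in analytic:
    print(row)
```

In the east partition, rank leaves a gap after the tie (1, 1, 3) while dense_rank does not (1, 1, 2). lag and lead return NULL (None in Python) at the first and last row of each partition.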
Deliverables
• One PDF file that contains the following:
o A cover page that includes your KFUPM ID, name, HW number, and date.
o A description of the big dataset and its source.
o Each SQL statement and a snapshot of its output.
o Problems you faced, if any.
Note:
• Submit the homework before 11:59pm Saturday April 27, 2024.
• There are many YouTube videos that teach how to use Spark in Databricks or Colab.