Let’s talk about the final project. Now that we know the proper tools for handling big data, let’s put what we learned in this course to work. You can collect your dataset from any appropriate source related to your topic of interest. Remember that your data must be big enough to qualify for the project (at least half a gigabyte).
General Instructions
You are expected to do a final project in this course, applying the tools we learned to a qualifying dataset of your choice. Your project should use at least one of the tools covered in this course, including but not limited to Hive, Impala, MapReduce, and Spark, or a combination of these tools. Use bash scripting in the initial steps to massage your data before feeding it into Hadoop. The final project consists of two parts: 1- Presentation, 2- Report.
- Please be creative.
- Find cool results and use appropriate (and equally cool) graphs and techniques.
- For visualization, you can use Tableau to visualize and draw shiny plots. You can also make online WordCloud plots using this website. It is free! Try it.
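The data-massaging step mentioned above is often done in bash, but the same pre-cleaning can be sketched in Python. A minimal example, with an invented column layout and invented values, that drops malformed CSV rows before the data ever reaches Hadoop:

```python
import csv
import io

def clean_rows(lines, expected_cols):
    """Keep only rows with the expected column count and a non-empty first field."""
    reader = csv.reader(lines)
    header = next(reader)
    cleaned = [header]
    for row in reader:
        if len(row) == expected_cols and row[0].strip():
            cleaned.append(row)
    return cleaned

# A tiny in-memory stand-in for a raw data file (values are hypothetical).
raw = io.StringIO(
    "id,state,severity\n"
    "1,CA,2\n"
    "2,FL\n"     # malformed: missing a column
    ",NY,3\n"    # malformed: empty id
    "3,NY,4\n"
)
rows = clean_rows(raw, expected_cols=3)
print(len(rows) - 1)  # 2 valid data rows survive
```

On a real multi-gigabyte file you would stream line by line (or stick with bash tools like `awk` and `cut`) rather than hold everything in memory.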
1. Final presentation
You should prepare to present your project in about 8 minutes and be ready to have approximately 2 minutes of Q&A. All members of the groups are expected to participate in the presentation.
1.1. Presentation format
Apart from your group name, what else does the presentation include?
1- Description of the data,
2- Problem Statement,
3- Why is this big data?
4- Method & Results, and
5- Conclusion.
1- Description of the data:
Let us know what the data is. When was it collected? Who collected it? What is the source? How large is your data? Do you have any links to the data? How many records does it have? How many features (columns)? Is it structured or unstructured?
2- Problem Statement
What are you trying to do? What is your aim? What are your research questions?
3- Why is this big data?
Why did you select this data? Why does it qualify as big data?
4- Method & Results:
What methods did you use? Are you using any Hadoop tools? What are your findings? Any plot? Graph?
5- Conclusion
Conclude your findings, and let us know if you have any suggestions regarding the data. For example, if machine 12 has too many issues, it is better to investigate that machine.
1.2 Grading Rubric for Presentation
Each member’s presentation rubric includes the following:
a) Presentation skills (on time, clear presentation, narration, your PowerPoint style, etc.) 7 pts
b) Project introduction 3 pts
c) Problem Statement 3 pts
d) Dataset 3 pts (how big is your dataset? What is it about? Why is it big data? Rows? Columns? When was it collected? …)
e) Methods 5 pts
f) Conclusion 5 pts
g) Novelty & Creativity 7 pts (being creative in your findings and results; having a novel method and dataset)
h) Participation 7 pts
Each team member will evaluate the other team members in this part. There will be a questionnaire in which you can give points to your teammates (not yourself). I’ll send the questionnaire a few days before the presentations. By default, I assume each member gets full points; otherwise, I’ll look at the given grades.
2. Report
Please discuss your project in detail and hand in a clean, professional, neat report. Your report must include an executive summary (learn how to write an executive summary) plus all sections discussed in the presentation (Description of the data, Problem Statement, Why this is big data, Method & Results, and Conclusion), with your code in the appendix. Reports are limited to 10 pages (excluding appendices). Notes on the final project report:
- Please submit your Final report in PDF format.
- Write a professional report with an executive summary as the first section. The executive summary is an essential part of a report, so make sure yours has one. You will need this skill in your future career regardless of the field.
- Please write all team members’ names, IDs, and group names on the report.
- Please remember that your code should be in the appendix of your final report. If required, charts can be in the appendix, too.
- As you know, the appendix does not count toward the page limit.
Remember to submit your final report on Canvas.
2.1 Grading Rubric for Report
a) Professional report skills (clean, well organized, clear grammar, front page, page numbers, etc.) 10 pts
b) Executive summary 10 pts
c) Introduction & Problem Statement 5 pts
d) Code and Dataset 5 pts (how big is your dataset? What is it about? Why is it big data? …)
e) Methods 5 pts
f) Conclusion 5 pts
3. Speech Script
Please write a speech script corresponding to each slide.
4. Dataset
You bring your own dataset. Your data should be large enough to qualify as big data (minimum 500 MB) but not so large that it fills the server space (maximum 2 GB). There are many sources from which you can get a dataset. Here are a few to get you started:
- Kaggle
- noaa.gov
- Weather.com
- …
You may check with me and discuss if your dataset qualifies for this project.
5. Samples
You can find samples of great projects presented by students (on other topics) in past semesters here. You can’t choose your topics from the samples.
- Sample1.pdf
- Sample2.pdf
- Sample3.pdf
- Sample4.pdf
US Traffic Accidents
Pattern Analysis
December 8, 2021
Our Data
US Accidents data – a countrywide traffic accident dataset: 1.5 million traffic accidents, more than 500 MB, across 49 states, from February 2016 to December 2020. Sources: (1) US and state departments of transportation, (2) law enforcement agencies, (3) traffic cameras.
Why is this big data? It satisfies the 4V principles!
Problem Statement
US Accidents data – a countrywide traffic accident dataset. We study the accident distribution based on time, location, and weather, plus an LSTM regression model for prediction of accidents. Tools: (1) Hive, Impala; (2) MapReduce; (3) PySpark, Python.
1. Time Analysis
1) Road Accidents During Day & Night
Night (6 p.m. to 6 a.m.): about 1/3 of road accidents. Day (6 a.m. to 6 p.m.): the majority of road accidents, with the peak at 5 p.m. We found that most road accidents do not happen in the middle of the night, but mostly in the afternoon.
2) Road Accident Distribution over a Week
83% of accidents occurred on weekdays. Key information: Thursday has the highest percentage of road accidents, and weekdays have an accident percentage almost 2 times higher than weekends.
3) Monthly Road Accident Distribution
18% of accidents occurred in May, the most of any month; November has the least (3.54%). Key information: 45% of road accidents occurred within the 3 months from March to May.
4) Yearly Accident Analysis Based on Severity
In each of the last 5 years, most accidents were Severity-2; in 2020 especially, 45.3% of all car accidents were moderately severe (level 2). Over the last 5 years (2016-2020), highly severe (Severity-4) accidents in the US remained in the range of 0.94% to 1.78%.
2. Location Analysis
1) MapReduce
We built a MapReduce job to count the frequency of accidents by state and by city. The map phase emits state and city keys; the reduce phase sums their counts. Outcome: CA: 448,833; FL: 153,007; Los Angeles: 39,984; Miami: 36,233.
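The counting job described above follows the classic MapReduce word-count pattern. A rough sketch of that logic in plain Python, where the shuffle phase is simulated with a dictionary and the records are invented:

```python
from collections import defaultdict

def mapper(record):
    """Emit (key, 1) for each field we want to count."""
    state, city = record
    yield ("state:" + state, 1)
    yield ("city:" + city, 1)

def reducer(key, values):
    """Sum all counts emitted for the same key."""
    return key, sum(values)

# Shuffle: group mapper output by key, then reduce each group.
records = [("CA", "Los Angeles"), ("CA", "Los Angeles"), ("FL", "Miami")]
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)
counts = dict(reducer(k, v) for k, v in groups.items())
print(counts["state:CA"])  # 2
```

On the cluster, Hadoop performs the shuffle and runs many mapper and reducer instances in parallel; the per-record logic stays exactly this simple.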
2) Road Accident Distribution across States
3) Road Accident Distribution across Cities
3. Weather Analysis
1) Humidity & Temperature
Humidity (frequency & severity): 12.9% of cases occurred on extremely humid days vs. 1.1% on extremely dry days. Temperature (frequency & severity): most cases occurred at comfortable humidity and temperature; 6.7% of cases occurred on very cold days vs. 3.2% on very hot days. Wetter conditions drive case counts up, but temperature does not.
2) Specific Weather Analysis
The top weather condition is “Fair” (457,086 cases), followed by Mostly Cloudy, Cloudy, Clear, Partly Cloudy, and others.
3) General Figure Obtained from Tableau
4. Deep Learning Forecasting
1) Building an LSTM Model to Forecast Accidents Based on Weather
Input features: Distance, Temperature, Humidity, Visibility, Wind Speed. Architecture: 50 neurons in the hidden layer, 1 neuron in the output layer. Loss function: MSE; optimizer: Adam. The test MSE is 0.67, i.e., a difference of 0.82 miles on average.
User Insight for Twitter Music Recommendation System
Agenda: 1. Introduction, 2. Dataset, 3. Method, 4. Result, 5. Conclusion
Background
We designed a new recommender for Twitter! Gather data from users → give appropriate recommendations → the platform gets higher profit, retains more customers, and attracts more artists.
Problem Statement
Purpose: to design a music recommender system that provides users with personalized experiences by recommending the songs most likely to be liked by each individual.
General Design
[Figure: overview of how recommendations will be made]
About the Dataset
Why is it Big Data?
- Millions of rows, impossible to process with traditional relational database software.
- We tried it in Excel and the laptop crashed.
Method
Unlike Hive & MapReduce, Spark is a general-purpose distributed data processing engine suitable for a wide range of circumstances; we used Spark Core and Spark SQL. Tableau is a visual analytics platform transforming the way we use data to solve problems, with diverse visualizations.
Galileo Galileo (@TheREALGalileo): “Spark and Tableau are the next generation!!!”
Result I: Recommendation by Artist
By raw play count, Diana Krall, Katy Perry, and MDS Hash rank at the top, so their works would be recommended. Pipeline: group by artist_id → aggregate by counting user_id → sort by counts.
By distinct listeners, Arctic Monkeys, Coldplay, and Ed Sheeran rank at the top, so their works would be recommended. Pipeline: group by artist_id → aggregate by distinct count of user_id → sort by counts.
The same distinct-listener ranking also results from: group by user_id and artist_id → window → group by artist_id and count.
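The difference between the two pipelines (raw play counts vs. distinct listeners) can be sketched in plain Python. The events below are invented; in the project this would run as Spark groupBy/aggregate operations:

```python
from collections import defaultdict

# Made-up listening events: (user_id, artist_id).
events = [
    ("u1", "coldplay"), ("u1", "coldplay"), ("u2", "coldplay"),
    ("u1", "mds_hash"), ("u1", "mds_hash"), ("u1", "mds_hash"),
]

# Plain count: repeat plays by one heavy user can dominate the ranking.
plays = defaultdict(int)
for user, artist in events:
    plays[artist] += 1

# Distinct count: each user counts once per artist, a fairer popularity signal.
listeners = defaultdict(set)
for user, artist in events:
    listeners[artist].add(user)
distinct = {artist: len(users) for artist, users in listeners.items()}

ranked = sorted(distinct, key=distinct.get, reverse=True)
print(ranked[0])  # coldplay: 2 distinct listeners beats mds_hash's 1
```

Here both artists tie on raw plays (3 each), but the distinct count separates them, which is exactly why the two slides above produce different top artists.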
Result II: Recommendation by Sentiment
Using VADER as a sentiment tracker, we look for a momentum effect. The first track experienced the momentum effect in April, so it should be recommended more after that. Pipeline: join tables → create a Month column → group by Month and track → aggregate by averaging the VADER score.
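The group-and-average step can be sketched in plain Python. The track ids and scores below are invented; in the project the scores would be VADER sentiment values computed in Spark:

```python
from collections import defaultdict

# (month, track_id, sentiment_score) rows, as if the tables were already joined.
rows = [
    ("2014-03", "t1", 0.1), ("2014-04", "t1", 0.25),
    ("2014-04", "t1", 0.75), ("2014-04", "t2", -0.2),
]

# Accumulate (sum, count) per (month, track) group, then average.
sums = defaultdict(lambda: [0.0, 0])
for month, track, score in rows:
    acc = sums[(month, track)]
    acc[0] += score
    acc[1] += 1

monthly_avg = {key: total / n for key, (total, n) in sums.items()}
print(monthly_avg[("2014-04", "t1")])  # 0.5: the jump from March is the "momentum"
```

A month-over-month rise in a track's average score is the momentum signal the slide describes.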
Result III: Recommendation by Language
Listening to a foreign song is joyful. ES (Spanish) tops this chart, so we recommend more Spanish songs to English users. Pipeline: select English users → filter for tweet_lang not English → group by tweet_lang → aggregate by count.
Result IV: Recommendation by Popularity
We simulated a Billboard chart: the longer a track stays on the board, the higher the priority it gets in the recommendation system. Pipeline: group by Month and track_id → window to decide the top 5 → sort by Month and rank.
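The "window to decide the top 5" step is a rank-within-group operation. A plain-Python sketch of the idea (play counts invented, and top 2 instead of top 5 to keep the example small):

```python
from collections import defaultdict

# (month, track_id, plays) rows; the numbers are made up for illustration.
rows = [
    ("2014-03", "t1", 90), ("2014-03", "t2", 70), ("2014-03", "t3", 40),
    ("2014-04", "t2", 95), ("2014-04", "t1", 60),
]

top_n = 2  # the project used top 5
by_month = defaultdict(list)
for month, track, plays in rows:
    by_month[month].append((plays, track))

# Rank within each month (the "window" step) and keep the top entries.
board = {}
for month, entries in sorted(by_month.items()):
    board[month] = [track for plays, track in sorted(entries, reverse=True)[:top_n]]

print(board["2014-04"])  # ['t2', 't1']
```

Counting how many months a track appears in `board` then gives the "time on board" signal the recommender prioritizes.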
Conclusion
Thank you !!
About the Dataset
The “Nowplaying-RS” dataset, collected by Asmita Poddar and obtained from kaggle.com on 25 August 2019. It records music listening events on Twitter: 11.6 million listening events, 139K users, and 346K tracks. Size: 3.67 gigabytes; 11.6 million rows and 43 columns.
Analysis of COVID-19 Infection Situation
DAT 560M Big Data and Cloud Computing
Agenda: Data Description; Why is this big data; Problem Statement; Methods & Results; Conclusion & Recommendation
Data Description
- A dataset about COVID-19 from the CDC, dating from January 2020 to October 2021.
- The data is the COVID-19 case surveillance public use data, with factors such as geography and demography.
- Columns: 19 unstructured and 8 structured in the case dataset; 32 unstructured in the vaccination dataset.
- More information about our data on the next page.
Why is this Big Data?
- Case month: the earlier month of the clinical date (month in year)
- Res state: state of residence (abbreviation, e.g., CA)
- Res county: county of residence
- Age group: age group of patients [(0-17), (18-49), (50-64), (65+)]
- Sex: gender of patients [male, female]
- Process: under what process the case was first identified (Clinical evaluation; Routine surveillance; Contact tracing of case patient; Multiple; Other; Unknown; Missing)
- Death yn: did the patient die due to this illness? (Yes or No)
- Vaccination: number of fully vaccinated people
Why is this Big Data?
Dataset               | Original Size | Modified Size | Unstructured Cols | Structured Cols | Original Rows | Modified Rows
Covid19 case dataset  | 4.9 GB        | 4.2 GB        | 19                | 8               | 37,532,072    | 32,865,603
Vaccination dataset   | 750 MB        | 100 MB        | 32                | 4               | 1,048,575     | 1,048,575
Problem Statement
- Problem 1: When was the U.S. COVID-19 pandemic most severe?
- Problem 2: Which states and counties face the most severe pandemic problem nowadays?
- Problem 3: Which age group has the largest number of people who got COVID-19?
- Problem 4: What is the number of deaths after COVID-19 was confirmed?
- Problem 5: Under which process were the most cases first identified?
- Problem 6: For each state, what is the total number of people who are fully vaccinated?
- Problem 7: After comparing the total number of fully vaccinated people and the total number of confirmed cases for each state, what interesting facts do you find?
Problem 1: When was the U.S. COVID-19 pandemic most severe?
Method: MapReduce. Result: the maximum was 2020-12, with 4,738,134 cases.
Problem 2: Which states and counties face the most severe pandemic problem nowadays?
Method: MapReduce. Result by state: 4 states have more than 1,500k cases: 1. California (CA), 2. Florida (FL), 3. New York (NY), 4. Illinois (IL). Result by county: Los Angeles, with 1.4 million cases.
Problem 3: Which age group has the largest number of people who got COVID-19?
Method: MapReduce. Result: the 18-49 age group has the largest number of laboratory-confirmed cases.
Problem 4: What is the total number of people who died after COVID-19 was confirmed?
Method: MapReduce. Result: 306,731 patients died after laboratory confirmation, a 0.9% death rate. (Some values are missing/NA due to privacy.)
Problem 5: Under which process were the most cases first identified?
Method: MapReduce. Result: most people’s COVID-19 was confirmed by clinical evaluation.
Problems 6 & 7: After comparing the total number of fully vaccinated people and the total number of confirmed cases for each state, what interesting facts do you find?
Method: Impala & Tableau. Result: the total number of fully vaccinated people for each state. (A) In general, the more confirmed cases a state has, the larger its number of vaccinations. (B) However, some individual states, like Florida, have a large confirmed population but a low vaccinated population.
Conclusion
- Winter
- California
- High population density
- Middle-aged population
- Evenly distributed among males and females
Recommendation
- Protection measures
- Vaccine on the Road – improve vaccination in Florida
Thank you!