Final Presentation and Report

Let’s talk about the final project. Now that we know the proper tools to handle big data, let’s focus on working on this project using what we learned in this course. You can collect your dataset from any appropriate source related to your topic of interest. Remember that your data must be big enough to qualify for the project (at least half a gigabyte).

General Instructions

You are expected to do a final project in this course and apply the tools we learned to a qualifying dataset of your interest. Your project should use at least one of the tools we learned in this course, including, but not limited to, Hive, Impala, MapReduce, Spark, or a combination of these tools. Use bash coding in the initial steps to massage your data before feeding it into Hadoop. The final project consists of two parts: 1- Presentation 2- Report.

  • Please be creative.
  • Find cool results and use appropriate (simultaneously cool) graphs and techniques.
  • For visualization, you can use Tableau to visualize and draw shiny plots. You can also make online WordCloud plots using this website. It is free! Try it.

1. Final presentation

You should prepare to present your project in about 8 minutes and be ready to have approximately 2 minutes of Q&A. All members of the groups are expected to participate in the presentation.

1.1. Presentation format

Apart from your group name, what else does the presentation include?

1- Description of the data, 2- Problem Statement, 3- Why is this big data? 4- Method & Results, and 5- Conclusion.

1- Description of the data:

Let us know what the data is. When was it collected? Who collected it? What is the source? How large is your data? Do you have any links to the data? How many records does it have? How many features (columns)? Is it structured or unstructured? ,…

    2- Problem Statement

    What are you trying to do? What is your aim? What are your research questions?

    3- Why is this big data?

    Why did you select this data? Why is it big data?

    4- Method & Results:

    What methods did you use? Are you using any Hadoop tools? What are your findings? Any plot? Graph?

    5- Conclusion

    Conclude your findings, and let us know if you have any suggestions regarding the data. For example, you might see that machine 12 has too many issues, so it is better to investigate that machine.

    1.2 Grading Rubric for Presentation

    Each member’s presentation rubric includes the following:

    a) Presentation skills (on-time, clear presentation, narration, your PowerPoint style, etc.) 7pts

    b) Project introduction 3pts

    c) Problem Statement 3pts

    d) Dataset 3pts (how big is your dataset? What is it about? Why is it big data? rows? columns? when collected?…)

    e) Methods 5pt

    f) Conclusion 5pt

    g) Novelty & Creativity 7pts (Being creative in your findings and results. Having a novel method and dataset)

    h) Participation 7pt

    Each team member will evaluate the other team members in this part. There will be a questionnaire in which you can give points to your teammates (not yourself). I’ll send the questionnaire a few days before the presentations. By default, I assume each member gets the total points; otherwise, I’ll look at the given grades.

    2. Report

    Please discuss your project in detail & hand over a clean, professional, neat report. Your report must include an executive summary (learn how to write an executive summary) plus all sections discussed in the presentation (Description of the data, Problem Statement, Why this is big data, Method and results, and Conclusion) and your code in the appendix. Reports are limited to up to 10 pages (excluding appendices). Notes on the final project report:

    • Please submit your Final report in PDF format.
    • Write a professional report including an executive summary as the first section. The executive summary is essential to a report, and you’ll need to ensure it is in your report. You need this skill set in your future career regardless of the field.
    • Please write all team members’ names, IDs, and group names on the report.
    • Please remember that your code should be in the appendix of your final report. If required, charts can be in the appendix, too.
    • As you know, the appendix does not count toward the page limit.
    • Remember to submit your final report on Canvas.

    2.1 Grading Rubric for Report

    a) Professional report skills (clean, well ordered, clear grammar, front page, page numbers, …) 10 pt

    b) Executive summary 10 pt

    c) Introduction & Problem Statement 5 pt

    d) Code and Dataset 5 pt (how big is your dataset? What is it about? Why is it big data? …)

    e) Methods 5 pt

    f) Conclusion 5 pt

    3. Speech Script

    Please write a speech script that corresponds to each slide.

    4. Dataset

    You bring your own dataset. Your data should be large enough to qualify as Big Data (minimum 500 MB) but not so large that it fills the server space (maximum 2 GB). There are many sources from which you can get a dataset. Here, I’m introducing some sources where you can get a dataset to work on:

    • Kaggle
    • noaa.gov
    • Weather.com
    • ,…

    You may check with me and discuss if your dataset qualifies for this project.
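
    For a quick self-check before asking, a minimal sketch like the one below could tell you whether a file falls inside the size limits. It assumes binary units (1 MB = 1024² bytes), which the limits above do not specify, and the function names and the file path are only illustrative.

```python
import os

# Limits taken from the assignment text; binary units are an assumption.
MIN_BYTES = 500 * 1024**2   # minimum 500 MB
MAX_BYTES = 2 * 1024**3     # maximum 2 GB


def qualifies(size_bytes: int) -> bool:
    """Return True if a dataset of this size fits the project limits."""
    return MIN_BYTES <= size_bytes <= MAX_BYTES


def file_qualifies(path: str) -> bool:
    """Check a file on disk; `path` is a hypothetical local file name."""
    return qualifies(os.path.getsize(path))
```

    For example, `file_qualifies("my_data.csv")` would return True for a 1.2 GB file and False for a 100 MB one.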

    5. Samples

    You can find samples of great projects presented by students (on other topics) in the past semester here. You can’t choose your topics from the samples.

    Sample1.pdf

    Sample2.pdf

    Sample3.pdf

    Sample4.pdf

    US Traffic Accidents Pattern Analysis
    December 8, 2021
    Our Data
    US Accidents data – a countrywide traffic accident dataset
    • 1.5 million traffic accidents in 49 states, more than 500MB
    • From February 2016 to December 2020
    • Sources: 1 US and state departments of transportation, 2 law enforcement agencies, 3 traffic cameras
    Why is this big data? It satisfies 4V principles!
    Problem Statement
    US Accidents data – a countrywide traffic accident dataset
    • Accidents Distribution Based On Time
    • Accidents Distribution Based On Location
    • Accidents Distribution Based On Weather
    • An LSTM regression model for prediction of accidents
    1 Hive, Impala
    2 MapReduce
    3 PySpark, Python
    1. Time Analysis
    1 Road Accidents During Day & Night
    Night (from 6 p.m. to 6 a.m.): about 1/3 of road accidents.
    Day (from 6 a.m. to 6 p.m.): most road accidents occurred at 5 p.m.
    We found that most road accidents do not happen in the middle of the night, but mostly in the afternoon.
    2 The Road Accidents Distribution in a Week
    Weekdays vs. weekends: 83% occurred on weekdays.
    Key information:
    • Thursday has the highest percentage of road accidents.
    • Weekdays have an almost 2 times higher accident percentage than weekends.
    3 The Monthly Road Accidents Distribution Analysis
    Distribution in a year: May (most), 18% occurred in May; November (least).
    Key information:
    • 45% of the road accidents occurred within 3 months, March to May.
    • May has the highest accident percentage of the year.
    • November is the month with the fewest (3.54%) road accidents.
    4 Car Accidents Year Analysis Based on Severity
    In each of the last 5 years, most accidents are Severity-2; in 2020 in particular, 45.3% of the total car accidents are moderately severe (level 2).
    In the last 5 years (2016-2020), the highly severe (Severity-4) accident cases in the US remain in the range of 0.94% to 1.78%.
    2. Location Analysis
    1 MapReduce
    Mapping: State (CA, FL, ……); City (Los Angeles, Miami, ……)
    Reduce and Outcome: State (CA: 448833, FL: 153007); City (Los Angeles: 39984, Miami: 36233)
    We are building a MapReduce function to count the frequency of accidents by State and City.
    2 The Road Accidents Distribution in States
    3 The Road Accidents Distribution in Cities
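
    The counting job described on the MapReduce slide can be illustrated in plain Python. This is only a sketch of the map, shuffle, and reduce idea, not the students' actual Hadoop code; the column name `State` and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (state, 1) pair for every accident record."""
    for record in records:
        yield record["State"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get the accident count per state."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up sample rows, not taken from the dataset.
sample = [{"State": "CA"}, {"State": "FL"}, {"State": "CA"}]
counts = reduce_phase(shuffle_phase(map_phase(sample)))
print(counts)
```

    Counting by city would follow the same pattern with a different key in the map phase.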
    3. Weather Analysis
    1 Weather Analysis – Humidity & Temperature
    Humidity – Frequency & Severity: 12.9% of cases occurred on extremely humid days vs. 1.1% on extremely dry days.
    Temperature – Frequency & Severity: 6.7% of cases occurred on very cold days vs. 3.2% on very hot days.
    ‘Most’ cases occurred at comfortable humidity and temperature (H & T).
    A wetter environment drives an increase in cases, but temperature does not.
    2 Specific Weather Analysis
    The Top Weather: “Fair”, 457086 cases
    Others: Mostly Cloudy, Cloudy, Clear, Partly Cloudy
    3 General Figure Obtained From Tableau
    4. Deep Learning Forecasting
    1 Building an LSTM model to forecast the accidents based on weather
    Variables: ‘Distance’, ‘Temperature’, ‘Humidity’, ‘Visibility’, ‘Windspeed’
    50 neurons in the hidden layer
    1 neuron in the output layer
    Loss function: MSE
    Optimizer: Adam
    The test MSE is 0.67 (difference 0.82 miles on average)
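
    The slide does not say how the 0.82-mile figure was obtained. It is consistent with taking the square root of the reported test MSE (the RMSE), assuming the target is measured in miles; the short check below shows that arithmetic under this assumption.

```python
import math

test_mse = 0.67               # test MSE reported on the slide
rmse = math.sqrt(test_mse)    # RMSE, in the same unit as the target (assumed miles)
print(round(rmse, 2))
```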
    User Insight for Twitter Music Recommendation System
    1 Introduction
    2 Dataset
    3 Method
    4 Result
    5 Conclusion
    Background
    We designed a new recommender for Twitter!
    • Gather Data from User
    • Give Appropriate Recommendation
    • Platform Get Higher Profit
    • Retain More Customers
    • Attract More Artists
    Problem Statement
    Purpose: To design a Music Recommender System providing users personalized experiences by recommending songs that are most likely to be liked by individuals.
    General Design
    Recommendations will be made:
    About the dataset
    Why is it Big Data?
    • Millions of rows, unable to process using traditional relational database software.
    • Tried in Excel and the laptop crashed.
    Method
    Unlike Hive & MapReduce, Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances.
    • SparkCore
    • SparkSQL
    Tableau is a visual analytics platform transforming the way we use data to solve problems.
    • Diverse visualization
    Galileo Galileo @TheREALGalileo: Spark and Tableau are the next generation!!!
    Result I: Recommendation by Artist
    Diana Krall, Katy Perry and MDS Hash rank at the top, so their works would be recommended.
    ➔ Group by artist_id
    ➔ Aggregate by Count user_id
    ➔ Sort by counts
    Result I: Recommendation by Artist
    Arctic Monkeys, Coldplay and Ed Sheeran rank at the top, so their works would be recommended.
    ➔ Group by artist_id
    ➔ Aggregate by distinctCount user_id
    ➔ Sort by counts
    Result I: Recommendation by Artist
    Arctic Monkeys, Coldplay and Ed Sheeran rank at the top, so their works would be recommended.
    ➔ Group by user_id, artist_id
    ➔ Window
    ➔ Group by artist_id, do count
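
    The difference between counting all events and counting distinct users per artist can be sketched in plain Python. The slides used Spark; this stand-in only mirrors the "group by artist_id, distinct count of user_id, sort by counts" steps, and the sample events are invented for illustration.

```python
from collections import defaultdict


def rank_artists_by_distinct_users(events):
    """events: iterable of (user_id, artist_id) listening events."""
    users_per_artist = defaultdict(set)
    for user_id, artist_id in events:
        # A set keeps each user once, however often they played the artist.
        users_per_artist[artist_id].add(user_id)
    counts = {artist: len(users) for artist, users in users_per_artist.items()}
    # Sort by count descending, then artist_id for a stable order on ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up events: user u1 plays artist a1 twice but is counted once.
events = [("u1", "a1"), ("u1", "a1"), ("u2", "a1"), ("u1", "a2")]
ranking = rank_artists_by_distinct_users(events)
print(ranking)
```

    A plain event count would let one heavy listener push an artist to the top, which may be why the later slides switch to a distinct count.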
    Result II: Recommendation by Sentiment
    ● VADER
    ● Sentiment tracker
    ● Momentum Effect
    The first track experienced the momentum effect in April, so it should be recommended more after that.
    ➔ Join table
    ➔ Create Month column
    ➔ Group by Month, track
    ➔ Aggregate by averaging VADER score
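
    The monthly sentiment aggregation can be sketched as follows. This plain-Python version assumes the tables are already joined into rows of (track_id, ISO timestamp, score); the row layout, the function name, and the sample scores are hypothetical, and the scores are not real VADER output.

```python
from collections import defaultdict
from datetime import datetime


def monthly_average_scores(rows):
    """rows: iterable of (track_id, iso_timestamp, sentiment_score)."""
    totals = defaultdict(lambda: [0.0, 0])   # (month, track) -> [sum, count]
    for track_id, created_at, score in rows:
        month = datetime.fromisoformat(created_at).month   # the Month column
        bucket = totals[(month, track_id)]
        bucket[0] += score
        bucket[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Made-up rows for one track across two months.
rows = [
    ("t1", "2019-03-05T10:00:00", 0.2),
    ("t1", "2019-04-01T09:00:00", 0.6),
    ("t1", "2019-04-20T18:30:00", 0.8),
]
averages = monthly_average_scores(rows)
print(averages)
```

    A rising monthly average for a track is what the slide calls the momentum effect.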
    Result III: Recommendation by Language
    ● Listening to a foreign song is joyful
    ● ES, namely Spanish, tops this chart; recommend more Spanish songs for English users
    ➔ Select English user
    ➔ Filter by not English in tweet_lang
    ➔ Group by tweet_lang
    ➔ Aggregate by count
    Result IV: Recommendation by Popularity
    ● Simulated Billboard
    ● The longer a track stays on the board, the higher priority it gets in the recommendation system
    ➔ Group by Month, track_id
    ➔ Window to decide top 5
    ➔ Sort by Month, rank
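
    The "window to decide top 5" step can be illustrated without Spark. The sketch below ranks tracks within each month and keeps the top `k`; ranking by play count is an assumption, since the slide does not name the metric, and the sample events are invented.

```python
from collections import Counter, defaultdict


def top_tracks_per_month(events, k=5):
    """events: iterable of (month, track_id) listening events."""
    plays = Counter(events)                      # group by (month, track_id)
    by_month = defaultdict(list)
    for (month, track_id), n in plays.items():
        by_month[month].append((track_id, n))
    chart = []
    for month in sorted(by_month):               # sort by month
        # Rank inside the month: most plays first, track_id breaks ties.
        ranked = sorted(by_month[month], key=lambda t: (-t[1], t[0]))[:k]
        for rank, (track_id, n) in enumerate(ranked, start=1):
            chart.append((month, rank, track_id, n))
    return chart


# Made-up events; k=1 keeps only the top track of each month.
events = [(1, "a"), (1, "a"), (1, "b"), (2, "b")]
chart = top_tracks_per_month(events, k=1)
print(chart)
```

    Counting how many months a track appears in such a chart would give the "time on board" used for prioritizing.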
    Conclusion
    Thank you !!
    Instructions:
    You are expected to do a final project in this course and utilize the tools that we learned on a qualifying dataset of your interest. Your project should include the utilization of either Hive, Impala, Pig, MapReduce, or Spark, or a combination of these tools. You might use bash coding in the initial steps to massage your data before feeding it into Hadoop. The final project consists of two parts: 1- Presentation 2- Report.
    General points:
    • Please be creative. Writing a few lines of queries in Hive or Impala is not going to land you a good grade.
    • Find cool results and use cool charts and techniques.
    • For visualization, you can use Tableau to visualize and draw shiny plots. You can also make online WordCloud plots using this website. It is free! Try it.
    1. Final presentation
    You should prepare to present your project in about 8 minutes and be ready to have approximately 2 minutes of Q&A. All members of the
    groups are expected to participate in the presentation.
    1.1. Presentation format
    Apart from your group name, what else does the presentation include?
    1- Description of the data, 2- Problem Statement, 3- Why is this big data? 4- Method & Results, and 5- Conclusion.
    1- Description of the data:
    Let us know what the data is. When was it collected? Who collected it? What is the source? How large is your data? Do you have any links to the data? How many records does it have? How many features (columns)? Is it structured or unstructured? ,…
    2- Problem Statement
    What are you trying to do? What is your aim? What are your research questions?
    3- Why is this big data?
    What is the reason that you did select this data? Why is it big data?
    4- Method & Results:
    What methods did you use? Are you using any Hadoop tools? What are your findings? Any plot? Graph?
    5- Conclusion
    Conclude your findings and let us know if you have any suggestions regarding the data. For example, you might see machine 12 has too
    many issues, and then, it is better to investigate the machine.
    1.2 Grading Rubric for Presentation
    Each member’s presentation rubric includes:
    a) Presentation skills (on-time, clear presentation, your ppt,…) 5 pt
    b) Project introduction 5pt
    c) Problem Statement 5pt
    d) Dataset 5pt (how big is your dataset?, what is it about? Why is it big data? …)
    e) Methods 5pt
    f) Conclusion 5pt
    g) Novelty & Creativity 5pt (Being creative in your findings and results. Having a novel method and dataset)
    h) Participation 5pt
    Each team member will evaluate the other team members in this part. There will be a questionnaire in which each of you can give points to your teammates. I’ll send the questionnaire a few days before the presentations. By default, I assume that each member gets the full points; otherwise, I’ll look at the given grades.
    About the dataset
    “Nowplaying-RS” dataset from kaggle.com
    Collected by Asmita Poddar on 25th August 2019.
    Recorded listening events on Twitter.
    11.6 million music listening events, 139K users and 346K tracks.
    Size 3.67 Gigabytes
    11.6 million rows and 43 columns
    Analysis on COVID-19 Infection Situation
    DAT 560M Big Data and Cloud Computing
    Agenda
    ● Data Description
    ● Why is this big data
    ● Problem Statement
    ● Methods & Results
    ● Conclusion & Recommendation
    Data Description
    ● A dataset about COVID-19 from CDC
    ● Dating from January 2020 to October 2021
    ● The data is the COVID-19 case surveillance public use data, with factors such as geography and demography.
    ● Unstructured Columns: 19, Structured Columns: 8; Unstructured Columns: 32
    ● More information about our data on the next page
    Why is this Big Data?
    Case month: The earlier month of clinical date (Month in year)
    Res state: State of residence (State abbreviation, e.g., CA)
    Res county: County of residence
    Age group: Age group of patients [(0-17),(18-49),(50-64),(65+)]
    Sex: Gender of patients [male, female]
    Process: Under what process was the case first identified (Clinical evaluation; Routine surveillance; Contact tracing of case patient; Multiple; Other; Unknown; Missing)
    Death yn: Did the patient die due to this illness? (Yes or No)
    Vaccination: Number of fully vaccinated people
    Why is this Big Data?
    Covid19 Case Dataset: Original Size 4.9 GB; Modified Size 4.2 GB; Unstructured Columns 19; Structured Columns 8; Original Row Size 37,532,072; Modified Row Size 32,865,603
    Vaccination dataset: Sizes 100 MB and 750 MB; Unstructured Columns 32; Structured Columns 4; Original Row Size 1,048,575; Modified Row Size 1,048,575
    Problem Statement
    ● Problem 1: When was the U.S. COVID-19 pandemic most severe?
    ● Problem 2: Which states and counties face the most severe pandemic problem nowadays?
    ● Problem 3: Which age group has the largest number of people who got COVID-19?
    ● Problem 4: What is the number of deaths after being confirmed with COVID-19?
    ● Problem 5: Under which process were most cases first identified?
    ● Problem 6: For each state, what is the total number of people who are fully vaccinated?
    ● Problem 7: After comparing the total number of people who are fully vaccinated and the total number of confirmed cases for each state, what interesting facts do you find?
    Problem 1
    When was the U.S. COVID-19 pandemic most severe?
    Method: MapReduce
    Result:
    Max: 202012 (December 2020)
    # of cases: 4738134
    Problem 2
    Which states and counties face the most severe pandemic problem nowadays?
    Method: MapReduce
    Result of state:
    4 states have more than 1500k cases
    1. California
    2. Florida
    3. New York
    4. Illinois
    Result of County:
    Los Angeles: 1.4 million cases
    Problem 3
    Which age group has the largest number of people who got COVID-19?
    Method: MapReduce
    Result:
    The 18-49 age group has the largest number of laboratory-confirmed cases
    Problem 4
    What is the total number of people who died after being confirmed with COVID-19?
    Method: MapReduce
    Result:
    Missing & NA values due to privacy
    306731 patients died after being confirmed by laboratory
    0.9% death rate
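
    The slides do not state which denominator was used for the death rate. Dividing the reported deaths by the modified row count of the case dataset (32,865,603) reproduces the 0.9% figure, so the check below uses that as an assumption.

```python
deaths = 306_731        # deaths reported on the slide
cases = 32_865_603      # assumption: modified row count of the case dataset
death_rate = deaths / cases
print(f"{death_rate:.1%}")
```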
    Problem 5
    Under which process were most cases first identified?
    Method: MapReduce
    Result:
    Most people were confirmed with COVID-19 by clinical evaluation
    Problem 6 & 7
    After comparing the total number of people who are fully vaccinated and the total number of confirmed cases for each state, what interesting facts do you find?
    Method: Impala & Tableau
    Result:
    The total # of fully vaccinated people in each state
    A. In general, the more confirmed cases a state has, the larger the number of vaccinations in the state.
    B. However, some individual states like Florida have a large confirmed population but a low vaccinated population.
    Conclusion
    ● Winter
    ● California
    ● High population density
    ● Middle-aged population
    ● Evenly distributed among male and female
    Recommendation
    ● Protection measures
    ● Vaccine on the Road – improve vaccination in Florida
    Thank you!
