Using the MyOpenMath generated hockey data, create the following charts:
1) A bar chart of the divisions showing how many teams are in each one.
2) A histogram of goals allowed. Describe the shape of this data (symmetric, normal, or skewed; if skewed, state the direction).
3) Box plots of goals scored grouped by division. Make a conclusion about the variance of goals scored based on these box plots (e.g., which divisions have larger or smaller variance, and which ones are similar).
4) A Q-Q plot for goal differential. Then make a conclusion (using just the plot, don’t stress about different statistics) on whether this variable follows a normal distribution. (Note that this part is NOT included in the example provided. A lot of coding involves researching how to do things on your own, so I’m testing your research skills here a bit!)
You might not know what all of the variables mean in this data set. That’s ok, and part of data science: learning about things you aren’t familiar with. (I worked on a clinical team where I had to learn about bilirubins, something I’d literally never heard of until thrust into that role!) For this hockey data, I can assure you that a quick Google search will answer any questions you have about what the different variables mean.
Then, using the MyOpenMath generated house data (note that this will be a *new* data set from the previous assignment), create the following charts:
5) A bar chart showing the number of stories of the houses.
6) A histogram of bedrooms. Describe the shape of this data (symmetric, normal, or skewed; if skewed, state the direction).
7) Box plots of square footage grouped by the number of stories.
8) A Q-Q plot for square footage. Then make a conclusion (using just the plot, don’t stress about different statistics) on whether this variable follows a normal distribution.
Data Visualization
Examples
• Examples in SAS
– The following bar chart shows the counts of players making the major league minimum vs those who
are not. There are about four times as many players making above the league minimum as there are
making the league minimum.
– The following histogram displays player salaries. This is a right-skewed data set, with only some
players making above $10 million per season and very few above $20 million per season.
– The following boxplot displays player salaries. It tells a similar story to the histogram, with the longer tail
(the line from Q3 to the largest non-outlier value) on the right illustrating right skewness. The large
vertical line in the middle of the box is the median, and the mean (the diamond) being to the right
of the median is another sign of right skewness. (Symmetric data sets have medians and means that
are very close.)
– The following boxplots show stolen base counts for players grouped by whether they make the major
league minimum salary or more. Those who make the major league minimum appear to steal more
bases and have a larger variance (as indicated by the wider box).
• SAS Code
• Examples in R
– Note that you don’t need two separate write-ups; I’ve placed these graphics below only to keep
the two programs separate, but you’re free to include the R screenshots directly behind the SAS ones for
each of the assigned tasks.
• R Code
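– As a rough companion to the examples described above, here is a minimal base R sketch of the same four chart types (bar chart, histogram, single boxplot, grouped boxplots). The data are simulated stand-ins, not the course data: the data frame `players`, its columns `pay_group`, `salary_millions`, and `stolen_bases`, and the 0.7 league-minimum value are all hypothetical names and numbers chosen for illustration. Base R boxplots don't draw the mean by default, so the mean is added by hand to mimic the diamond in the SAS output.

```r
# Simulated stand-in data: every value and column name here is hypothetical.
set.seed(42)
n <- 200
pay_group <- sample(c("League minimum", "Above minimum"), n,
                    replace = TRUE, prob = c(0.2, 0.8))
players <- data.frame(
  pay_group = factor(pay_group),
  # Salaries in millions; 0.7 is an assumed league minimum, for illustration only
  salary_millions = ifelse(pay_group == "League minimum",
                           0.7, 0.7 + rlnorm(n, meanlog = 0.5, sdlog = 1)),
  stolen_bases = rpois(n, lambda = 5)
)

# Bar chart: number of players in each pay group
counts <- table(players$pay_group)
barplot(counts, main = "Players by pay group",
        xlab = "Pay group", ylab = "Number of players")

# Histogram: distribution of salaries (look for a long right tail)
hist(players$salary_millions, main = "Player salaries",
     xlab = "Salary (millions of dollars)")

# Horizontal boxplot of salaries, with the mean marked as a diamond
boxplot(players$salary_millions, horizontal = TRUE,
        main = "Player salaries", xlab = "Salary (millions of dollars)")
points(mean(players$salary_millions), 1, pch = 18, cex = 1.5)

# Grouped boxplots: stolen bases by pay group
boxplot(stolen_bases ~ pay_group, data = players,
        main = "Stolen bases by pay group",
        xlab = "Pay group", ylab = "Stolen bases")

# Numeric check on skewness: compare the mean with the median
skew_check <- c(mean = mean(players$salary_millions),
                median = median(players$salary_millions))
print(skew_check)
```

The formula interface (`stolen_bases ~ pay_group`) is what produces one box per group; swap in your own variable names for the assigned data sets. Comparing the printed mean and median is just a quick way to back up what the histogram and boxplot suggest about skewness.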