All of the exercises at the end of the chapters must be completed (Microsoft Excel). The cancer graphing assignment does not have a course attached.
Section 3
Measures of Central Tendency
Rhonda Knehans Drake
Associate Professor, New York University
Data Analytics, Interpretation and Reporting
Copyright © 2013
2
• One way to aid in better understanding your sample data is with
descriptive measures or statistics.
• Each basic summary statistic has its own unique purpose and,
therefore, each plays a critical role in helping you describe and
understand your data. However, not fully understanding what
each of these statistics measures, how it is calculated, or when
to use one versus another can cause you to draw erroneous
conclusions.
• For example, did you know that the average or mean is quite
sensitive to extreme data values (outliers), which could cause you
to draw incorrect conclusions from your data? Do you know
what to do in this case?
• This section will show you how to calculate the basic measures
of central tendency.
Introduction
3
• One way to gain a better understanding of your quantitative data is
with measures of central tendency.
• These measures tell, for example, where the center of the distribution
of income levels lies.
• The Three Measures of Central Tendency are:
– Mean
– Median
– Mode
Measures of Central Tendency
4
• The Mean is the sum of all observations divided by the number of
observations.
• The mean or average is the most widely used measure of central
tendency of a set of observations.
• We will denote the sample mean as $\bar{x}$ (pronounced “x bar”), the
population mean as $\mu$ (the Greek letter mu), the number of
observations in a sample as n, the number of observations in a
population as N, the observations for the variable of concern as x1,
x2, x3, x4,… and the sum of all observations as $\sum x$ ($\Sigma$ is the
uppercase Greek letter sigma).
• With the sample mean we are estimating the population mean
denoted as µ.
The Mean I
5
Example: Suppose the following are the prices of 5 houses sold in
Seattle, in thousands of dollars.
158 189 265 127 191
What is the mean?
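As a quick check, the definition of the mean (the sum of all observations divided by the number of observations) can be sketched in Python. The variable names are illustrative only and are not part of the course materials.

```python
# House prices in thousands of dollars, taken from the example above.
prices = [158, 189, 265, 127, 191]

# Mean = sum of all observations / number of observations.
mean_price = sum(prices) / len(prices)

print(mean_price)  # 186.0
```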
The Mean II
6
• Sometimes a data set may contain outliers which are extremely
low or high values. They may be legitimate or not legitimate.
Example: Consider our sample prices of houses. Assume the 265
figure was 982 instead.
158 189 982 127 191
This outlier value of 982 pulls the mean up, so the mean no longer
reflects a typical price.
Old mean = $930 / 5 = $186 (in thousands)
New mean = $1,647 / 5 = $329.4 (in thousands)
So, what do we do in this case?
The Mean III
7
• Median represents the “exact middle” observation for the variable
of concern when the values in your sample are ranked from
lowest to highest.
• Median is also an important measure of central tendency and is
not as sensitive to outliers.
The median is determined by performing the following steps:
1. Rank the observations in your data set from the lowest value to
the highest value.
2. Select the observation at position (n + 1)/2 in this ranked data set,
where n is the size of the sample drawn.
If the sample size, n, is an even number, then (n + 1)/2 will fall
exactly between two observations. In this case, the median is
simply the average of these two observations.
The Median I
8
Example: Consider the following 5 observations which are weight
loss figures in pounds at a health club after 4 weeks for new
members.
10 5 19 8 3
What is the median?
The Median II
3 5 8 10 19
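The rank-and-pick steps for the median can be sketched as a small Python function. This is a minimal illustration of those steps; the function name `median` and the variable names are chosen here for the example.

```python
def median(values):
    """Return the median using the rank-and-pick steps for the median."""
    ranked = sorted(values)  # step 1: rank from lowest to highest
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the observation at position (n + 1)/2 (index n // 2 from zero).
        return ranked[middle]
    # Even n: average the two observations on either side of position (n + 1)/2.
    return (ranked[middle - 1] + ranked[middle]) / 2


weight_loss = [10, 5, 19, 8, 3]
print(sorted(weight_loss))  # [3, 5, 8, 10, 19]
print(median(weight_loss))  # 8
```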
9
Because the mean can be influenced by outliers the best
practice is to show both the mean and median on corporate
level reports and dashboards.
Corporate Reporting
10
• Mode is merely the most frequently occurring observation for
the variable of concern in your sample.
• A less commonly used measure of central tendency is the mode.
The mode is determined by the following rules:
1. The most frequently occurring observation in the data set.
2. If all observations within the data set occur the same number of
times, there is no mode.
3. If there is a tie for the most frequently occurring observation in the
data set, the data set has multiple modes.
There can be more than one mode in a set of values.
The Mode I
12
Example: Suppose the following are the ages of 10 college
graduates from NYU.
23 25 24 23 24 23 20 23 22 30
What is the mode?
The Mode II
Observation Occurrences
20 1
22 1
23 4
24 2
25 1
30 1
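The occurrence table above can be reproduced with a short Python sketch that also applies the three mode rules (single mode, no mode, multiple modes). The function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Rule 2: all observations occur the same number of times, so no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Rules 1 and 3: every value tied for the highest count is a mode.
    return sorted(v for v, c in counts.items() if c == highest)


ages = [23, 25, 24, 23, 24, 23, 20, 23, 22, 30]
print(sorted(Counter(ages).items()))  # occurrence table as (age, count) pairs
print(modes(ages))  # [23]
```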
13
The relationship between mean, median and mode is shown
below for various data sets.
Mean = Median = Mode
The mean, median and mode are all equal
when the distribution is symmetric and bell-shaped.
Mean < Median < Mode
The mean is less than the median and mode
when the distribution is skewed to the left.
Mode < Median < Mean
The mean is greater than the median and mode
when the distribution is skewed to the right.
Mean vs. Median vs. Mode
14
Let’s do an in-class example.
Suppose we sample 10 customers from the database and note their online
expenditures for 2010.
$27 $55 $42 $38 $75 $62 $54 $21 $398 $42
Calculate the mean and median, and explain what is happening here.
In Class Example
21 27 38 42 42 54 55 62 75 398
Mean:
(27 + 55 + 42 + 38 + 75 + 62 + 54 + 21 + 398 + 42) / 10 = 814 / 10 = 81.4
Median:
(42 + 54) / 2 = 48
What measure of central tendency is best?
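To see how strongly the single $398 value drives the answer, here is a small Python sketch using the standard library's `statistics` module. Dropping the outlier is done here only to illustrate its effect; it is not a recommendation from the slides.

```python
from statistics import mean, median

# Online expenditures (in dollars) for the 10 sampled customers.
spend = [27, 55, 42, 38, 75, 62, 54, 21, 398, 42]
without_outlier = [x for x in spend if x != 398]

# With the outlier the mean sits far above the median.
print(mean(spend), median(spend))  # 81.4 48.0

# Without it, the mean drops sharply while the median barely moves.
print(round(mean(without_outlier), 2), median(without_outlier))  # 46.22 42
```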
15
How does Excel do this for us?
Consider the table below which shows the ages of 10 college
graduates from NYU.
Age
23
25
24
23
24
23
20
23
22
30
Calculating the Mean, Median and Mode I
(Using Excel – PC or Mac prior version 8)
16
To obtain the measures of central tendency in Excel, we first go to the
Data tab, click Data Analysis and then select “Descriptive Statistics.”
Calculating the Mean, Median and Mode II
(Using Excel – PC or Mac prior version 8)
17
Highlight your data in the “Input Range,” check “Labels,”
choose your “Output Range,” and then check “Summary Statistics.”
Calculating the Mean, Median and Mode III
(Using Excel – PC or Mac prior version 8)
18
Then you have the mean, median and mode of the data set.
Calculating the Mean, Median and Mode IV
(Using Excel – PC or Mac prior version 8)
Skewness and kurtosis measure,
respectively, how asymmetric
the distribution is and how
peaked it is.
The skewness measure indicates the level of
non-symmetry. If the distribution of the data is
symmetric, then skewness will be close to 0
(zero). The further from 0, the more skewed the
data. A negative value indicates a skew to the
left; a positive value indicates a skew to the right.
Here we note the data are slightly skewed
to the right.
Kurtosis is a measure of the peakedness of the
data. Again, for data that are not excessively
peaked, kurtosis is close to 0 (zero). In this case our
data are somewhat peaked.
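For readers who want to see what sits behind these two numbers, the sketch below computes skewness and kurtosis for the ages data in Python. It assumes the adjusted sample formulas that Excel documents for its SKEW and KURT functions; whether the Descriptive Statistics tool uses exactly these formulas is an assumption here, and the function names are illustrative.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    m, s = mean(values), stdev(values)  # stdev divides by n - 1
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


ages = [23, 25, 24, 23, 24, 23, 20, 23, 22, 30]
print(round(sample_skewness(ages), 2))         # positive: skewed to the right
print(round(sample_excess_kurtosis(ages), 2))  # positive: more peaked than normal
```

Both values come out positive for this data set, which agrees with the reading above: a skew to the right and a somewhat peaked distribution.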
http://www.stattutorials.com/EXCEL/EXCEL-DESCRIPTIVE-STATISTICS.html
Levels of Kurtosis
Kurtosis of 0
22
When we examine the average home prices in Seattle, we may
also want to compare them to the average home prices in
Chicago.
For example, suppose the sample average for Seattle was
$186,000 and for Chicago it was $175,000.
Is the observed difference in average home prices significant?
We will learn this in a few weeks.
Teaser for Future Discussion
23
Appendix Part 1
How to Load the Analysis ToolPak in Windows Excel
24
• Go to the data tab.
• In all likelihood you will not see
the “Data Analysis” option on the
tool bar as is displayed here.
• So, here is what you do.
Appendix Part 1
25
• Click on the Office Button in
the upper left hand corner
and select “Excel Options”
Appendix Part 1
26
• Select Add-ins and then click
on the Go button.
Appendix Part 1
27
• Check off Analysis ToolPak and
select Ok.
• It will now be ready to use.
Appendix Part 1
28
Appendix Part 2
Loading StatPlus for Mac Users
29
Appendix Part 2
If you have a Mac, you unfortunately do not have the Analysis ToolPak option in
Excel. Luckily, StatPlus offers a free version for Mac owners with
Microsoft’s approval:
http://www.analystsoft.com/en/products/statplusmacle/
30
Appendix Part 2
Once downloaded, open the program, open the data, go to Statistics, then
Basic Statistics, and then Descriptive Statistics.
31
Appendix Part 2
Highlight the data and click OK
32
Appendix Part 2
Voila!
33
3.1 Which of the three measures of central tendency (the mean, the median,
and the mode) can be calculated for quantitative data only, and which
ones can be calculated for both quantitative and qualitative data? Illustrate
with examples.
3.2 Which of the three measures of central tendency (the mean, the median,
and the mode) can assume more than one value for a data set? Give an
example of a data set for which that summary measure assumes more
than one value.
3.3 Prices of cars have a distribution that is skewed to the right with outliers in
the right tail. Which of the measures of central tendency is the best to
summarize that data set? Explain.
3.4 The following data give the number of car thefts that occurred in a city
during the past 12 days.
6 3 7 11 5 3 8 7 2 6 9 13
Find the mean, median, and mode.
Section 3 Exercises I
34
3.5 The following data give the 2010 total area of farmland (in millions of
acres) for 10 states (Statistical Abstract of the United States). The data
entered in that order, are for the states of Colorado, Iowa, Kansas,
Minnesota, Missouri, Nebraska, North Dakota, Oklahoma, South Dakota,
and Texas, respectively. (Do in Excel)
33 33 48 30 30 47 40 34 44 129
a) Calculate the mean and median for these data.
b) Do these data contain an outlier? If yes, drop this value and
recalculate the mean and median. Which of the two summary
measures changes by a larger amount when you drop the outlier?
c) Is the mean or the median a better summary measure for these
data? Explain.
3.6 The mean 2009 income for five families was $39,520. What was the total
2009 income of these five families?
Section 3 Exercises II
35
3.7 Consider the following two data sets.
Data set 1: 12 25 37 8 41
Data set 2: 19 32 44 15 48
Notice that each value of the second data set is obtained by adding 7 to the
corresponding value of the first data set.
a) Calculate the mean for each of these two data sets.
b) Comment on the relationship between the two means.
Section 3 Exercises II
Section 4
Measures of Dispersion
2
• One way to aid in better understanding your sample data is with
descriptive measures or statistics.
• Each basic summary statistic has its own unique purpose and,
therefore, each plays a critical role in helping you describe and
understand your data. However, not fully understanding what
each of these statistics measures, how it is calculated,
or when to use one versus another can cause you to draw
erroneous conclusions.
• Did you know the spread of your data reveals how solid your
estimate of the mean is? The tighter your spread, the better your
estimate of the average. In other words, we want our data sets
to have as little spread as possible.
• This section will show you how to calculate the basic measures
of dispersion.
Introduction
3
• The two main measures of dispersion of concern are:
1. The Range
2. The Variance and Standard Deviation
Measures of Dispersion
4
• The range is the largest observation minus the smallest observation
for the variable of concern in your sample.
• The range gives a sense of the “true spread” of all observations in
the data set.
• In fact, the range is sometimes referred to as the “spread.” When
being reported, it is often accompanied by the minimum and
maximum values observed.
• The range is denoted by the following formula:
Range = (the maximum observation) – (the minimum observation)
The Range I
5
• For example, assume we survey 20 people and ask them their online
expenditures for 2009.
$100 $50 $100 $150 $125 $100 $80 $75 $125 $150
$150 $175 $50 $80 $25 $100 $100 $75 $125 $100
The Range II
What is the range of this data set?
Max = 175
Min = 25
Range = 175 – 25 = 150
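The same range calculation can be sketched in a few lines of Python; the variable names are illustrative.

```python
# Online expenditures (in dollars) for the 20 people surveyed.
spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

# Range = (the maximum observation) - (the minimum observation).
data_range = max(spend) - min(spend)

print(min(spend), max(spend), data_range)  # 25 175 150
```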
6
• The standard deviation is an average measure of dispersion of each
observation in your data set from the mean for the variable of
concern.
• In other words, the standard deviation tells us how far, on average,
the data lie from the mean.
• It tells us whether the observations for the variable of concern lie tightly
around the mean or are widely dispersed.
• To obtain the standard deviation, take the square root of the
variance.
The Standard Deviation I
7
• The sample variance ($S^2$) is the sum of the squared differences between
each observation and the mean, divided by your sample size minus one.
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
• We square the difference because we do not care if the observation
is, for example, 5 units above or below the mean but only that it is 5
units away from the mean.
• We divide by n – 1 rather than n because dividing by n – 1 gives an
unbiased estimate of the true population variance.
• To determine the standard deviation (S), we take the square root of
the variance:
$$S = \sqrt{S^2}$$
The Standard Deviation II
8
• We denote the population variance and standard deviation with the
Greek letter sigma:
• Population variance = $\sigma^2$
• Population standard deviation = $\sigma$
The Standard Deviation III
9
• Let’s calculate the variance and standard deviation for our online
spend example.
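A minimal Python sketch of this calculation, applying the sample variance definition (squared deviations from the mean, divided by n – 1) to the 20 online expenditure values from the range example. The variable names are illustrative.

```python
from math import sqrt

# Online expenditures (in dollars) for the 20 people surveyed.
spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

n = len(spend)
mean = sum(spend) / n  # 101.75

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in spend]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 1453.36 38.12
```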
The Standard Deviation IV
10
• We can also use the shortcut formula as follows:
Using the shortcut formula for our online spend example we get:
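The sketch below assumes the shortcut formula meant here is the usual computational form of the sample variance, $S^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$, which avoids computing each deviation from the mean. That choice is an assumption; the variable names are illustrative.

```python
from math import sqrt

# Online expenditures (in dollars) for the 20 people surveyed.
spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

n = len(spend)
sum_x = sum(spend)                         # sum of the observations
sum_x_squared = sum(x * x for x in spend)  # sum of the squared observations

# Assumed shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)
std_dev = sqrt(variance)

print(sum_x, sum_x_squared)                   # 2035 234675
print(round(variance, 2), round(std_dev, 2))  # 1453.36 38.12
```

The result matches what the definition of the sample variance gives for the same data, since the two forms are algebraically equivalent.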
The Standard Deviation V
11
• The larger the variation, the more spread out the data.
• The larger the variation, the more difficult it will be for the
marketer to make inferences about the data.
Dispersion of data around the mean.
[Figure: three distributions with variances $S_1^2$, $S_2^2$ and $S_3^2$, where $S_3^2 > S_2^2 > S_1^2$]
The Standard Deviation VI
12
• There are several rules, but we will focus only on the “Empirical
Rule,” which states that if your sample has a symmetric and
bell-shaped distribution then:
Data Dispersion Rules I
68% of the observations within a data set will lie within one
standard deviation of the mean
95% of the observations within a data set will lie within two
standard deviations of the mean
99.7% of the observations within a data set will lie within three
standard deviations of the mean
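These three percentages can be checked against the normal curve with Python's standard library. The sketch below uses `statistics.NormalDist` (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# prints 68.3, 95.4 and 99.7 (percent) for k = 1, 2 and 3
```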
13
• Pictorially this looks as follows:
The Empirical Rule
[Figure: bell curve centered on the mean, marked at 1, 2 and 3 standard deviations below and above the mean]
Data Dispersion Rules II
14
• We judge observations in our data to be outliers by
checking whether they lie more than 3, 4, 5, or 6 standard deviations
from the mean.
• The threshold depends on the quantities you are dealing with.
• Legitimate or not, you must remove outliers before examining
relationships in the data.
• SAS and SPSS offer a drop-down menu where you can easily
eliminate outliers.
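A minimal Python sketch of this standard deviation check, using the 10 online expenditure values from the Section 3 in-class example. The function name `flag_outliers` and the thresholds tried are illustrative assumptions, not settings taken from SAS or SPSS.

```python
from statistics import mean, stdev


def flag_outliers(values, k=3):
    """Return the values lying more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs(x - m) > k * s]


spend = [27, 55, 42, 38, 75, 62, 54, 21, 398, 42]

print(flag_outliers(spend, k=3))    # []
print(flag_outliers(spend, k=2.5))  # [398]
```

The $398 value sits about 2.8 standard deviations from the mean, so a cutoff of 3 misses it while a cutoff of 2.5 catches it. In a sample this small the outlier itself inflates the standard deviation, which is one reason the cutoff has to suit the data at hand.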
Outliers
15
• Let’s now use Excel to calculate the variance and standard
deviation.
Consider the table below which shows
the ages of 10 college graduates from NYU.
Age
23
25
24
23
24
23
20
23
22
30
Calculating the Standard Deviation I
(Using Excel – PC)
16
• To determine the measures of dispersion in Excel, we first go to the Data tab,
click Data Analysis and then select “Descriptive Statistics” (just as we did to calculate
the mean).
Calculating the Standard Deviation II
(Using Excel – PC)
17
• Highlight your data in the “Input Range,” check “Labels,”
choose your “Output Range,” and then check “Summary Statistics.”
Calculating the Standard Deviation III
(Using Excel – PC)
18
• Then you have the range and standard deviation of the data set.
Calculating the Standard Deviation IV
(Using Excel – PC)
19
4.1 The range, as a measure of spread, has the disadvantage of being
influenced by outliers. Illustrate this with an example.
4.2 Can the standard deviation have a negative value? Explain.
4.3 When is the value of the standard deviation for a data set zero? Give an
example. Calculate the standard deviation for this example and show that
its value is zero.
4.4 The following table gives the 2009 revenues (rounded to billions of
dollars) of the top 10 companies in Fortune magazine’s Global 500
(Fortune Magazine). Find the range, variance, and standard deviation
for these data (by hand and Excel).
Company                                 Revenue (in billions of U.S. dollars)
Mitsubishi (Japan)                      184
Mitsui (Japan)                          182
Itochu (Japan)                          169
General Motors (U.S.)                   169
Sumitomo (Japan)                        168
Marubeni (Japan)                        161
Ford Motor (U.S.)                       137
Toyota Motor (Japan)                    111
Exxon (U.S.)                            110
Royal Dutch/Shell Group (Brit/Neth)     110
Section 4 Exercises I
20
Section 4 Exercises II
4.5 Consider the following two data sets.
Data Set I: 12 25 37 8 41
Data Set II: 19 32 44 15 48
Note that each value of the second data set is obtained by adding 7 to the
corresponding value of the first data set. Calculate the standard deviation
for each of these two data sets using the formula for sample data.
Comment on the relationship between the two standard deviations.
Section 5
Data Distribution
2
• There are many distributional forms that our marketing data
can take on.
• Most data follows a specific form of one type or another.
• Because of this fact, we can make estimates and forecasts
about what our data is telling us and do so with a certain level of
confidence.
Introduction
3
Examples of common forms of data in industry are:
– Time to failure follows what is known as an exponential
distribution. Xerox takes advantage of this well-known fact to
determine servicing needs and to set contract prices for its
various equipment.
– Department stores take advantage of the fact that the period of time
between the arrivals of two successive customers also follows an
exponential distribution.
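As an illustration of how an exponential distribution gets used, the sketch below computes the chance that the gap between two successive arrivals exceeds a given time. The two-minute average gap is a hypothetical figure, and the function name is illustrative.

```python
from math import exp


def prob_wait_longer_than(t, mean_gap):
    """P(T > t) for an exponential distribution with the given mean."""
    return exp(-t / mean_gap)


# Hypothetical store: customers arrive on average every 2 minutes.
# Chance that more than 5 minutes pass between two successive customers:
print(round(prob_wait_longer_than(5, mean_gap=2), 3))  # 0.082
```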
Examples of Data Distribution I
4
Examples of common forms of data in industry are:
– If you are interested in counting the number of occurrences of a
specific event within a set time period then we have what is called
the Poisson distribution.
– The number of murders in NYC during the month of April follows a
Poisson distribution. Based on this fact, the city can then
estimate the murder rate for the next month.
– The hypergeometric distribution is used when doing QC
checking for defectives in a large lot. With this distribution, you
can determine the probability that in a sample of size n from the
entire lot of size N, you will accept it when in reality it exceeds your
defective rate.
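Both the Poisson and the hypergeometric calculations can be sketched with Python's standard library. All numbers in the example (the monthly rate, lot size, sample size and acceptance rule) are hypothetical, and the function names are illustrative.

```python
from math import comb, exp, factorial


def poisson_pmf(k, rate):
    """P(exactly k events in the period) when events occur at the given average rate."""
    return rate ** k * exp(-rate) / factorial(k)


def prob_accept_lot(lot_size, defectives, sample_size, max_defectives_allowed):
    """Hypergeometric P(the sample contains at most max_defectives_allowed defectives)."""
    total = comb(lot_size, sample_size)
    favourable = sum(
        comb(defectives, d) * comb(lot_size - defectives, sample_size - d)
        for d in range(max_defectives_allowed + 1)
    )
    return favourable / total


# Hypothetical: an average of 4 events per month; chance of seeing exactly 6.
print(round(poisson_pmf(6, rate=4), 3))

# Hypothetical: a lot of 1,000 with 100 defectives (10%), sample of 20,
# accepted if the sample shows at most 1 defective.
print(round(prob_accept_lot(1000, 100, 20, 1), 3))
```

The second figure is the risk described above: the chance of accepting a lot whose true defective rate is higher than you would tolerate.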
Examples of Data Distribution I
5
• However, the most prevalent and most important functional
form for a marketer is the normal distribution.
• Also known as the bell-shaped curve.
• Many data elements follow the normal distribution:
– Age
– Income Levels
– GPAs
– Spend
– Years of education
– Etc.
• And because all these measures tend to follow the normal curve,
we as marketers and researchers are able to make many
inferences about our metrics with a high degree of confidence.
Examples of Data Distribution II
6
• Referring back to the distribution of income from a prior chapter,
we now know this data to be normally distributed.
• The distribution of income was symmetric and bell-shaped.
[Figure: Histogram and polygon for the relative frequency distribution of income levels. x-axis: Income Categories, from $10,000 – $25,000 to $100,000 – $115,000 in $15,000 bands; y-axis: Relative Frequency, 0 to 0.35]
Normal Distribution
7
• What would you expect the measures of skewness and kurtosis to
be for these data?
[Figure: Histogram and polygon for the relative frequency distribution of income levels. x-axis: Income Categories; y-axis: Relative Frequency]
Normal Distribution
8
• As you recall from Section 4, we discussed the Empirical Rule.
This rule stated that if the distribution of your data is symmetric
and bell-shaped (now known as normally distributed data):
– 68% of the observations within the data set will lie within one
standard deviation of the mean
– 95% of the observations within the data set will lie within two
standard deviations of the mean
– 99.7% of the observations within the data set will lie within
three standard deviations of the mean
The Spread of Normally Distributed Data I
9
• Pictorially, this looks as follows:
[Figure: bell curve centered on the mean, marked at 1, 2 and 3 standard deviations below and above the mean]
The Spread of Normally Distributed Data II
10
5.1 Rite Aid Pharmacy wishes to monitor the number of customers arriving at
the checkout counter on Sunday afternoons for staffing purposes. What
distributional form will this data follow?
5.2 How is the time to failure for GE light bulbs
distributed?
5.3 How would you suspect the average age of your customer base to be
distributed?
5.4 Income on your customer database is distributed normally with a mean of
$55,000 and a standard deviation of $10,000. What percent of the
database do you estimate will have an income within the range $35,000
to $75,000?
5.5 How do the width and height of a normal distribution change when its
mean remains the same but its standard deviation decreases? Show this
graphically.
5.6 How do the width and/or height of a normal distribution change when its
standard deviation remains the same but its mean increases? Show this
graphically.
Section 5 Exercises
Section 6
The Central Limit Theorem
2
• The assessment of sample means (averages and percentages)
is the basis of many everyday business decisions.
• Therefore, understanding exactly how an average is distributed
is KEY to properly assessing one versus another (Pre vs. Post,
Control vs. Test, etc.).
• As luck would have it, averages will approximately follow a
normal distribution as n gets large (n > 30).
Introduction
3
• When you take a sample from a population and calculate the
average dollars spent, for example, that average or mean has
certain distributional properties.
• According to the Central Limit Theorem, regardless of how the
population from which we sampled is distributed, the sample
mean or response rate or click through rate (for n ≥ 30) will be
normally distributed with a mean equal to the mean of the
population from which the sample came and a standard
deviation equal to the standard deviation of the population from
which the sample came divided by the square root of the sample
size.
• This was proven a long time ago. And we can take advantage
of it.
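A small simulation can make the theorem concrete. The sketch below draws repeated samples from a strongly skewed (exponential) population and checks that the sample means centre on the population mean with a spread close to the population standard deviation divided by the square root of n. The population, sample size and number of samples are all hypothetical choices made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
population_mean = 100.0  # hypothetical average spend; for an exponential
population_sd = 100.0    # population the standard deviation equals the mean
n = 50                   # size of each sample
num_samples = 2000       # number of samples drawn

# Draw the samples and keep only each sample's mean.
sample_means = [
    mean(rng.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(num_samples)
]

print(round(mean(sample_means), 1))     # close to the population mean
print(round(stdev(sample_means), 1))    # close to the value on the next line
print(round(population_sd / sqrt(n), 1))
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual spend values are heavily skewed.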
The Central Limit Theorem I
4
• So what does this mean? Even if the distribution of dollars
spent is highly skewed, not symmetric, and therefore not
normal at all, your statistics such as average spend will be
approximately normal.
• Let’s take a look at what the Central Limit Theorem is saying.
The Central Limit Theorem I
5
[Diagram: the ACME Database (10,000,000 customer records) feeds Sample 1, Sample 2, Sample 3, Sample 4, …, Sample 1,000 (n = 10,000 customers each); each sample i yields $\bar{X}_i$ = avg. income of sample i, giving $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$, $\bar{X}_4$, …, $\bar{X}_{1,000}$]
The Central Limit Theorem II
• The database analyst at ACME Direct draws 1,000 random
samples, of size 10,000 each, from the database and observes
the data element “household income.”
6
• The analyst then calculates the mean “household income” for
each of the 1,000 samples and creates a frequency distribution
of the average incomes from the 1,000 samples drawn.
Income Range Frequency
Less than $10,000 32
$10,000 – $20,000 119
$20,000 – $30,000 187
$30,000 – $40,000 326
$40,000 – $50,000 192
$50,000 – $60,000 116
$60,000+ 28
Total 1,000
The Central Limit Theorem III
7
• The analyst creates a histogram using the frequency table and
notes the distribution of these 1,000 sample mean income
values is normally distributed (symmetric and bell-shaped).
[Figure: Distribution of 1,000 Sample Means. x-axis: Income Ranges, from Less than $10,000 to $60,000+; y-axis: Frequency, 0 to 350]
The Central Limit Theorem IV
8
• According to the Central Limit Theorem, the histogram will be
normally distributed (bell-shaped and symmetric) with
– a mean equal to the true mean “household income” level of
all people on the database and
– a standard deviation equal to the true standard deviation
for all people on the database divided by the square
root of n.
The Central Limit Theorem V
9
The normal curve is so important that it was printed on the German 10
Deutsche Mark banknote, which remained in use until the euro replaced
the Deutsche Mark.
Gauss and the Deutsche Mark
http://en.wikipedia.org/wiki/File:Carl_Friedrich_Gauss
http://upload.wikimedia.org/wikipedia/commons/0/0d/10_DM_Serie4_Vorderseite
10
Suppose you work for American Express and ran a test with
1,000 new card members to stimulate spend.
• Based on your test you observed an average spend of $175 with
a standard deviation of $25.
• You know this is not reality because you only conducted a test.
How can you estimate the spend level for rollout to all new card
members?
• Based on the CLT we can say with 95% certainty that true average spend
will lie within plus or minus 2 standard deviations of our sample average,
where the standard deviation of the sample average is $25 divided by
the square root of 1,000, or about $0.79.
• So, in other words, the true average spend should lie somewhere between
about $173.42 and $176.58 with 95% certainty.
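The interval arithmetic can be sketched in Python. This sketch applies the Central Limit Theorem as stated earlier in this section (the standard deviation of a sample mean is the standard deviation of the data divided by the square root of n) and treats the $25 as the standard deviation of individual card members' spend, which is an assumption. The function name is illustrative.

```python
from math import sqrt


def approx_95_interval(sample_mean, sample_sd, n):
    """Sample mean plus or minus 2 standard deviations of the sample mean."""
    standard_error = sample_sd / sqrt(n)  # standard deviation of the sample mean
    return sample_mean - 2 * standard_error, sample_mean + 2 * standard_error


low, high = approx_95_interval(175, 25, 1000)
print(round(low, 2), round(high, 2))  # 173.42 176.58

# If the $25 were already the standard deviation of the average itself,
# passing n=1 gives the corresponding interval.
print(approx_95_interval(175, 25, 1))  # (125.0, 225.0)
```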
An Example of the CLT in Practice
11
6.1 You note the number of arrivals each day at Starbucks in Grand Central
Station between the hours of 5 pm and 6 pm for the months of April and
May. You do the same for the Starbucks at Penn Station. You calculate
the average for Grand Central and the average for Penn Station.
a) How is the variable number of arrivals between 5pm and 6pm
distributed?
b) How is the average number of arrivals for Grand Central and Penn
Station distributed?
6.2 You know income to be normally distributed on your customer file. You
sample 20 people on the file. How will the average income be distributed
for this small sample?
6.3 Income on your database is highly skewed. You sample 20 people on the
file. How will the average income be distributed for this small sample?
Section 6 Exercises I
12
6.4 Income on your database is highly skewed. You sample 1,000 people on
the file. How will the average income be distributed for this sample of size
1,000?
Section 6 Exercises II
Graphing/Charting Project
Below is data obtained from the National Cancer Society with projections for the number of newly diagnosed cancer patients through 2050, broken out by age.
Age of Newly Diagnosed Cancer Patients
Year   <50    50-64  65-74  75-84  >85
2000   0.17   0.38   0.35   0.31   0.14
2010   0.17   0.46   0.57   0.34   0.14
2020   0.17   0.55   0.68   0.40   0.20
2030   0.18   0.60   0.70   0.62   0.16
2040   0.23   0.59   0.69   0.71   0.28
2050   0.22   0.68   0.65   0.70   0.42
Note: .17 equals 170,000 for example
Instructions:
Prepare one slide that best presents this data graphically and make appropriate observations. It would be interesting to try to show whether the number of cancer patients is going up year after year and how the distribution by age range is also changing, all in one graphical representation. A stacked bar chart or area chart might serve this purpose well.
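Before charting, it can help to tabulate the two things a stacked bar chart would show: the total per year (bar height) and each age group's share of that total (segment size). The Python sketch below does only that preparation step and is one possible starting point, not a required approach. Values are in millions, following the note that .17 equals 170,000.

```python
age_groups = ["<50", "50-64", "65-74", "75-84", ">85"]

# Newly diagnosed cancer patients (in millions) by year and age group.
new_cases = {
    2000: [0.17, 0.38, 0.35, 0.31, 0.14],
    2010: [0.17, 0.46, 0.57, 0.34, 0.14],
    2020: [0.17, 0.55, 0.68, 0.40, 0.20],
    2030: [0.18, 0.60, 0.70, 0.62, 0.16],
    2040: [0.23, 0.59, 0.69, 0.71, 0.28],
    2050: [0.22, 0.68, 0.65, 0.70, 0.42],
}

# Total per year: the height of each stacked bar.
totals = {year: round(sum(counts), 2) for year, counts in new_cases.items()}

for year, counts in new_cases.items():
    # Share of each age group in that year's total, in whole percent.
    shares = [round(100 * c / totals[year]) for c in counts]
    print(year, totals[year], dict(zip(age_groups, shares)))
```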