All of the exercises at the end of the chapters must be completed (Microsoft Excel). The cancer graphing assignment does not have a course attached.

Section 3

Measures of Central Tendency

Rhonda Knehans Drake

Associate Professor, New York University

Data Analytics, Interpretation and Reporting

Copyright © 2013

2

• One way to aid in better understanding your sample data is with

descriptive measures or statistics.

• Each basic summary statistic has its own unique purpose and,

therefore, each plays a critical role in helping you describe and

understand your data. However, not fully understanding what

each of these statistics is measuring and how they are

calculated or when to use one versus the other can cause you to

draw erroneous conclusions.

• For example, did you know that the average or mean is quite

sensitive to extreme data values (outliers) which could cause you

to make incorrect conclusions based on this data? Do you know

what to do in this case?

• This section will show you how to calculate the basic measures

of central tendency.

Introduction

3

• One way to gain a better understanding of your quantitative data is

with measures of central tendency.

• These measures tell, for example, where the center of the distribution

of income levels lie.

• The Three Measures of Central Tendency are:

– Mean

– Median

– Mode

Measures of Central Tendency

4

• The Mean is the sum of all observations divided by the number of

observations.

• The mean or average is the most widely used measure of central

tendency of a set of observations.

• We will denote the sample mean as x̄ (pronounced “x bar”), the

population mean as µ (the Greek letter mu), the number of

observations in a sample as n, the number of observations in a

population as N, the observations for the variable of concern as x1,

x2, x3, x4,… and the sum of all observations as Σx (Σ is the

uppercase Greek letter sigma).

• With the sample mean we are estimating the population mean

denoted as µ.

The Mean I

5

Example: Suppose the following are the prices of 5 houses sold in

Seattle, in thousands of dollars.

158 189 265 127 191

What is the mean?

The Mean II

6

• Sometimes a data set may contain outliers which are extremely

low or high values. They may be legitimate or not legitimate.

Example: Consider our sample prices of houses. Assume the 265

figure was 982 instead.

158 189 982 127 191

This outlier value of 982 will pull up the mean, so the mean will no longer

be a representative measure.

New mean = $1,647 / 5 = $329 (in thousands)

So, what do we do in this case?

The Mean III

Old Mean = $186
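
The effect of the outlier on the mean can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics` module; the variable names are illustrative and are not part of the course materials.

```python
from statistics import mean

# House prices in thousands of dollars, taken from the two examples above.
prices = [158, 189, 265, 127, 191]
prices_with_outlier = [158, 189, 982, 127, 191]

old_mean = mean(prices)               # 930 / 5
new_mean = mean(prices_with_outlier)  # 1,647 / 5

print(f"Old mean: {old_mean} (thousands of dollars)")
print(f"New mean: {new_mean} (thousands of dollars)")
```

Changing a single observation out of five moves the mean from 186 to about 329, which is higher than four of the five prices in the sample.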

7

• Median represents the “exact middle” observation for the variable

of concern when the values in your sample are ranked from

lowest to highest.

• Median is also an important measure of central tendency and is

not as sensitive to outliers.

The median is determined by performing the following steps:

1. Rank the observations in your data set from the lowest value to

the highest value.

2. Select the observation in position (n + 1)/2 of this ranked data set, where

n is the size of the sample drawn.

If the sample size, n, is an even number then (n + 1)/2 will lie

exactly between two observations. In this case, the median is

simply the average of these two observations.

The Median I

8

Example: Consider the following 5 observations which are weight

loss figures in pounds at a health club after 4 weeks for new

members.

10 5 19 8 3

What is the median?

The Median II

3 5 8 10 19
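
The ranking steps for the median can be sketched in Python as follows. The function name `median` and its error handling are assumptions made for illustration; the logic follows the (n + 1)/2 rule described earlier, including the averaging step for an even sample size.

```python
def median(values):
    """Return the median using the (n + 1)/2 position rule."""
    ranked = sorted(values)             # Step 1: rank from lowest to highest
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")

    position = (n + 1) / 2              # Step 2: 1-based position in the ranking
    if position.is_integer():           # odd n: a single middle observation
        return ranked[int(position) - 1]

    # Even n: the position falls between two observations, so average them.
    lower = ranked[int(position) - 1]
    upper = ranked[int(position)]
    return (lower + upper) / 2


weight_loss = [10, 5, 19, 8, 3]
print(median(weight_loss))              # middle value of 3 5 8 10 19
```

With five observations the position is (5 + 1)/2 = 3, so the function returns the third ranked value.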

9

Because the mean can be influenced by outliers the best

practice is to show both the mean and median on corporate

level reports and dashboards.

Corporate Reporting

10

• Mode is merely the most frequently occurring observation for

the variable of concern in your sample.

• A less commonly used measure of central tendency is the mode.

The mode is determined by one of the following factors:

1. The most frequently occurring observation in the data set.

2. If all observations within the data set occur the same number of

times, there is no mode.

3. If there is a tie for the most frequently occurring observation in the

data set, the data set has multiple modes.

There can be more than one mode in a set of values.

The Mode I

12

Example: Suppose the following are the ages of 10 college

graduates from NYU.

23 25 24 23 24 23 20 23 22 30

What is the mode?

The Mode II

Observation Occurrences

20 1

22 1

23 4

24 2

25 1

30 1
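
The occurrence table above can be reproduced with a short Python sketch. The helper name `modes` is hypothetical; it applies the three rules listed for the mode (most frequent value, no mode when every value occurs equally often, several modes on a tie).

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Rule 2: every distinct value occurs the same number of times -> no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Rules 1 and 3: one or more values share the highest count.
    return sorted(v for v, c in counts.items() if c == highest)


ages = [23, 25, 24, 23, 24, 23, 20, 23, 22, 30]
occurrences = Counter(ages)

for age in sorted(occurrences):
    print(age, occurrences[age])
print("Mode(s):", modes(ages))
```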

13

The relationship between mean, median and mode is shown

below for various data sets.

Mean = Median = Mode

The mean, median and mode are all equal

when the distribution is symmetric and bell-shaped.

Mean < Median < Mode

The mean is less than the median and mode

when the distribution is skewed to the left.

Mode < Median < Mean

The mean is greater than the median and mode

when the distribution is skewed

to the right.

Mean vs. Median vs. Mode

14

Let’s do an in-class example.

Suppose we sample 10 customers from the database and note their online

expenditures for 2010.

$27 $55 $42 $38 $75 $62 $54 $21 $398 $42

Calculate the mean and median, and explain what is happening here.

In Class Example

21 27 38 42 42 54 55 62 75 398

Mean:

$814 / 10 = $81.40

Median:

(42 + 54) / 2 = 48

What measure of central tendency is best?

15

How does Excel do this for us?

Consider the table below which shows the ages of 10 college

graduates from NYU.

Age

23

25

24

23

24

23

20

23

22

30

Calculating the Mean, Median and Mode I

(Using Excel – PC or Mac prior version 8)

16

To gain the measures of central tendency in Excel, we first go to

data, data analysis and then click “Descriptive Statistics.”

Calculating the Mean, Median and Mode II

(Using Excel – PC or Mac prior version 8)

17

Highlight your data in the “Input Range,” check “ Labels,” and

decide your “Output Range”, then check “Summary Statistics.”

Calculating the Mean, Median and Mode III

(Using Excel – PC or Mac prior version 8)

18

Then you have the mean, median and mode of the data set.

Calculating the Mean, Median and Mode IV

(Using Excel – PC or Mac prior version 8)

19

Skewness and kurtosis measure, respectively, how asymmetric

the distribution is and how peaked it is.

20

The skewness measure indicates the level of

non-symmetry. If the distribution of the data is

symmetric then skewness will be close to 0

(zero). The further from 0, the more skewed the

data. A negative value indicates a skew to the

left and a positive value a skew to the right. Here

we note the data is slightly skewed to the right.

Kurtosis is a measure of the peakedness of the

data. Excel reports excess kurtosis, so data that

is not excessively peaked (bell-shaped) has a

kurtosis close to 0 (zero). In this case our

data is somewhat peaked.
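
For readers who want to see what sits behind the two numbers, here is a minimal Python sketch. It assumes the adjusted sample formulas that Excel documents for its `SKEW` and `KURT` worksheet functions; the function names are illustrative, and small differences from the Excel output due to rounding are possible.

```python
from math import sqrt


def _mean_and_sd(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((x - m) ** 2 for x in values) / (n - 1))
    return m, s


def sample_skewness(values):
    """Adjusted sample skewness; needs at least 3 observations."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    m, s = _mean_and_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis; needs at least 4 observations."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    m, s = _mean_and_sd(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


ages = [23, 25, 24, 23, 24, 23, 20, 23, 22, 30]
print(f"Skewness: {sample_skewness(ages):.2f}")
print(f"Kurtosis: {sample_excess_kurtosis(ages):.2f}")
```

Both values come out positive for the ages data: the single graduate aged 30 stretches the right tail.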

http://www.stattutorials.com/EXCEL/EXCEL-DESCRIPTIVE-STATISTICS.html

Levels of Kurtosis

Kurtosis of 0

22

When we examine the average home prices in Seattle, we may

also want to compare them to the average home prices in

Chicago.

For example, suppose the sample average for Seattle was

$186,000 and for Chicago it was $175,000.

Is the observed difference in average home prices significant?

We will learn this in a few weeks.

Teaser for Future Discussion

23

Appendix Part 1

How to load the Analysis ToolPak in Windows Excel

24

• Go to the data tab.

• In all likelihood you will not see

the “Data Analysis” option on the

tool bar as is displayed here.

• So, here is what you do.

Appendix Part 1

25

• Click on the Office Button in

the upper left hand corner

and select “Excel Options”

Appendix Part 1

26

• Select Add-ins and then click

on the Go button.

Appendix Part 1

27

• Check off Analysis ToolPak and

select Ok.

• It will now be ready to use.

Appendix Part 1

28

Appendix Part 2

Loading StatPlus for Mac Users

29

Appendix Part 2

If you have a Mac, you unfortunately do not have the Analysis ToolPak option in

Excel. Luckily, StatPlus kindly offers a free version for Mac owners with

Microsoft’s approval:

http://www.analystsoft.com/en/products/statplusmacle/

30

Appendix Part 2

Once downloaded, open the program, open the data, go to Statistics, then

Basic Statistics and then Descriptive Statistics.

31

Appendix Part 2

Highlight the data and click OK

32

Appendix Part 2

Voila!

33

3.1 Which of the three measures of central tendency (the mean, the median,

and the mode) can be calculated for quantitative data only, and which

ones can be calculated for both quantitative and qualitative data? Illustrate

with examples.

3.2 Which of the three measures of central tendency (the mean, the median,

and the mode) can assume more than one value for a data set? Give an

example of a data set for which that summary measure assumes more

than one value.

3.3 Prices of cars have a distribution that is skewed to the right with outliers in

the right tail. Which of the measures of central tendency is the best to

summarize that data set? Explain.

3.4 The following data give the number of car thefts that occurred in a city

during the past 12 days.

6 3 7 11 5 3 8 7 2 6 9 13

Find the mean, median, and mode.

Section 3 Exercises I

34

3.5 The following data give the 2010 total area of farmland (in millions of

acres) for 10 states (Statistical Abstract of the United States). The data

entered in that order, are for the states of Colorado, Iowa, Kansas,

Minnesota, Missouri, Nebraska, North Dakota, Oklahoma, South Dakota,

and Texas, respectively. (Do in Excel)

33 33 48 30 30 47 40 34 44 129

a) Calculate the mean and median for these data.

b) Do these data contain an outlier? If yes, drop this value and

recalculate the mean and median. Which of the two summary

measures changes by a larger amount when you drop the outlier?

c) Is the mean or the median a better summary measure for these

data? Explain.

3.6 The mean 2009 income for five families was $39,520. What was the total

2009 income of these five families?

Section 3 Exercises II

35

3.7 Consider the following two data sets.

Data set 1: 12 25 37 8 41

Data set 2: 19 32 44 15 48

Notice that each value of the second data set is obtained by adding 7 to the

corresponding value of the first data set.

a) Calculate the mean for each of these two data sets.

b) Comment on the relationship between the two means.

Section 3 Exercises II

Section 4

Measures of Dispersion

Rhonda Knehans Drake

Associate Professor, New York University

Data Analytics, Interpretation and Reporting

Copyright © 2013

2

• One way to aid in better understanding your sample data is with

descriptive measures or statistics.

• Each basic summary statistic has its own unique purpose and,

therefore, each plays a critical role in helping you describe and

understand your data. However, not fully understanding what

each of these statistics is measuring, how they are calculated

or when to use one versus the other can cause you to draw

erroneous conclusions.

• Did you know the spread of your data reveals how solid your

estimate of the mean is? The tighter your spread the better your

estimate of the average. In other words, we want our data sets

to have as little spread as possible.

• This section will show you how to calculate the basic measures

of dispersion.

Introduction

3

• The two main measures of dispersion of concern are:

1. The Range

2. The Variance and Standard Deviation

Measures of Dispersion

4

• The range is the largest observation minus the smallest observation

for the variable of concern in your sample.

• The range gives a sense of the “true spread” of all observations in

the data set.

• In fact, the range is sometimes referred to as the “spread.” When

being reported, it is often accompanied by the minimum and

maximum values observed.

• The range is denoted by the following formula:

Range = (the maximum observation) – (the minimum observation)

The Range I

5

• For example, assume we survey 20 people and ask them their online

expenditures for 2009.

$100 $50 $100 $150 $125 $100 $80 $75 $125 $150

$150 $175 $50 $80 $25 $100 $100 $75 $125 $100

The Range II

What is the range of this data set?

Max = 175

Min = 25

Range = 175 – 25 = 150

6

• The standard deviation is an average measure of dispersion of each

observation in your data set from the mean for the variable of

concern.

• In other words, the standard deviation tells us how much, on average,

the data lies from the mean.

• It will answer if the observations for the variable of concern lie tightly

or are widely dispersed around the mean.

• To obtain the standard deviation, take the square root of the

variance.

The Standard Deviation I

7

• The sample variance (S²) is the sum of the squared differences between

each observation and the mean, divided by the sample size minus one:

S² = Σ(X – X̄)² / (n – 1)

• We square the difference because we do not care if the observation

is, for example, 5 units above or below the mean but only that it is five

units away from the mean.

• We divide by n – 1 rather than n because it was proven a long time ago

that dividing by n – 1 provides an unbiased estimate of

the true population variance.

• To determine the standard deviation (S), we take the square root of

the variance:

S = √(S²)

The Standard Deviation II

8

• We denote the population variance and standard deviation with the

Greek letter sigma:

• Population variance = σ²

• Population standard deviation = σ

The Standard Deviation III

9

• Let’s calculate the variance and standard deviation for our online

spend example.

The Standard Deviation IV
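
The calculation for the online spend example can be followed step by step in Python. This is a sketch that applies the definitional formula (sum of squared deviations divided by n – 1) to the 20 expenditures listed under the range example; the variable names are illustrative.

```python
from math import sqrt

# Online expenditures (in dollars) for the 20 people surveyed.
spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

n = len(spend)
mean = sum(spend) / n

# Squared distance of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in spend]

variance = sum(squared_deviations) / (n - 1)   # sample variance
std_dev = sqrt(variance)                       # sample standard deviation

print(f"n = {n}, mean = {mean:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

The standard deviation is in the same units as the data (dollars), while the variance is in squared dollars, which is why the standard deviation is usually the figure that gets reported.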

10

• We can also use the shortcut formula as follows:

Using the shortcut formula for our online spend example we get:

The Standard Deviation V
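
A commonly used shortcut form of the variance, assumed here, is S² = (Σx² – (Σx)²/n) / (n – 1). It needs only the sum of the observations and the sum of their squares. The sketch below applies it to the online spend data and should agree with the definitional formula.

```python
from math import sqrt

# Online expenditures (in dollars) for the 20 people surveyed.
spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

n = len(spend)
sum_x = sum(spend)                    # sum of the observations
sum_x2 = sum(x * x for x in spend)    # sum of the squared observations

# Shortcut form (an assumption of this sketch, see the lead-in).
variance_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev_shortcut = sqrt(variance_shortcut)

print(f"sum of x = {sum_x}, sum of x squared = {sum_x2}")
print(f"variance = {variance_shortcut:.2f}, "
      f"standard deviation = {std_dev_shortcut:.2f}")
```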

11

• The larger the variation, the more spread out the data.

• The larger the variation, the more difficult it will be for the

marketer to make inferences about the data.

Dispersion of data around the mean: three distributions with variances S₁², S₂² and S₃², where S₃² > S₂² > S₁².

The Standard Deviation VI

12

• There are several rules but we will only focus on the “Empirical

Rule,” which states that if your sample has a symmetric and

bell-shaped distribution then:

Data Dispersion Rules I

68% of the observations within a data set will lie within one

standard deviation of the mean

95% of the observations within a data set will lie within two

standard deviations of the mean

99.7% of the observations within a data set will lie within three

standard deviations of the mean

13

• Pictorially this looks as follows:

The Empirical Rule

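[Figure: bell curve with the horizontal axis marked at the mean and at 1, 2 and 3 standard deviations on either side of it]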

Data Dispersion Rules II

14

• We determine observations in our data to be outliers by

examining if they lie more than 3, 4, 5, or 6 standard deviations

from the mean.

• It all depends on the quantities you are dealing with.

• Legitimate or not, you must remove outliers before examining

relationships in the data.

• SAS and SPSS offer a drop down menu where you can easily

eliminate outliers.

Outliers
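
The standard-deviation rule for spotting outliers can be sketched in Python. The function name `flag_outliers`, the default cut-off of 3 and the extra $1,000 observation are all hypothetical and chosen only for illustration; the remaining values are the online spend data from the range example.

```python
from statistics import mean, stdev


def flag_outliers(values, k=3):
    """Return the observations lying more than k standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)          # sample standard deviation (n - 1)
    if spread == 0:
        return []                   # all observations identical: nothing to flag
    return [x for x in values if abs(x - centre) > k * spread]


spend = [100, 50, 100, 150, 125, 100, 80, 75, 125, 150,
         150, 175, 50, 80, 25, 100, 100, 75, 125, 100]

# Hypothetical data-entry error added for illustration only.
spend_with_error = spend + [1000]

print(flag_outliers(spend))             # nothing beyond 3 standard deviations
print(flag_outliers(spend_with_error))  # the $1,000 value is flagged
```

Note that an extreme value inflates both the mean and the standard deviation used to judge it, which is one reason the cut-off depends on the quantities you are dealing with.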

15

• Let’s now use Excel to calculate the variance and standard

deviation.

Consider the table below which shows

the ages of 10 college graduates from NYU.

Age

23

25

24

23

24

23

20

23

22

30

Calculating the Standard Deviation I

(Using Excel – PC)

16

• To determine the measures of dispersion in Excel, we first go to data,

data analysis and then click “Descriptive Statistics” (just as we did to calculate

the mean).

Calculating the Standard Deviation II

(Using Excel – PC)

17

• Highlight your data in the “Input Range,” check “ Labels,” and

decide your “Output Range”, then check “Summary Statistics.”

Calculating the Standard Deviation III

(Using Excel – PC)

18

• Then you have the range and standard deviation of the data set.

Calculating the Standard Deviation IV

(Using Excel – PC)

19

4.1 The range, as a measure of spread, has the disadvantage of being

influenced by outliers. Illustrate this with an example.

4.2 Can the standard deviation have a negative value? Explain.

4.3 When is the value of the standard deviation for a data set zero? Give one

example. Calculate the standard deviation for this example and show that

its value is zero.

4.4 The following table gives the 2009 revenues (rounded to billions of dollars)

of the top 10 companies in Fortune magazine’s Global 500 (Fortune Magazine).

Find the range, variance, and standard deviation for these data (by hand and

Excel).

Company                                 Revenue (in billions of U.S. dollars)
Mitsubishi (Japan)                      184
Mitsui (Japan)                          182
Itochu (Japan)                          169
General Motors (U.S.)                   169
Sumitomo (Japan)                        168
Marubeni (Japan)                        161
Ford Motor (U.S.)                       137
Toyota Motor (Japan)                    111
Exxon (U.S.)                            110
Royal Dutch/Shell Group (Brit/Neth)     110

Section 4 Exercises I

20

Section 4 Exercises II

4.5 Consider the following two data sets.

Data Set I: 12 25 37 8 41

Data Set II: 19 32 44 15 48

Note that each value of the second data set is obtained by adding 7 to the

corresponding value of the first data set. Calculate the standard deviation

for each of these two data sets using the formula for sample data.

Comment on the relationship between the two standard deviations.

Section 5

Data Distribution

Rhonda Knehans Drake

Associate Professor, New York University

Data Analytics, Interpretation and Reporting

Copyright © 2013

2

• There are many distributional forms that our marketing data

can take on.

• Most data follows a specific form of one type or another.

• Because of this fact, we can make estimates and forecasts

about what our data is telling us and do so with a certain level of

confidence.

Introduction

3

Examples of common forms of data in industry are:

– Time to failure follows what is known as an exponential

distribution. Xerox takes advantage of this well-known fact to

determine servicing needs and how to set contract prices for its

various equipment.

– Department stores take advantage of the fact that the period of time

between the arrivals of two successive customers also follows an

exponential distribution.

Examples of Data Distribution I

4

Examples of common forms of data in industry are:

– If you are interested in counting the number of occurrences of a

specific event within a set time period then we have what is called

the Poisson distribution.

– The number of murders in NYC during the month of April follows a

Poisson distribution. Based on this fact, the city can then

estimate the murder rate for the next month.

– The hypergeometric distribution is used when doing QC

checking for defectives in a large lot. With this distribution, you

can determine the probability that in a sample of size n from the

entire lot of size N, you will accept it when in reality it exceeds your

defective rate.

Examples of Data Distribution I

5

• However, the most prevalent and most important functional

form for a marketer is the normal distribution.

• Also known as the bell shaped curve.

• Many data elements follow the normal distribution:

– Age

– Income Levels

– GPAs

– Spend

– Years of education

– Etc.

• And because all these measures tend to follow the normal curve,

we as marketers and researchers are able to make many

inferences about our metrics with a high degree of confidence.

Examples of Data Distribution II

6

• Referring back to the distribution of income from a prior chapter,

we now know this data to be normally distributed.

• The distribution of income was symmetric and bell-shaped.

[Figure: Histogram and polygon for the relative frequency distribution of income levels. Horizontal axis: Income Categories, from $10,000 – $25,000 to $100,000 – $115,000 in $15,000 bands; vertical axis: Relative Frequency, 0 to 0.35.]

Normal Distribution

7

• What would you expect the measures of skewness and kurtosis to

be for this data?

[Figure: Histogram and polygon for the relative frequency distribution of income levels, as on the previous slide.]

Normal Distribution

8

• As you recall from Section 4, we discussed the Empirical Rule.

This rule stated that if the distribution of your data is symmetric

and bell-shaped (now known as normally distributed data):

– 68% of the observations within the data set will lie within one

standard deviation of the mean

– 95% of the observations within the data set will lie within two

standard deviations of the mean

– 99.7% of the observations within the data set will lie within

three standard deviations of the mean

The Spread of Normally Distributed Data I

9

• Pictorially, this looks as follows:

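[Figure: bell curve with the horizontal axis marked at the mean and at 1, 2 and 3 standard deviations on either side of it]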

The Spread of Normally Distributed Data II
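
The 68%, 95% and 99.7% figures of the Empirical Rule can be checked against the normal curve itself. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```

Because every normal distribution can be rescaled to the standard normal, the same shares apply whatever the mean and standard deviation of the data happen to be.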

10

5.1 Rite Aid Pharmacy wishes to monitor the number of customers arriving at

the checkout counter on Sunday afternoons for staffing purposes. What

distributional form will this data follow?

5.2 How is the time to failure for GE light bulbs

distributed?

5.3 How would you suspect the average age of your customer base to be

distributed?

5.4 Income on your customer database is distributed normally with a mean of

$55,000 and a standard deviation of $10,000. What percent of the

database do you estimate will have an income within the range $35,000

to $75,000?

5.5 How do the width and height of a normal distribution change when its

mean remains the same but its standard deviation decreases? Show this

graphically.

5.6 How do the width and/or height of a normal distribution change when its

standard deviation remains the same but its mean increases? Show this

graphically.

Section 5 Exercises

Section 6

The Central Limit Theorem

Rhonda Knehans Drake

Associate Professor, New York University

Data Analytics, Interpretation and Reporting

Copyright © 2013

2

• The assessment of sample means (averages and percentages)

is the basis of many everyday business decisions.

• Therefore, understanding exactly how an average is distributed

is KEY to properly assessing one versus another (Pre vs. Post,

Control vs. Test, etc.).

• As luck would have it, averages will approximately follow a

normal distribution as n gets large (n > 30).

Introduction

3

• When you take a sample from a population and calculate the

average dollars spent, for example, that average or mean has

certain distributional properties.

• According to the Central Limit Theorem, regardless of how the

population from which we sampled is distributed, the sample

mean or response rate or click through rate (for n ≥ 30) will be

normally distributed with a mean equal to the mean of the

population from which the sample came and a standard

deviation equal to the standard deviation of the population from

which the sample came divided by the square root of the sample

size.

• This was proven a long time ago. And we can take advantage

of it.

The Central Limit Theorem I

4

• So what does this mean? Even if the distribution of dollars

spent is highly skewed and not symmetric, and therefore not

normal at all, the distribution of statistics such as average spend will be.

• Let’s take a look at what the Central Limit Theorem is saying.

The Central Limit Theorem I

5

[Figure: the ACME database (10,000,000 customer records) is sampled 1,000 times: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample 1,000 (n = 10,000 customers each). Each sample yields its own average income: X̄1 for sample 1, X̄2 for sample 2, X̄3 for sample 3, X̄4 for sample 4, …, X̄1,000 for sample 1,000.]

The Central Limit Theorem II

• The database analyst at ACME Direct draws 1,000 random

samples, of size 10,000 each, from the database and observes

the data element “household income.”

6

• The analyst then calculates the mean “household income” for

each of the 1,000 samples and creates a frequency distribution

of the average incomes from the 1,000 samples drawn.

Income Range           Frequency
Less than $10,000      32
$10,000 – $20,000      119
$20,000 – $30,000      187
$30,000 – $40,000      326
$40,000 – $50,000      192
$50,000 – $60,000      116
$60,000+               28
Total                  1,000

The Central Limit Theorem III

7

• The analyst creates a histogram using the frequency table and

notes the distribution of these 1,000 sample mean income

values is normally distributed (symmetric and bell-shaped).

[Figure: histogram titled “Distribution of 1,000 Sample Means”. Horizontal axis: Income Ranges, from Less than $10,000 to $60,000+; vertical axis: Frequency, 0 to 350.]

The Central Limit Theorem IV

8

• According to the Central Limit Theorem, the histogram will be

normally distributed (bell-shaped and symmetric) with

– a mean equal to the true mean “household income” level of

all people on the database and

– a standard deviation equal to the true standard deviation

for all people on the database divided by the square

root of n.

The Central Limit Theorem V
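
The two claims of the Central Limit Theorem can be illustrated with a small simulation. This sketch is not the ACME example: it assumes a deliberately skewed population (an exponential distribution with mean 1 and standard deviation 1), and the sample size, number of samples and random seed are arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(42)          # fixed seed so the run is repeatable

SAMPLE_SIZE = 50         # n, the size of each sample
NUM_SAMPLES = 2000       # how many samples are drawn
POP_MEAN = 1.0           # mean of the assumed exponential population
POP_SD = 1.0             # standard deviation of that population

# Draw many samples from the skewed population and keep each sample's mean.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

observed_mean = mean(sample_means)
observed_sd = stdev(sample_means)
expected_sd = POP_SD / SAMPLE_SIZE ** 0.5   # sigma divided by the square root of n

print(f"Mean of sample means: {observed_mean:.3f} (population mean {POP_MEAN})")
print(f"SD of sample means:   {observed_sd:.3f} (expected {expected_sd:.3f})")
```

Even though individual draws from this population are heavily skewed to the right, the sample means cluster around the population mean with a spread close to the population standard deviation divided by the square root of n.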

9

The normal curve is so important that it was printed, alongside a portrait of

Gauss, on the German 10 Deutsche Mark banknote.

Gauss and the Deutsche Mark

http://en.wikipedia.org/wiki/File:Carl_Friedrich_Gauss

http://upload.wikimedia.org/wikipedia/commons/0/0d/10_DM_Serie4_Vorderseite

10

Suppose you work for American Express and conducted a test with

1,000 new card members to stimulate spend.

• Based on your test you received an average spend value of $175 for

this test with a standard deviation of $25.

• You know this is not reality because you only conducted a test.

How can you estimate the spend level for rollout to all new card

members?

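• Based on the CLT we can say with 95% certainty that the true average spend

will lie within plus or minus 2 standard errors of our average, where the

standard error is the standard deviation divided by the square root of the

sample size: $25 / √1,000 ≈ $0.79.

• So, in other words, the true average spend should lie somewhere between

about $173.42 and $176.58 with 95% certainty. (By contrast, $125 to $225, or

plus or minus 2 standard deviations, is where about 95% of individual card

members' spend would fall if spend is bell-shaped.)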

An Example of the CLT in Practice
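
The arithmetic behind this example can be laid out in Python. The sketch computes two ranges side by side: plus or minus 2 standard deviations around the average, which describes individual card members' spend, and plus or minus 2 standard errors (the standard deviation divided by the square root of n, as in the CLT slides), which describes the true average spend. Variable names are illustrative.

```python
from math import sqrt

n = 1000               # card members in the test
sample_mean = 175.0    # average spend observed in the test, in dollars
sample_sd = 25.0       # standard deviation of spend in the test, in dollars

# Spread of individual spend values: plus or minus 2 standard deviations.
individual_low = sample_mean - 2 * sample_sd
individual_high = sample_mean + 2 * sample_sd

# Spread of the sample mean: plus or minus 2 standard errors.
standard_error = sample_sd / sqrt(n)
mean_low = sample_mean - 2 * standard_error
mean_high = sample_mean + 2 * standard_error

print(f"Individual spend (about 95%): ${individual_low:.2f} to ${individual_high:.2f}")
print(f"Standard error: ${standard_error:.2f}")
print(f"True average spend (about 95%): ${mean_low:.2f} to ${mean_high:.2f}")
```

The interval for the average is much narrower than the range for individuals because the standard error shrinks as the sample size grows.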

11

6.1 You note the number of arrivals each day at Starbucks in Grand Central

Station between the hours of 5 pm and 6 pm for the months of April and

May. You do the same for the Starbucks at Penn Station. You calculate

the average for Grand Central and the average for Penn Station.

a) How is the variable number of arrivals between 5pm and 6pm

distributed?

b) How is the average number of arrivals for Grand Central and Penn

Station distributed?

6.2 You know income to be normally distributed on your customer file. You

sample 20 people on the file. How will the average income be distributed

for this small sample?

6.3 Income on your database is highly skewed. You sample 20 people on the

file. How will the average income be distributed for this small sample?

Section 6 Exercises I

12

6.4 Income on your database is highly skewed. You sample 1,000 people on

the file. How will the average income be distributed for this sample of size

1,000?

Section 6 Exercises II

**Graphing/Charting Project**

Below is data obtained from the National Cancer Society with projections for the number of newly diagnosed cancer patients through 2050, broken out by age.

Age of Newly Diagnosed Cancer Patients

Year   <50    50-64  65-74  75-84  >85
2000   0.17   0.38   0.35   0.31   0.14
2010   0.17   0.46   0.57   0.34   0.14
2020   0.17   0.55   0.68   0.40   0.20
2030   0.18   0.60   0.70   0.62   0.16
2040   0.23   0.59   0.69   0.71   0.28
2050   0.22   0.68   0.65   0.70   0.42

Note: values are in millions; for example, 0.17 equals 170,000.

__Instructions:__

Prepare one slide that graphically best presents this data and make appropriate observations. It would be interesting to try to show whether the number of cancer patients is going up year after year and how the distribution by age range is also changing, all in one graphical representation. A stacked bar chart or area chart might serve this purpose well.
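
Before charting, it can help to see the numbers a stacked bar chart would display: the height of each bar (the yearly total) and the share of each age group within it. The Python sketch below computes both from the table above; it is only a starting point under that assumption, the variable names are illustrative, and the chart itself can still be built in Excel.

```python
age_groups = ["<50", "50-64", "65-74", "75-84", ">85"]

# Newly diagnosed patients in millions, by year and age group (from the table).
new_cases = {
    2000: [0.17, 0.38, 0.35, 0.31, 0.14],
    2010: [0.17, 0.46, 0.57, 0.34, 0.14],
    2020: [0.17, 0.55, 0.68, 0.40, 0.20],
    2030: [0.18, 0.60, 0.70, 0.62, 0.16],
    2040: [0.23, 0.59, 0.69, 0.71, 0.28],
    2050: [0.22, 0.68, 0.65, 0.70, 0.42],
}

# Total height of each stacked bar.
totals = {year: sum(counts) for year, counts in new_cases.items()}

# Share of each age group within the year's total (the segments of each bar).
shares = {
    year: [count / totals[year] for count in counts]
    for year, counts in new_cases.items()
}

for year in sorted(new_cases):
    segments = ", ".join(
        f"{group}: {share:.0%}" for group, share in zip(age_groups, shares[year])
    )
    print(f"{year}: total {totals[year]:.2f} million ({segments})")
```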