the correlation coefficient

Assignment 1: Discussion


Using one of the two formulas cited in this module, calculate the correlation coefficient for the values presented below. Once you have completed your calculation, discuss the following: Is there a statistically significant correlation between customer service attitude scores and number of overtime hours? State the research question and testable hypothesis. Interpret, discuss, and support your findings with at least two other classmates.

Customer Service Attitude Scores | OT Hours
5 | 1
10 | 6
5 | 2
11 | 8
12 | 5
4 | 1
3 | 4
2 | 6
6 | 5
1 | 2
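If you would like to check your hand calculation, a minimal Python sketch (assuming SciPy is available and using the ten score/hour pairs as reconstructed in the table above) could look like this; it is a cross-check, not a substitute for the formula presented in the module:

```python
# Hypothetical check of the hand calculation, assuming the ten
# (attitude score, overtime hours) pairs shown in the table above.
from scipy import stats

scores = [5, 10, 5, 11, 12, 4, 3, 2, 6, 1]     # customer service attitude scores
ot_hours = [1, 6, 2, 8, 5, 1, 4, 6, 5, 2]      # overtime hours for the same employees

r, p_value = stats.pearsonr(scores, ot_hours)  # Pearson r and its two-sided p-value
print(f"r = {r:.3f}, p = {p_value:.4f}")       # compare p with alpha = .05
```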

By Friday, February 22, 2013, post to the Discussion Area.

 


Assignment 2: T-Test

By Tuesday, February 26, 2013, post your assignment to the M2: Assignment 2 Dropbox. Any conclusion drawn from the t-test statistical process is only as good as the research question asked and the null hypothesis formulated. T-tests are used only for two sample groups, either on a pre-/post-test basis or between two samples (independent or dependent). The t-test is optimized to deal with small sample sizes, which is often the case with managers in any business. When samples are excessively large, the t-test becomes difficult to manage due to the mathematical calculations involved.

Calculate the “t” value for independent groups for the following data using the formula presented in the module. Check the accuracy of your calculations. Using the raw measurement data presented below, determine whether or not there exists a statistically significant difference between the salaries of female and male human resource managers using the appropriate t-test. Develop a research question, testable hypothesis, confidence level, and degrees of freedom. Draw the appropriate conclusions with respect to female and male HR salary levels. Report the required “t” critical values based on the degrees of freedom. Your response should be 2-3 pages.

Salary Level

Female HR Directors | Male HR Directors
$50,000 | $58,000
$75,000 | $69,000
$72,000 | $73,000
$67,000 | $67,000
$54,000 | $55,000
$58,000 | $63,000
$52,000 | $53,000
$68,000 | $70,000
$71,000 | $69,000
$55,000 | $60,000

*Do not forget what we all learned in high school about “0”s
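As a cross-check on the hand calculation, a minimal Python sketch (assuming SciPy is available, the ten salary pairs listed above, and the usual pooled-variance independent-groups t-test) could be:

```python
# Hypothetical cross-check of the independent-groups t calculation,
# assuming the ten female and ten male HR director salaries listed above (in dollars).
from scipy import stats

female = [50000, 75000, 72000, 67000, 54000, 58000, 52000, 68000, 71000, 55000]
male   = [58000, 69000, 73000, 67000, 55000, 63000, 53000, 70000, 69000, 60000]

# equal_var=True gives the classic pooled-variance independent t-test
t_stat, p_value = stats.ttest_ind(female, male, equal_var=True)
df = len(female) + len(male) - 2               # degrees of freedom for the pooled test
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p_value:.4f}")
```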

CHAPTER 9: Hypothesis Testing
Chapter Outline

9.1
The Null and Alternative Hypotheses and Errors in Hypothesis Testing

9.2

z
Tests about a Population Mean:
σ
Known

9.3

t
Tests about a Population Mean:
σ
Unknown

9.4

z
Tests about a Population Proportion

9.5
Type II Error Probabilities and Sample Size Determination (Optional)

9.6
The Chi-Square Distribution (Optional)

9.7
Statistical Inference for a Population Variance (Optional)

Hypothesis testing is a statistical procedure used to provide evidence in favor of some statement (called a hypothesis). For instance, hypothesis testing might be used to assess whether a population parameter, such as a population mean, differs from a specified standard or previous value. In this chapter we discuss testing hypotheses about population means, proportions, and variances.
In order to illustrate how hypothesis testing works, we revisit several cases introduced in previous chapters and also introduce some new cases:

The Payment Time Case: The consulting firm uses hypothesis testing to provide strong evidence that the new electronic billing system has reduced the mean payment time by more than 50 percent.
The Cheese Spread Case: The cheese spread producer uses hypothesis testing to supply extremely strong evidence that fewer than 10 percent of all current purchasers would stop buying the cheese spread if the new spout were used.
The Electronic Article Surveillance Case: A company that sells and installs EAS systems claims that at most 5 percent of all consumers would never shop in a store again if the store subjected them to a false EAS alarm. A store considering the purchase of such a system uses hypothesis testing to provide extremely strong evidence that this claim is not true.
The Trash Bag Case: A marketer of trash bags uses hypothesis testing to support its claim that the mean breaking strength of its new trash bag is greater than 50 pounds. As a result, a television network approves use of this claim in a commercial.
The Valentine’s Day Chocolate Case: A candy company projects that this year’s sales of its special valentine box of assorted chocolates will be 10 percent higher than last year. The candy company uses hypothesis testing to assess whether it is reasonable to plan for a 10 percent increase in sales of the valentine box.
9.1: The Null and Alternative Hypotheses and Errors in Hypothesis Testing
One of the authors’ former students is employed by a major television network in the standards and practices division. One of the division’s responsibilities is to reduce the chances that advertisers will make false claims in commercials run on the network. Our former student reports that the network uses a statistical methodology called hypothesis testing to do this.

To see how this might be done, suppose that a company wishes to advertise a claim, and suppose that the network has reason to doubt that this claim is true. The network assumes for the sake of argument that the claim is not valid. This assumption is called the null hypothesis. The statement that the claim is valid is called the alternative, or research, hypothesis. The network will run the commercial only if the company making the claim provides sufficient sample evidence to reject the null hypothesis that the claim is not valid in favor of the alternative hypothesis that the claim is valid. Explaining the exact meaning of sufficient sample evidence is quite involved and will be discussed in the next section.
The Null Hypothesis and the Alternative Hypothesis
In hypothesis testing:
1 The null hypothesis, denoted H0, is the statement being tested. Usually this statement represents the status quo and is not rejected unless there is convincing sample evidence that it is false.
2 The alternative, or research, hypothesis, denoted Ha, is a statement that will be accepted only if there is convincing sample evidence that it is true.
Setting up the null and alternative hypotheses in a practical situation can be tricky. In some situations there is a condition for which we need to attempt to find supportive evidence. We then formulate (1) the alternative hypothesis to be the statement that this condition exists and (2) the null hypothesis to be the statement that this condition does not exist. To illustrate this, we consider the following case studies.
EXAMPLE 9.1: The Trash Bag Case1
A leading manufacturer of trash bags produces the strongest trash bags on the market. The company has developed a new 30-gallon bag using a specially formulated plastic that is stronger and more biodegradable than other plastics. This plastic’s increased strength allows the bag’s thickness to be reduced, and the resulting cost savings will enable the company to lower its bag price by 25 percent. The company also believes the new bag is stronger than its current 30-gallon bag.
The manufacturer wants to advertise the new bag on a major television network. In addition to promoting its price reduction, the company also wants to claim the new bag is better for the environment and stronger than its current bag. The network is convinced of the bag’s environmental advantages on scientific grounds. However, the network questions the company’s claim of increased strength and requires statistical evidence to justify this claim. Although there are various measures of bag strength, the manufacturer and the network agree to employ “breaking strength.” A bag’s breaking strength is the amount of a representative trash mix (in pounds) that, when loaded into a bag suspended in the air, will cause the bag to rip or tear. Tests show that the current bag has a mean breaking strength that is very close to (but does not exceed) 50 pounds. The new bag’s mean breaking strength μ is unknown and in question. The alternative hypothesis Ha is the statement for which we wish to find supportive evidence. Because we hope the new bags are stronger than the current bags, Ha says that μ is greater than 50. The null hypothesis states that Ha is false. Therefore, H0 says that μ is less than or equal to 50. We summarize these hypotheses by stating that we are testing
H0: μ ≤ 50   versus   Ha: μ > 50
The network will run the manufacturer’s commercial if a random sample of n new bags provides sufficient evidence to reject H0: μ ≤ 50   in favor of   Ha: μ > 50.
EXAMPLE 9.2: The Payment Time Case
Recall that a management consulting firm has installed a new computer-based, electronic billing system for a Hamilton, Ohio, trucking company. Because of the system’s advantages, and because the trucking company’s clients are receptive to using this system, the management consulting firm believes that the new system will reduce the mean bill payment time by more than 50 percent. The mean payment time using the old billing system was approximately equal to, but no less than, 39 days. Therefore, if μ denotes the mean payment time using the new system, the consulting firm believes that μ will be less than 19.5 days. Because it is hoped that the new billing system reduces mean payment time, we formulate the alternative hypothesis as Ha: μ < 19.5 and the null hypothesis as H0: μ ≥ 19.5. The consulting firm will randomly select a sample of n invoices and determine if their payment times provide sufficient evidence to reject H0: μ ≥ 19.5 in favor of Ha: μ < 19.5. If such evidence exists, the consulting firm will conclude that the new electronic billing system has reduced the Hamilton trucking company’s mean bill payment time by more than 50 percent. This conclusion will be used to help demonstrate the benefits of the new billing system both to the Hamilton company and to other trucking companies that are considering using such a system.

EXAMPLE 9.3: The Valentine’s Day Chocolate Case2

A candy company annually markets a special 18-ounce box of assorted chocolates to large retail stores for Valentine’s Day. This year the candy company has designed an extremely attractive new valentine box and will fill the box with an especially appealing assortment of chocolates. For this reason, the candy company subjectively projects—based on past experience and knowledge of the candy market—that sales of its valentine box will be 10 percent higher than last year. However, since the candy company must decide how many valentine boxes to produce, the company needs to assess whether it is reasonable to plan for a 10 percent increase in sales. Before the beginning of each Valentine’s Day sales season, the candy company sends large retail stores information about its newest valentine box of assorted chocolates. This information includes a description of the box of chocolates, as well as a preview of advertising displays that the candy company will provide to help retail stores sell the chocolates. Each retail store then places a single (nonreturnable) order of valentine boxes to satisfy its anticipated customer demand for the Valentine’s Day sales season. Last year the mean order quantity of large retail stores was 300 boxes per store. If the projected 10 percent sales increase occurs, the mean order quantity, μ, of large retail stores this year will be 330 boxes per store. Therefore, the candy company wishes to test the null hypothesis H0: μ = 330 versus the alternative hypothesis Ha: μ ≠ 330. To perform the hypothesis test, the candy company will randomly select a sample of n large retail stores and will make an early mailing to these stores promoting this year’s valentine box. The candy company will then ask each retail store to report how many valentine boxes it anticipates ordering. If the sample data do not provide sufficient evidence to reject H0: μ = 330 in favor of Ha: μ ≠ 330, the candy company will base its production on the projected 10 percent sales increase. On the other hand, if there is sufficient evidence to reject H0: μ = 330, the candy company will change its production plans.
We next summarize the sets of null and alternative hypotheses that we have thus far considered. The alternative hypothesis Ha: μ > 50 is called a one-sided, greater than alternative hypothesis, whereas Ha: μ < 19.5 is called a one-sided, less than alternative hypothesis, and Ha: μ ≠ 330 is called a two-sided, not equal to alternative hypothesis. Many of the alternative hypotheses we consider in this book are one of these three types. Also, note that each null hypothesis we have considered involves an equality. For example, the null hypothesis H0: μ ≤ 50 says that μ is either less than or equal to 50. We will see that, in general, the approach we use to test a null hypothesis versus an alternative hypothesis requires that the null hypothesis involve an equality.

The idea of a test statistic

Suppose that in the trash bag case the manufacturer randomly selects a sample of n = 40 new trash bags. Each of these bags is tested for breaking strength, and the sample mean x̄ of the 40 breaking strengths is calculated. In order to test H0: μ ≤ 50 versus Ha: μ > 50, we utilize the test statistic

z = (x̄ − 50) / (σ/√n)

The test statistic z measures the distance between x̄ and 50. The division by σ/√n says that this distance is measured in units of the standard deviation of all possible sample means. For example, a value of z equal to, say, 2.4 would tell us that x̄ is 2.4 such standard deviations above 50. In general, a value of the test statistic that is less than or equal to zero results when x̄ is less than or equal to 50. This provides no evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ is probably less than or equal to 50. However, a value of the test statistic that is greater than zero results when x̄ is greater than 50. This provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be greater than 50. Furthermore, the farther the value of the test statistic is above 0 (the farther x̄ is above 50), the stronger is the evidence to support rejecting H0 in favor of Ha.
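As a small illustration of this calculation, the following Python sketch computes the test statistic; the numbers are illustrative assumptions, since x̄, σ, and n would come from the actual sample:

```python
# Illustrative computation of the z test statistic; x_bar, sigma, and n
# would come from the actual sample in a real application.
import math

def z_statistic(x_bar, mu_0, sigma, n):
    """Distance of the sample mean from mu_0, in standard-error units."""
    return (x_bar - mu_0) / (sigma / math.sqrt(n))

# Hypothetical numbers: a sample mean slightly above the claimed 50 pounds.
print(round(z_statistic(x_bar=50.575, mu_0=50.0, sigma=1.65, n=40), 2))  # about 2.2
```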
Hypothesis testing and the legal system
If the value of the test statistic z is far enough above 0, we reject H0 in favor of Ha. To see how large z must be in order to reject H0, we must understand that a hypothesis test rejects a null hypothesis H0 only if there is strong statistical evidence against H0. This is similar to our legal system, which rejects the innocence of the accused only if evidence of guilt is beyond a reasonable doubt. For instance, the network will reject H0: μ ≤ 50 and run the trash bag commercial only if the test statistic z is far enough above 0 to show beyond a reasonable doubt that H0: μ ≤ 50 is false and Ha: μ > 50 is true. A test statistic that is only slightly greater than 0 might not be convincing enough. However, because such a test statistic would result from a sample mean that is slightly greater than 50, it would provide some evidence to support rejecting H0: μ ≤ 50, and it certainly would not provide strong evidence supporting H0: μ ≤ 50. Therefore, if the value of the test statistic is not large enough to convince us to reject H0, we do not say that we accept H0. Rather we say that we do not reject H0 because the evidence against H0 is not strong enough. Again, this is similar to our legal system, where the lack of evidence of guilt beyond a reasonable doubt results in a verdict of not guilty, but does not prove that the accused is innocent.
Type I and Type II errors and their probabilities
To determine exactly how much statistical evidence is required to reject H0, we consider the errors and the correct decisions that can be made in hypothesis testing. These errors and correct decisions, as well as their implications in the trash bag advertising example, are summarized in Tables 9.1 and 9.2. Across the top of each table are listed the two possible “states of nature.” Either H0: μ ≤ 50 is true, which says the manufacturer’s claim that μ is greater than 50 is false, or H0 is false, which says the claim is true. Down the left side of each table are listed the two possible decisions we can make in the hypothesis test. Using the sample data, we will either reject H0: μ ≤ 50, which implies that the claim will be advertised, or we will not reject H0, which implies that the claim will not be advertised.
Table 9.1: Type I and Type II Errors

Table 9.2: The Implications of Type I and Type II Errors in the Trash Bag Example

In general, the two types of errors that can be made in hypothesis testing are defined here:
Type I and Type II Errors
If we reject H0 when it is true, this is a Type I error.
If we do not reject H0 when it is false, this is a Type II error.
As can be seen by comparing Tables 9.1 and 9.2, if we commit a Type I error, we will advertise a false claim. If we commit a Type II error, we will fail to advertise a true claim.
We now let the symbol α (pronounced alpha) denote the probability of a Type I error, and we let β (pronounced beta) denote the probability of a Type II error. Obviously, we would like both α and β to be small. A common (but not the only) procedure is to base a hypothesis test on taking a sample of a fixed size (for example, n = 40 trash bags) and on setting α equal to a small prespecified value. Setting α low means there is only a small chance of rejecting H0 when it is true. This implies that we are requiring strong evidence against H0 before we reject it.
We sometimes choose α as high as .10, but we usually choose α between .05 and .01. A frequent choice for α is .05. In fact, our former student tells us that the network often tests advertising claims by setting the probability of a Type I error equal to .05. That is, the network will run a commercial making a claim if the sample evidence allows it to reject a null hypothesis that says the claim is not valid in favor of an alternative hypothesis that says the claim is valid with α set equal to .05. Since a Type I error is deciding that the claim is valid when it is not, the policy of setting α equal to .05 says that, in the long run, the network will advertise only 5 percent of all invalid claims made by advertisers.
One might wonder why the network does not set α lower—say at .01. One reason is that it can be shown that, for a fixed sample size, the lower we set α, the higher is β, and the higher we set α, the lower is β. Setting α at .05 means that β, the probability of failing to advertise a true claim (a Type II error), will be smaller than it would be if α were set at .01. As long as (1) the claim to be advertised is plausible and (2) the consequences of advertising the claim even if it is false are not terribly serious, then it is reasonable to set α equal to .05. However, if either (1) or (2) is not true, then we might set α lower than .05. For example, suppose a pharmaceutical company wishes to advertise that it has developed an effective treatment for a disease that has formerly been very resistant to treatment. Such a claim is (perhaps) difficult to believe. Moreover, if the claim is false, patients suffering from the disease would be subjected to false hope and needless expense. In such a case, it might be reasonable for the network to set α at .01 because this would lower the chance of advertising the claim if it is false. We usually do not set α lower than .01 because doing so often leads to an unacceptably large value of β. We explain some methods for computing the probability of a Type II error in optional Section 9.5. However, β can be difficult or impossible to calculate in many situations, and we often must rely on our intuition when deciding how to set α.
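To see the α–β trade-off numerically, here is a small sketch (a standard normal-theory calculation, not taken from the text, with purely illustrative numbers and assuming SciPy is available) that computes β for a one-sided z test at a hypothetical true mean:

```python
# Normal-theory sketch (not from the text) of the alpha-beta trade-off for a
# one-sided "greater than" z test; all numbers below are illustrative.
import math
from scipy import stats

def beta_greater_than(mu_0, mu_a, sigma, n, alpha):
    """P(fail to reject H0: mu <= mu_0) when the true mean is actually mu_a."""
    z_crit = stats.norm.ppf(1 - alpha)          # critical value z_alpha
    se = sigma / math.sqrt(n)                   # standard error of the sample mean
    # We fail to reject whenever x_bar falls below mu_0 + z_crit * se.
    return stats.norm.cdf((mu_0 + z_crit * se - mu_a) / se)

for alpha in (0.05, 0.01):                      # lowering alpha raises beta for fixed n
    print(alpha, round(beta_greater_than(50, 50.5, 1.65, 40, alpha), 3))
```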
Exercises for Section 9.1
CONCEPTS

9.1 Which hypothesis (the null hypothesis, H0, or the alternative hypothesis, Ha) is the “status quo” hypothesis (that is, the hypothesis that states that things are remaining “as is”)? Which hypothesis is the hypothesis that says that a “hoped for” or “suspected” condition exists?
9.2 Which hypothesis (H0 or Ha) is not rejected unless there is convincing sample evidence that it is false? Which hypothesis (H0 or Ha) will be accepted only if there is convincing sample evidence that it is true?
9.3 Define each of the following:
a Type I error
b Type II error
c α

d β

9.4 For each of the following situations, indicate whether an error has occurred and, if so, indicate what kind of error (Type I or Type II) has occurred.
a We do not reject H0 and H0 is true.
b We reject H0 and H0 is true.
c We do not reject H0 and H0 is false.
d We reject H0 and H0 is false.
9.5 If we reject H0, what is the only type of error that we could be making? Explain.
9.6 If we do not reject H0, what is the only type of error that we could be making? Explain.
9.7 When testing a hypothesis, why don’t we set the probability of a Type I error to be extremely small? Explain.
METHODS AND APPLICATIONS
9.8 THE VIDEO GAME SATISFACTION RATING CASE VideoGame
Recall that “very satisfied” customers give the XYZ-Box video game system a rating that is at least 42. Suppose that the manufacturer of the XYZ-Box wishes to use the random sample of 65 satisfaction ratings to provide evidence supporting the claim that the mean composite satisfaction rating for the XYZ-Box exceeds 42.
a Letting μ represent the mean composite satisfaction rating for the XYZ-Box, set up the null and alternative hypotheses needed if we wish to attempt to provide evidence supporting the claim that μ exceeds 42.
b In the context of this situation, interpret making a Type I error; interpret making a Type II error.
9.9 THE BANK CUSTOMER WAITING TIME CASE WaitTime
Recall that a bank manager has developed a new system to reduce the time customers spend waiting for teller service during peak hours. The manager hopes the new system will reduce waiting times from the current 9 to 10 minutes to less than 6 minutes.
Suppose the manager wishes to use the random sample of 100 waiting times to support the claim that the mean waiting time under the new system is shorter than six minutes.
a Letting μ represent the mean waiting time under the new system, set up the null and alternative hypotheses needed if we wish to attempt to provide evidence supporting the claim that μ is shorter than six minutes.
b In the context of this situation, interpret making a Type I error; interpret making a Type II error.
9.10 An automobile parts supplier owns a machine that produces a cylindrical engine part. This part is supposed to have an outside diameter of three inches. Parts with diameters that are too small or too large do not meet customer requirements and must be rejected. Lately, the company has experienced problems meeting customer requirements. The technical staff feels that the mean diameter produced by the machine is off target. In order to verify this, a special study will randomly sample 100 parts produced by the machine. The 100 sampled parts will be measured, and if the results obtained cast a substantial amount of doubt on the hypothesis that the mean diameter equals the target value of three inches, the company will assign a problem-solving team to intensively search for the causes of the problem.
a The parts supplier wishes to set up a hypothesis test so that the problem-solving team will be assigned when the null hypothesis is rejected. Set up the null and alternative hypotheses for this situation.
b In the context of this situation, interpret making a Type I error; interpret making a Type II error.
c Suppose it costs the company $3,000 a day to assign the problem-solving team to a project. Is this $3,000 figure the daily cost of a Type I error or a Type II error? Explain.
9.11 The Crown Bottling Company has just installed a new bottling process that will fill 16-ounce bottles of the popular Crown Classic Cola soft drink. Both overfilling and underfilling bottles are undesirable: Underfilling leads to customer complaints and overfilling costs the company considerable money. In order to verify that the filler is set up correctly, the company wishes to see whether the mean bottle fill, μ, is close to the target fill of 16 ounces. To this end, a random sample of 36 filled bottles is selected from the output of a test filler run. If the sample results cast a substantial amount of doubt on the hypothesis that the mean bottle fill is the desired 16 ounces, then the filler’s initial setup will be readjusted.
a The bottling company wants to set up a hypothesis test so that the filler will be readjusted if the null hypothesis is rejected. Set up the null and alternative hypotheses for this hypothesis test.
b In the context of this situation, interpret making a Type I error; interpret making a Type II error.
9.12 Consolidated Power, a large electric power utility, has just built a modern nuclear power plant. This plant discharges waste water that is allowed to flow into the Atlantic Ocean. The Environmental Protection Agency (EPA) has ordered that the waste water may not be excessively warm so that thermal pollution of the marine environment near the plant can be avoided. Because of this order, the waste water is allowed to cool in specially constructed ponds and is then released into the ocean. This cooling system works properly if the mean temperature of waste water discharged is 60°F or cooler. Consolidated Power is required to monitor the temperature of the waste water. A sample of 100 temperature readings will be obtained each day, and if the sample results cast a substantial amount of doubt on the hypothesis that the cooling system is working properly (the mean temperature of waste water discharged is 60°F or cooler), then the plant must be shut down and appropriate actions must be taken to correct the problem.
a Consolidated Power wishes to set up a hypothesis test so that the power plant will be shut down when the null hypothesis is rejected. Set up the null and alternative hypotheses that should be used.
b In the context of this situation, interpret making a Type I error; interpret making a Type II error.
c The EPA periodically conducts spot checks to determine whether the waste water being discharged is too warm. Suppose the EPA has the power to impose very severe penalties (for example, very heavy fines) when the waste water is excessively warm. Other things being equal, should Consolidated Power set the probability of a Type I error equal to α = .01 or α = .05? Explain.
9.13 Consider Exercise 9.12, and suppose that Consolidated Power has been experiencing technical problems with the cooling system. Because the system has been unreliable, the company feels it must take precautions to avoid failing to shut down the plant when its waste water is too warm. Other things being equal, should Consolidated Power set the probability of a Type I error equal to α = .01 or α = .05? Explain.
9.2: z Tests about a Population Mean: σ Known
In this section we discuss hypothesis tests about a population mean that are based on the normal distribution. These tests are called z tests, and they require that the true value of the population standard deviation σ is known. Of course, in most real-world situations the true value of σ is not known. However, the concepts and calculations of hypothesis testing are most easily illustrated using the normal distribution. Therefore, in this section we will assume that—through theory or history related to the population under consideration—we know σ. When σ is unknown, we test hypotheses about a population mean by using the t distribution. In Section 9.3 we study t tests, and we will revisit the examples of this section assuming that σ is unknown.

Testing a “greater than” alternative hypothesis by using a critical value rule
In Section 9.1 we explained how to set up appropriate null and alternative hypotheses. We also discussed how to specify a value for α, the probability of a Type I error (also called the level of significance) of the hypothesis test, and we introduced the idea of a test statistic. We can use these concepts to begin developing a seven-step hypothesis testing procedure. We will introduce these steps in the context of the trash bag case and testing a “greater than” alternative hypothesis.
Step 1: State the null hypothesis H0 and the alternative hypothesis Ha. In the trash bag case, we will test H0: μ ≤ 50 versus Ha: μ > 50. Here, μ is the mean breaking strength of the new trash bag.
Step 2: Specify the level of significance α. The television network will run the commercial stating that the new trash bag is stronger than the former bag if we can reject H0: μ ≤ 50 in favor of Ha: μ > 50 by setting α equal to .05.
Step 3: Select the test statistic. In order to test H0: μ ≤ 50 versus Ha: μ > 50, we will test the modified null hypothesis H0: μ = 50 versus Ha: μ > 50. The idea here is that if there is sufficient evidence to reject the hypothesis that μ equals 50 in favor of μ > 50, then there is certainly also sufficient evidence to reject the hypothesis that μ is less than or equal to 50. In order to test H0: μ = 50 versus Ha: μ > 50, we will randomly select a sample of n = 40 new trash bags and calculate the mean x̄ of the breaking strengths of these bags. We will then utilize the test statistic

z = (x̄ − 50) / (σ/√n)

A positive value of this test statistic results from an x̄ that is greater than 50 and thus provides evidence against H0: μ = 50 and in favor of Ha: μ > 50.
Step 4: Determine the critical value rule for deciding whether to reject H0. To decide how large the test statistic z must be to reject H0 in favor of Ha by setting the probability of a Type I error equal to α, we note that different samples would give different sample means and thus different values of z. Because the sample size n = 40 is large, the Central Limit Theorem tells us that the sampling distribution of z is (approximately) a standard normal distribution if the null hypothesis H0: μ = 50 is true. Therefore, we do the following:
Place the probability of a Type I error, α, in the right-hand tail of the standard normal curve and use the normal table (see Table A.3) to find the normal point zα. Here zα, which we call a critical value, is the point on the horizontal axis under the standard normal curve that gives a right-hand tail area equal to α.
Reject H0: μ = 50 in favor of Ha: μ > 50 if and only if the test statistic z is greater than the critical value zα. (This is the critical value rule.)
Figure 9.1 illustrates that since we have set α equal to .05, we should use the critical value zα = z.05 = 1.645 (see Table A.3). This says that we should reject H0 if z > 1.645 and we should not reject H0 if z ≤ 1.645.
Figure 9.1: The Critical Value for Testing H0: μ = 50 versus Ha: μ > 50 by Setting α = .05

To better understand the critical value rule, consider the standard normal curve in Figure 9.1. The area of .05 in the right-hand tail of this curve implies that values of the test statistic z that are greater than 1.645 are unlikely to occur if the null hypothesis H0: μ = 50 is true. There is a 5 percent chance of observing one of these values—and thus wrongly rejecting H0—if H0 is true. However, we are more likely to observe a value of z greater than 1.645—and thus correctly reject H0—if H0 is false. Therefore, it is intuitively reasonable to reject H0 if the value of the test statistic z is greater than 1.645.
Step 5: Collect the sample data and compute the value of the test statistic. When the sample of n = 40 new trash bags is randomly selected, the mean of the breaking strengths is calculated to be x̄ = 50.575. Assuming that σ is known to equal 1.65, the value of the test statistic is

z = (50.575 − 50) / (1.65/√40) = 2.20
Step 6: Decide whether to reject H0 by using the test statistic value and the critical value rule. Since the test statistic value z = 2.20 is greater than the critical value z.05 = 1.645, we can reject H0: μ = 50 in favor of Ha: μ > 50 by setting α equal to .05. Furthermore, we can be intuitively confident that H0: μ = 50 is false and Ha: μ > 50 is true. This is because, since we have rejected H0 by setting α equal to .05, we have rejected H0 by using a test that allows only a 5 percent chance of wrongly rejecting H0. In general, if we can reject a null hypothesis in favor of an alternative hypothesis by setting the probability of a Type I error equal to α, we say that we have statistical significance at the α level.

Step 7: Interpret the statistical results in managerial (real-world) terms and assess their practical importance. Since we have rejected H0: μ = 50 in favor of Ha: μ > 50 by setting α equal to .05, we conclude (at an α of .05) that the mean breaking strength of the new trash bag exceeds 50 pounds. Furthermore, this conclusion has practical importance to the trash bag manufacturer because it means that the television network will approve running commercials claiming that the new trash bag is stronger than the former bag. Note, however, that the point estimate of μ, x̄ = 50.575, indicates that μ is not much larger than 50. Therefore, the trash bag manufacturer can claim only that its new bag is slightly stronger than its former bag. Of course, this might be practically important to consumers who feel that, because the new bag is 25 percent less expensive and is more environmentally sound, it is definitely worth purchasing if it has any strength advantage. However, to customers who are looking only for a substantial increase in bag strength, the statistical results would not be practically important. This illustrates that, in general, a finding of statistical significance (that is, concluding that the alternative hypothesis is true) can be practically important to some people but not to others. Notice that the point estimate of the parameter involved in a hypothesis test can help us to assess practical importance. We can also use confidence intervals to help assess practical importance.
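To make the mechanics concrete, here is a minimal Python sketch of steps 3 through 6 of the critical-value route (assuming SciPy is available; the sample mean 50.575 is the value used in the worked calculation above):

```python
# Minimal sketch of steps 3-6 for the trash bag test (H0: mu = 50 vs. Ha: mu > 50),
# using the sample mean assumed in the worked calculation above.
import math
from scipy import stats

mu_0, sigma, n, alpha = 50.0, 1.65, 40, 0.05
x_bar = 50.575                                   # sample mean of the 40 breaking strengths

z = (x_bar - mu_0) / (sigma / math.sqrt(n))      # step 5: test statistic, about 2.20
z_crit = stats.norm.ppf(1 - alpha)               # step 4: critical value z_alpha = 1.645

print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
print("Reject H0" if z > z_crit else "Do not reject H0")   # step 6
```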
Considerations in setting α
We have reasoned in Section 9.1 that the television network has set α equal to .05 rather than .01 because doing so means that β, the probability of failing to advertise a true claim (a Type II error), will be smaller than it would be if α were set at .01. It is informative, however, to see what would have happened if the network had set α equal to .01. Figure 9.2 illustrates that as we decrease α from .05 to .01, the critical value zα increases from z.05 = 1.645 to z.01 = 2.33. Because the test statistic value z = 2.20 is less than z.01 = 2.33, we cannot reject H0: μ = 50 in favor of Ha: μ > 50 by setting α equal to .01. This illustrates the point that, the smaller we set α, the larger is the critical value, and thus the stronger is the statistical evidence that we are requiring to reject the null hypothesis H0. Some statisticians have concluded (somewhat subjectively) that (1) if we set α equal to .05, then we are requiring strong evidence to reject H0; and (2) if we set α equal to .01, then we are requiring very strong evidence to reject H0.
Figure 9.2: The Critical Values for Testing H0: μ = 50 versus Ha: μ > 50 by Setting α = .05 and .01

A p-value for testing a “greater than” alternative hypothesis

To decide whether to reject the null hypothesis H0 at level of significance α, steps 4, 5, and 6 of the seven-step hypothesis testing procedure compare the test statistic value with a critical value. Another way to make this decision is to calculate a p-value, which measures the likelihood of the sample results if the null hypothesis H0 is true. Sample results that are not likely if H0 is true are evidence that H0 is not true. To test H0 by using a p-value, we use the following steps 4, 5, and 6:
Step 4: Collect the sample data and compute the value of the test statistic. In the trash bag case, we have computed the value of the test statistic to be z = 2.20.
Step 5: Calculate the p-value by using the test statistic value. The p-value for testing H0: μ = 50 versus Ha: μ > 50 in the trash bag case is the area under the standard normal curve to the right of the test statistic value z = 2.20. As illustrated in Figure 9.3(b), this area is 1 − .9861 = .0139. The p-value is the probability, computed assuming that H0: μ = 50 is true, of observing a value of the test statistic that is greater than or equal to the value z = 2.20 that we have actually computed from the sample data. The p-value of .0139 says that, if H0: μ = 50 is true, then only 139 in 10,000 of all possible test statistic values are at least as large, or extreme, as the value z = 2.20. That is, if we are to believe that H0 is true, we must believe that we have observed a test statistic value that can be described as a 139 in 10,000 chance. Because it is difficult to believe that we have observed a 139 in 10,000 chance, we intuitively have strong evidence that H0: μ = 50 is false and Ha: μ > 50 is true.
Figure 9.3: Testing H0: μ = 50 versus Ha: μ > 50 by Using Critical Values and the p-Value

Step 6: Reject H0 if the p-value is less than α. Recall that the television network has set α equal to .05. The p-value of .0139 is less than the α of .05. Comparing the two normal curves in Figures 9.3(a) and (b), we see that this implies that the test statistic value z = 2.20 is greater than the critical value z.05 = 1.645. Therefore, we can reject H0 by setting α equal to .05. As another example, suppose that the television network had set α equal to .01. The p-value of .0139 is greater than the α of .01. Comparing the two normal curves in Figures 9.3(b) and (c), we see that this implies that the test statistic value z = 2.20 is less than the critical value z.01 = 2.33. Therefore, we cannot reject H0 by setting α equal to .01. Generalizing these examples, we conclude that the value of the test statistic z will be greater than the critical value zα if and only if the p-value is less than α. That is, we can reject H0 in favor of Ha at level of significance α if and only if the p-value is less than α.
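The p-value route can be sketched in software in the same way (a hypothetical check, assuming SciPy is available and using the trash bag test statistic quoted above):

```python
# p-value route for the same "greater than" test, as a quick check.
from scipy import stats

z = 2.20                                   # test statistic from the trash bag sample
p_value = 1 - stats.norm.cdf(z)            # right-tail area, about .0139

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha}: p = {p_value:.4f}, {decision}")
```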

Note: Much of the discussion in the following paragraph is based on an NBC advertising standards booklet (© NBC, Inc., used with permission), along with other information provided by NBC and CBS.
Comparing the critical value and p-value methods
Thus far we have considered two methods for testing H0: μ = 50 versus Ha: μ > 50 at the .05 and .01 values of α. Using the first method, we determine if the test statistic value z = 2.20 is greater than the critical values z.05 = 1.645 and z.01 = 2.33. Using the second method, we determine if the p-value of .0139 is less than .05 and .01. Whereas the critical value method requires that we look up a different critical value for each different α value, the p-value method requires only that we calculate a single p-value and compare it directly with the different α values. It follows that the p-value method is the most efficient way to test a hypothesis at different α values. This can be useful when there are different decision makers who might use different α values. For example, television networks do not always evaluate advertising claims by setting α equal to .05. The reason is that the consequences of a Type I error (advertising a false claim) are more serious for some claims than for others. For example, the consequences of a Type I error would be fairly serious for a claim about the effectiveness of a drug or for the superiority of one product over another. However, these consequences might not be as serious for a noncomparative claim about an inexpensive and safe product, such as a cosmetic. Networks sometimes use α values between .01 and .04 for claims having more serious Type I error consequences, and they sometimes use α values between .06 and .10 for claims having less serious Type I error consequences. Furthermore, one network’s policies for setting α can differ somewhat from those of another. As a result, reporting an advertising claim’s p-value to each network is the most efficient way to tell the network whether to allow the claim to be advertised. For example, most networks would evaluate the trash bag claim by choosing an α value between .025 and .10. Since the p-value of .0139 is less than all these α values, most networks would allow the trash bag claim to be advertised.
A summary of the seven steps of hypothesis testing
For almost every hypothesis test discussed in this book, statisticians have developed both a critical value rule and a p-value that can be used to perform the hypothesis test. Furthermore, it can be shown that for each hypothesis test the p-value has been defined so that we can reject the null hypothesis at level of significance α if and only if the p-value is less than α. We now summarize a seven-step procedure for performing a hypothesis test.
The Seven Steps of Hypothesis Testing

1 State the null hypothesis H0 and the alternative hypothesis Ha.
2 Specify the level of significance α.
3 Select the test statistic.
Using a critical value rule:

4 Determine the critical value rule for deciding whether to reject H0. Use the specified value of α to find the critical value in the critical value rule.
5 Collect the sample data and compute the value of the test statistic.
6 Decide whether to reject H0 by using the test statistic value and the critical value rule.
Using a p-value:

4 Collect the sample data and compute the value of the test statistic.
5 Calculate the p-value by using the test statistic value.
6 Reject H0 at level of significance α if the p-value is less than α.
7 Interpret your statistical results in managerial (real-world) terms and assess their practical importance.
In the real world both critical value rules and p-values are used to carry out hypothesis tests. For example, NBC uses critical value rules, whereas CBS uses p-values, to statistically verify the validity of advertising claims. Throughout this book we will continue to present both the critical value and the p-value approaches to hypothesis testing.
Testing a “less than” alternative hypothesis
We next consider the payment time case and testing a “less than” alternative hypothesis:
Step 1: In order to study whether the new electronic billing system reduces the mean bill payment time by more than 50 percent, the management consulting firm will test H0: μ ≥ 19.5 versus Ha: μ < 19.5.

Step 2: The management consulting firm wishes to make sure that it truthfully describes the benefits of the new system both to the Hamilton, Ohio, trucking company and to other companies that are considering installing such a system. Therefore, the firm will require very strong evidence to conclude that μ is less than 19.5, which implies that it will test H0: μ ≥ 19.5 versus Ha: μ < 19.5 by setting α equal to .01.

Step 3: In order to test H0: μ ≥ 19.5 versus Ha: μ < 19.5, we will test the modified null hypothesis H0: μ = 19.5 versus Ha: μ < 19.5. The idea here is that if there is sufficient evidence to reject the hypothesis that μ equals 19.5 in favor of μ < 19.5, then there is certainly also sufficient evidence to reject the hypothesis that μ is greater than or equal to 19.5. In order to test H0: μ = 19.5 versus Ha: μ < 19.5, we will randomly select a sample of n = 65 invoices paid using the billing system and calculate the mean x̄ of the payment times of these invoices. Since the sample size is large, the Central Limit Theorem applies, and we will utilize the test statistic

z = (x̄ − 19.5) / (σ/√n)

A value of the test statistic z that is less than zero results when x̄ is less than 19.5. This provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be less than 19.5.

Step 4: To decide how much less than zero the test statistic must be to reject H0 in favor of Ha by setting the probability of a Type I error equal to α, we do the following:

Place the probability of a Type I error, α, in the left-hand tail of the standard normal curve and use the normal table to find the critical value −zα. Here −zα is the negative of the normal point zα. That is, −zα is the point on the horizontal axis under the standard normal curve that gives a left-hand tail area equal to α.
Reject H0: μ = 19.5 in favor of Ha: μ < 19.5 if and only if the test statistic z is less than the critical value −zα. Because α equals .01, the critical value −zα is −z.01 = −2.33 [see Figure 9.4(a)].

Figure 9.4: Testing H0: μ = 19.5 versus Ha: μ < 19.5 by Using Critical Values and the p-Value

Step 5: When the sample of n = 65 invoices is randomly selected, the mean of the payment times of these invoices is calculated to be x̄ = 18.1077. Assuming that σ is known to equal 4.2, the value of the test statistic is

z = (18.1077 − 19.5) / (4.2/√65) = −2.67

Step 6: Since the test statistic value z = −2.67 is less than the critical value −z.01 = −2.33, we can reject H0: μ = 19.5 in favor of Ha: μ < 19.5 by setting α equal to .01.

Step 7: We conclude (at an α of .01) that the mean payment time for the new electronic billing system is less than 19.5 days. This, along with the fact that the sample mean is slightly less than 19.5, implies that it is reasonable for the management consulting firm to conclude that the new electronic billing system has reduced the mean payment time by slightly more than 50 percent (a substantial improvement over the old system).
A p-value for testing a “less than” alternative hypothesis

To test H0: μ = 19.5 versus Ha: μ < 19.5 in the payment time case by using a p-value, we use the following steps 4, 5, and 6:

Step 4: We have computed the value of the test statistic in the payment time case to be z = −2.67.

Step 5: The p-value for testing H0: μ = 19.5 versus Ha: μ < 19.5 is the area under the standard normal curve to the left of the test statistic value z = −2.67. As illustrated in Figure 9.4(b), this area is .0038. The p-value is the probability, computed assuming that H0: μ = 19.5 is true, of observing a value of the test statistic that is less than or equal to the value z = −2.67 that we have actually computed from the sample data. The p-value of .0038 says that, if H0: μ = 19.5 is true, then only 38 in 10,000 of all possible test statistic values are at least as negative, or extreme, as the value z = −2.67. That is, if we are to believe that H0 is true, we must believe that we have observed a test statistic value that can be described as a 38 in 10,000 chance.

Step 6: The management consulting firm has set α equal to .01. The p-value of .0038 is less than the α of .01. Therefore, we can reject H0 by setting α equal to .01.
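For readers who like to verify the arithmetic in software, a minimal Python sketch of this left-tailed test (assuming SciPy is available and using the payment-time figures quoted above) might look like this:

```python
# Left-tailed ("less than") z test for the payment time case, as a quick check.
import math
from scipy import stats

mu_0, sigma, n, alpha = 19.5, 4.2, 65, 0.01
x_bar = 18.1077                                  # sample mean payment time in days

z = (x_bar - mu_0) / (sigma / math.sqrt(n))      # test statistic, about -2.67
z_crit = -stats.norm.ppf(1 - alpha)              # left-tail critical value, about -2.33
p_value = stats.norm.cdf(z)                      # left-tail area, about .0038

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if z < z_crit else "Do not reject H0")
```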
Testing a “not equal to” alternative hypothesis

We next consider the Valentine’s Day chocolate case and testing a “not equal to” alternative hypothesis.

Step 1: To assess whether this year’s sales of its valentine box of assorted chocolates will be ten percent higher than last year’s, the candy company will test H0: μ = 330 versus Ha: μ ≠ 330. Here, μ is the mean order quantity of this year’s valentine box by large retail stores.

Step 2: If the candy company does not reject H0: μ = 330 and H0: μ = 330 is false—a Type II error—the candy company will base its production of valentine boxes on a 10 percent projected sales increase that is not correct. Since the candy company wishes to have a reasonably small probability of making this Type II error, the company will set α equal to .05. Setting α equal to .05 rather than .01 makes the probability of a Type II error smaller than it would be if α were set at .01. Note that in optional Section 9.5 we will verify that the probability of a Type II error in this situation is reasonably small. Therefore, if the candy company ends up not rejecting H0: μ = 330 and therefore decides to base its production of valentine boxes on the ten percent projected sales increase, the company can be intuitively confident that it has made the right decision.

Step 3: The candy company will randomly select n = 100 large retail stores and will make an early mailing to these stores promoting this year’s valentine box of assorted chocolates. The candy company will then ask each sampled retail store to report its anticipated order quantity of valentine boxes and will calculate the mean x̄ of the reported order quantities. Since the sample size is large, the Central Limit Theorem applies, and we will utilize the test statistic

z = (x̄ − 330) / (σ/√n)

A value of the test statistic that is greater than 0 results when x̄ is greater than 330. This provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be greater than 330. Similarly, a value of the test statistic that is less than 0 results when x̄ is less than 330. This also provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be less than 330.

Step 4: To decide how different from zero (positive or negative) the test statistic must be in order to reject H0 in favor of Ha by setting the probability of a Type I error equal to α, we do the following:

Divide the probability of a Type I error, α, into two equal parts, and place the area α/2 in the right-hand tail of the standard normal curve and the area α/2 in the left-hand tail of the standard normal curve. Then use the normal table to find the critical values zα/2 and −zα/2. Here zα/2 is the point on the horizontal axis under the standard normal curve that gives a right-hand tail area equal to α/2, and −zα/2 is the point giving a left-hand tail area equal to α/2.
Reject H0: μ = 330 in favor of Ha: μ ≠ 330 if and only if the test statistic z is greater than the critical value zα/2 or less than the critical value −zα/2. Note that this is equivalent to saying that we should reject H0 if and only if the absolute value of the test statistic, | z |, is greater than the critical value zα/2. Because α equals .05, the critical values are zα/2 = z.025 = 1.96 and −zα/2 = −z.025 = −1.96 [see Figure 9.5(a)].

Figure 9.5: Testing H0: μ = 330 versus Ha: μ ≠ 330 by Using Critical Values and the p-Value

Step 5: When the sample of n = 100 large retail stores is randomly selected, the mean of their reported order quantities is calculated to be x̄ = 326. Assuming that σ is known to equal 40, the value of the test statistic is

z = (326 − 330) / (40/√100) = −1

Step 6: Since the test statistic value z = −1 is greater than −z.025 = −1.96 (or, equivalently, since | z | = 1 is less than z.025 = 1.96), we cannot reject H0: μ = 330 in favor of Ha: μ ≠ 330 by setting α equal to .05.

Step 7: We cannot conclude (at an α of .05) that the mean order quantity of this year’s valentine box by large retail stores will differ from 330 boxes. Therefore, the candy company will base its production of valentine boxes on the ten percent projected sales increase.

A p-value for testing a “not equal to” alternative hypothesis

To test H0: μ = 330 versus Ha: μ ≠ 330 in the Valentine’s Day chocolate case by using a p-value, we use the following steps 4, 5, and 6:

Step 4: We have computed the value of the test statistic in the Valentine’s Day chocolate case to be z = −1.

Step 5: Note from Figure 9.5(b) that the area under the standard normal curve to the right of | z | = 1 is .1587. Twice this area—that is, 2(.1587) = .3174—is the p-value for testing H0: μ = 330 versus Ha: μ ≠ 330. To interpret the p-value as a probability, note that the symmetry of the standard normal curve implies that twice the area under the curve to the right of | z | = 1 equals the area under this curve to the right of 1 plus the area under the curve to the left of −1 [see Figure 9.5(b)]. Also, note that since both positive and negative test statistic values count against H0: μ = 330, a test statistic value that is either greater than or equal to 1 or less than or equal to −1 is at least as extreme as the observed test statistic value z = −1. It follows that the p-value of .3174 says that, if H0: μ = 330 is true, then 31.74 percent of all possible test statistic values are at least as extreme as z = −1. That is, if we are to believe that H0 is true, we must believe that we have observed a test statistic value that can be described as a 31.74 percent chance.

Step 6: The candy company has set α equal to .05. The p-value of .3174 is greater than the α of .05. Therefore, we cannot reject H0 by setting α equal to .05.
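As before, a small Python sketch (a hypothetical check assuming SciPy is available and using the figures above) can reproduce this two-sided test:

```python
# Two-sided z test for the Valentine's Day chocolate case, as a quick check.
import math
from scipy import stats

mu_0, sigma, n, alpha = 330.0, 40.0, 100, 0.05
x_bar = 326.0                                    # mean reported order quantity

z = (x_bar - mu_0) / (sigma / math.sqrt(n))      # test statistic, here -1
z_crit = stats.norm.ppf(1 - alpha / 2)           # two-sided critical value, 1.96
p_value = 2 * (1 - stats.norm.cdf(abs(z)))       # twice the tail area beyond |z|

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if abs(z) > z_crit else "Do not reject H0")
```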
A general procedure for testing a hypothesis about a population mean

In the trash bag case we have tested H0: μ ≤ 50 versus Ha: μ > 50 by testing H0: μ = 50 versus Ha: μ > 50. In the payment time case we have tested H0: μ ≥ 19.5 versus Ha: μ < 19.5 by testing H0: μ = 19.5 versus Ha: μ < 19.5. In general, the usual procedure for testing a “less than or equal to” null hypothesis or a “greater than or equal to” null hypothesis is to change the null hypothesis to an equality. We then test the “equal to” null hypothesis versus the alternative hypothesis. Furthermore, the critical value and p-value procedures for testing a null hypothesis versus an alternative hypothesis depend upon whether the alternative hypothesis is a “greater than,” a “less than,” or a “not equal to” alternative hypothesis. The following summary box gives the appropriate procedures. Specifically, letting μ0 be a particular number, the summary box shows how to test H0: μ = μ0 versus either Ha: μ > μ0, Ha: μ < μ0, or Ha: μ ≠ μ0:

Testing a Hypothesis about a Population Mean when σ Is Known

Define the test statistic

z = (x̄ − μ0) / (σ/√n)

and assume that the population sampled is normally distributed, or that the sample size n is large. We can test H0: μ = μ0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value.

Using confidence intervals to test hypotheses

Confidence intervals can be used to test hypotheses. Specifically, it can be proven that we can reject H0: μ = μ0 in favor of Ha: μ ≠ μ0 by setting the probability of a Type I error equal to α if and only if the 100(1 − α) percent confidence interval for μ does not contain μ0. For example, consider the Valentine’s Day chocolate case and testing H0: μ = 330 versus Ha: μ ≠ 330 by setting α equal to .05. To do this, we use the mean of the sample of n = 100 reported order quantities to calculate the 95 percent confidence interval for μ to be

[x̄ ± z.025 (σ/√n)] = [326 ± 1.96(40/√100)] = [326 ± 7.84] = [318.16, 333.84]

Because this interval does contain 330, we cannot reject H0: μ = 330 in favor of Ha: μ ≠ 330 by setting α equal to .05. Whereas we can use two-sided confidence intervals to test “not equal to” alternative hypotheses, we must use one-sided confidence intervals to test “greater than” or “less than” alternative hypotheses. We will not study one-sided confidence intervals in this book. However, it should be emphasized that we do not need to use confidence intervals (one-sided or two-sided) to test hypotheses. We can test hypotheses by using test statistics and critical values or p-values, and these are the approaches that we will feature throughout this book.
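A short sketch of the confidence-interval route (again a hypothetical Python check assuming SciPy is available, using the Valentine’s Day figures) might be:

```python
# Testing H0: mu = 330 vs. Ha: mu != 330 through a 95% confidence interval.
import math
from scipy import stats

mu_0, sigma, n, alpha = 330.0, 40.0, 100, 0.05
x_bar = 326.0

half_width = stats.norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)   # 1.96 * 4 = 7.84
ci = (x_bar - half_width, x_bar + half_width)                       # about (318.16, 333.84)

print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print("Reject H0" if not (ci[0] <= mu_0 <= ci[1]) else "Do not reject H0")
```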
Measuring the weight of evidence against the null hypothesis

We have seen that in some situations the decision to take an action is based solely on whether a null hypothesis can be rejected in favor of an alternative hypothesis by setting α equal to a single, prespecified value. For example, in the trash bag case the television network decided to run the trash bag commercial because H0: μ = 50 was rejected in favor of Ha: μ > 50 by setting α equal to .05. Also, in the payment time case the management consulting firm decided to claim that the new electronic billing system has reduced the Hamilton trucking company’s mean payment time by more than 50 percent because H0: μ = 19.5 was rejected in favor of Ha: μ < 19.5 by setting α equal to .01. Furthermore, in the Valentine’s Day chocolate case, the candy company decided to base its production of valentine boxes on the ten percent projected sales increase because H0: μ = 330 could not be rejected in favor of Ha: μ ≠ 330 by setting α equal to .05.

Although hypothesis testing at a fixed α level is sometimes used as the sole basis for deciding whether to take an action, this is not always the case. For example, consider again the payment time case. The reason that the management consulting firm wishes to make the claim about the new electronic billing system is to demonstrate the benefits of the new system both to the Hamilton company and to other trucking companies that are considering using such a system. Note, however, that a potential user will decide whether to install the new system by considering factors beyond the results of the hypothesis test. For example, the cost of the new billing system and the receptiveness of the company’s clients to using the new system are among other factors that must be considered. In complex business and industrial situations such as this, hypothesis testing is used to accumulate knowledge about and understand the problem at hand. The ultimate decision (such as whether to adopt the new billing system) is made on the basis of nonstatistical considerations, intuition, and the results of one or more hypothesis tests. Therefore, it is important to know all the information—called the weight of evidence—that a hypothesis test provides against the null hypothesis and in favor of the alternative hypothesis. Furthermore, even when hypothesis testing at a fixed α level is used as the sole basis for deciding whether to take an action, it is useful to evaluate the weight of evidence. For example, the trash bag manufacturer would almost certainly wish to know how much evidence there is that its new bag is stronger than its former bag. The most informative way to measure the weight of evidence is to use the p-value. For every hypothesis test considered in this book we can interpret the p-value to be the probability, computed assuming that the null hypothesis H0 is true, of observing a value of the test statistic that is at least as extreme as the value actually computed from the sample data. The smaller the p-value is, the less likely are the sample results if the null hypothesis H0 is true. Therefore, the stronger is the evidence that H0 is false and that the alternative hypothesis Ha is true. Experience with hypothesis testing has resulted in statisticians making the following (somewhat subjective) conclusions:

Interpreting the Weight of Evidence against the Null Hypothesis

If the p-value for testing H0 is less than
• .10, we have some evidence that H0 is false.
• .05, we have strong evidence that H0 is false.
• .01, we have very strong evidence that H0 is false.
• .001, we have extremely strong evidence that H0 is false.

We will frequently use these conclusions in future examples.
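As a small illustrative helper (not part of the text’s formal procedure), these cutoffs could be wrapped in a short Python function and applied to the p-values computed earlier:

```python
# Illustrative helper (not part of the text's formal procedure) that applies
# the weight-of-evidence cutoffs listed above to a computed p-value.
def weight_of_evidence(p_value):
    if p_value < 0.001:
        return "extremely strong evidence that H0 is false"
    if p_value < 0.01:
        return "very strong evidence that H0 is false"
    if p_value < 0.05:
        return "strong evidence that H0 is false"
    if p_value < 0.10:
        return "some evidence that H0 is false"
    return "little evidence that H0 is false"

# p-values from the trash bag, payment time, and Valentine's Day cases
for p in (0.0139, 0.0038, 0.3174):
    print(p, "->", weight_of_evidence(p))
```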
That is, we have strong evidence that the mean breaking strength of the new trash bag exceeds 50 pounds. As another example, the p-value for testing H0: μ = 19.5 versus Ha: μ < 19.5 in the payment time case is .0038. This p-value is less than .01 but not less than .001. Therefore, we have very strong evidence, but not extremely strong evidence, that H0: μ = 19.5 is false and Ha: μ < 19.5 is true. That is, we have very strong evidence that the new billing system has reduced the mean payment time by more than 50 percent. Finally, the p-value for testing H0: μ = 330 versus Ha: μ ≠ 330 in the Valentine’s Day chocolate case is .3174. This p-value is greater than .10. Therefore, we have little evidence that H0: μ = 330 is false and Ha: μ ≠ 330 is true. That is, we have little evidence that the increase in the mean order quantity of the valentine box by large retail stores will differ from ten percent. Exercises for Section 9.2 CONCEPTS 9.14 Explain what a critical value is, and explain how it is used to test a hypothesis. 9.15 Explain what a p-value is, and explain how it is used to test a hypothesis. METHODS AND APPLICATIONS In Exercises 9.16 through 9.22 we consider using a random sample of 100 measurements to test H0: μ = 80 versus Ha: μ > 80. If and σ = 20:
9.16 Calculate the value of the test statistic z.
9.17 Use a critical value to test H0 versus Ha by setting α equal to .10.
9.18 Use a critical value to test H0 versus Ha by setting α equal to .05.
9.19 Use a critical value to test H0 versus Ha by setting α equal to .01.
9.20 Use a critical value to test H0 versus Ha by setting α equal to .001.
9.21 Calculate the p-value and use it to test H0 versus Ha at each of α = .10, .05, .01, and .001.
9.22 How much evidence is there that H0: μ = 80 is false and Ha: μ > 80 is true?
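The calculations called for in Exercises 9.16 through 9.22 follow directly from the summary box for the z test. A minimal sketch in Python (scipy is assumed to be available; x_bar below is a placeholder, so substitute the sample mean given in the exercise):

from scipy.stats import norm

# Pattern of Exercises 9.16-9.22: H0: mu = 80 versus Ha: mu > 80, n = 100, sigma = 20
mu0, sigma, n = 80, 20, 100
x_bar = 85                                   # placeholder sample mean; use the value in the exercise

z = (x_bar - mu0) / (sigma / n ** 0.5)       # test statistic
p_value = norm.sf(z)                         # right-tail area, appropriate for Ha: mu > mu0
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

for alpha in (.10, .05, .01, .001):
    z_alpha = norm.ppf(1 - alpha)            # critical value z_alpha
    print(f"alpha = {alpha}: critical value = {z_alpha:.3f}, reject H0: {z > z_alpha}")

# For a "not equal to" alternative (the pattern of Exercises 9.30-9.36), H0: mu = mu0 is rejected
# at the .05 level exactly when mu0 falls outside the 95 percent confidence interval:
half_width = norm.ppf(.975) * sigma / n ** 0.5
print("95% CI:", (x_bar - half_width, x_bar + half_width))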
In Exercises 9.23 through 9.29 we consider using a random sample of 49 measurements to test H0: μ = 20 versus Ha: μ < 20. If and σ = 7: 9.23 Calculate the value of the test statistic z. 9.24 Use a critical value to test H0 versus Ha by setting α equal to .10. 9.25 Use a critical value to test H0 versus Ha by setting α equal to .05. 9.26 Use a critical value to test H0 versus Ha by setting α equal to .01. 9.27 Use a critical value to test H0 versus Ha by setting α equal to .001. 9.28 Calculate the p-value and use it to test H0 versus Ha at each of α = .10, .05, .01, and .001. 9.29 How much evidence is there that H0: μ = 20 is false and Ha: μ < 20 is true? In Exercises 9.30 through 9.36 we consider using a random sample of n = 81 measurements to test H0: μ = 40 versus Ha: μ ≠ 40. If and σ = 18: 9.30 Calculate the value of the test statistic z. 9.31 Use critical values to test H0 versus Ha by setting α equal to .10. 9.32 Use critical values to test H0 versus Ha by setting α equal to .05. 9.33 Use critical values to test H0 versus Ha by setting α equal to .01. 9.34 Use critical values to test H0 versus Ha by setting α equal to .001. 9.35 Calculate the p-value and use it to test H0 versus Ha at each of α = .10, .05, .01, and .001. 9.36 How much evidence is there that H0: μ = 40 is false and Ha: μ ≠ 40 is true? 9.37 THE VIDEO GAME SATISFACTION RATING CASE VideoGame Recall that “very satisfied” customers give the XYZ-Box video game system a rating that is at least 42. Suppose that the manufacturer of the XYZ-Box wishes to use the random sample of 65 satisfaction ratings to provide evidence supporting the claim that the mean composite satisfaction rating for the XYZ-Box exceeds 42. a Letting μ represent the mean composite satisfaction rating for the XYZ-Box, set up the null hypothesis H0 and the alternative hypothesis Ha needed if we wish to attempt to provide evidence supporting the claim that μ exceeds 42. b The random sample of 65 satisfaction ratings yields a sample mean of . Assuming that σ equals 2.64, use critical values to test H0 versus Ha at each of α = .10, .05, .01, and .001. c Using the information in part (b), calculate the p-value and use it to test H0 versus Ha at each of α = .10, .05, .01, and .001. d How much evidence is there that the mean composite satisfaction rating exceeds 42? 9.38 THE BANK CUSTOMER WAITING TIME CASE WaitTime Letting μ be the mean waiting time under the new system, we found in Exercise 9.9 that we should test H0: μ ≥ 6 versus Ha: μ < 6 in order to attempt to provide evidence that μ is less than six minutes. The random sample of 100 waiting times yields a sample mean of minutes. Moreover, Figure 9.6 gives the MINITAB output obtained when we use the waiting time data to test H0: μ = 6 versus Ha: μ < 6. On this output the label “SE Mean,” which stands for “the standard error of the mean,” denotes the quantity , and the label “Z” denotes the calculated test statistic. Assuming that σ equals 2.47: a Use critical values to test H0 versus Ha at each of α = .10, .05, .01, and .001. b Calculate the p-value and verify that it equals .014, as shown on the MINITAB output. Use the p-value to test H0 versus Ha at each of α = .10, .05, .01, and .001. c How much evidence is there that the new system has reduced the mean waiting time to below six minutes? 
Figure 9.6: MINITAB Output of the Test of H0: μ = 6 versus Ha: μ < 6 in the Bank Customer Waiting Time Case Note: Because the test statistic z has a denominator that uses the population standard deviation σ, MINITAB makes the user specify an assumed value for σ. 9.39 Again consider the audit delay situation of Exercise 8.11. Letting μ be the mean audit delay for all public owner-controlled companies in New Zealand, formulate the null hypothesis H0 and the alternative hypothesis Ha that would be used to attempt to provide evidence supporting the claim that μ is less than 90 days. Suppose that a random sample of 100 public owner-controlled companies in New Zealand is found to give a mean audit delay of days. Assuming that σ equals 32.83, calculate the p-value for testing H0 versus Ha and determine how much evidence there is that the mean audit delay for all public owner-controlled companies in New Zealand is less than 90 days. 9.40 Consolidated Power, a large electric power utility, has just built a modern nuclear power plant. This plant discharges waste water that is allowed to flow into the Atlantic Ocean. The Environmental Protection Agency (EPA) has ordered that the waste water may not be excessively warm so that thermal pollution of the marine environment near the plant can be avoided. Because of this order, the waste water is allowed to cool in specially constructed ponds and is then released into the ocean. This cooling system works properly if the mean temperature of waste water discharged is 60°F or cooler. Consolidated Power is required to monitor the temperature of the waste water. A sample of 100 temperature readings will be obtained each day, and if the sample results cast a substantial amount of doubt on the hypothesis that the cooling system is working properly (the mean temperature of waste water discharged is 60°F or cooler), then the plant must be shut down and appropriate actions must be taken to correct the problem. a Consolidated Power wishes to set up a hypothesis test so that the power plant will be shut down when the null hypothesis is rejected. Set up the null hypothesis H0 and the alternative hypothesis Ha that should be used. b Suppose that Consolidated Power decides to use a level of significance of α = .05, and suppose a random sample of 100 temperature readings is obtained. If the sample mean of the 100 temperature readings is , test H0 versus Ha and determine whether the power plant should be shut down and the cooling system repaired. Perform the hypothesis test by using a critical value and a p-value. Assume σ = 2. 9.41 Do part (b) of Exercise 9.40 if . 9.42 Do part (b) of Exercise 9.40 if . 9.43 An automobile parts supplier owns a machine that produces a cylindrical engine part. This part is supposed to have an outside diameter of three inches. Parts with diameters that are too small or too large do not meet customer requirements and must be rejected. Lately, the company has experienced problems meeting customer requirements. The technical staff feels that the mean diameter produced by the machine is off target. In order to verify this, a special study will randomly sample 100 parts produced by the machine. The 100 sampled parts will be measured, and if the results obtained cast a substantial amount of doubt on the hypothesis that the mean diameter equals the target value of three inches, the company will assign a problem-solving team to intensively search for the causes of the problem. 
a The parts supplier wishes to set up a hypothesis test so that the problem-solving team will be assigned when the null hypothesis is rejected. Set up the null and alternative hypotheses for this situation. b A sample of 40 parts yields a sample mean diameter of inches. Assuming σ equals .016, use a critical value and a p-value to test H0 versus Ha by setting α equal to .05. Should the problem-solving team be assigned? 9.44 The Crown Bottling Company has just installed a new bottling process that will fill 16-ounce bottles of the popular Crown Classic Cola soft drink. Both overfilling and underfilling bottles are undesirable: Underfilling leads to customer complaints and overfilling costs the company considerable money. In order to verify that the filler is set up correctly, the company wishes to see whether the mean bottle fill, μ, is close to the target fill of 16 ounces. To this end, a random sample of 36 filled bottles is selected from the output of a test filler run. If the sample results cast a substantial amount of doubt on the hypothesis that the mean bottle fill is the desired 16 ounces, then the filler’s initial setup will be readjusted. a The bottling company wants to set up a hypothesis test so that the filler will be readjusted if the null hypothesis is rejected. Set up the null and alternative hypotheses for this hypothesis test. b Suppose that Crown Bottling Company decides to use a level of significance of α = .01, and suppose a random sample of 36 bottle fills is obtained from a test run of the filler. For each of the following three sample means, determine whether the filler’s initial setup should be readjusted. In each case, use a critical value and a p-value, and assume that σ equals .1. 9.45 Use the first sample mean in Exercise 9.44 and a confidence interval to perform the hypothesis test by setting α equal to .05. What considerations would help you to decide whether the result has practical importance? 9.46 THE DISK BRAKE CASE National Motors has equipped the ZX-900 with a new disk brake system. We define the stopping distance for a ZX-900 as the distance (in feet) required to bring the automobile to a complete stop from a speed of 35 mph under normal driving conditions using this new brake system. In addition, we define μ to be the mean stopping distance of all ZX-900s. One of the ZX-900’s major competitors is advertised to achieve a mean stopping distance of 60 ft. National Motors would like to claim in a new television commercial that the ZX-900 achieves a shorter mean stopping distance. a Set up the null hypothesis H0 and the alternative hypothesis Ha that would be used to attempt to provide evidence supporting the claim that μ is less than 60. b A television network will permit National Motors to claim that the ZX-900 achieves a shorter mean stopping distance than the competitor if H0 can be rejected in favor of Ha by setting α equal to .05. If the stopping distances of a random sample of n = 81 ZX-900s have a mean of , will National Motors be allowed to run the commercial? Perform the hypothesis test by using a critical value and a p-value. Assume here that σ = 6.02. 9.47 Consider part (b) of Exercise 9.46, and calculate a 95 percent confidence interval for μ. Do the point estimate of μ and confidence interval for μ indicate that μ might be far enough below 60 feet to suggest that we have a practically important result? 9.48 Recall from Exercise 8.12 that Bayus (1991) studied the mean numbers of auto dealers visited by early and late replacement buyers. 
a Letting μ be the mean number of dealers visited by early replacement buyers, suppose that we wish to test H0: μ = 4 versus Ha: μ ≠ 4. A random sample of 800 early replacement buyers yields a mean number of dealers visited of . Assuming σ equals .71, calculate the p-value and test H0 versus Ha. Do we estimate that μ is less than 4 or greater than 4?
b Letting μ be the mean number of dealers visited by late replacement buyers, suppose that we wish to test H0: μ = 4 versus Ha: μ ≠ 4. A random sample of 500 late replacement buyers yields a mean number of dealers visited of . Assuming σ equals .66, calculate the p-value and test H0 versus Ha. Do we estimate that μ is less than 4 or greater than 4?
9.3: t Tests about a Population Mean: σ Unknown
If we do not know σ (which is usually the case), we can base a hypothesis test about μ on the sampling distribution of (x̄ − μ)/(s/√n). If the sampled population is normally distributed, then this sampling distribution is a t distribution having n − 1 degrees of freedom. This leads to the following results:
t Tests about a Population Mean: σ Unknown
Define the test statistic t = (x̄ − μ0)/(s/√n) and assume that the population sampled is normally distributed. We can test H0: μ = μ0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value. Here tα, tα/2, and the p-values are based on n − 1 degrees of freedom.
In the rest of this chapter and in Chapter 10 we will present most of the hypothesis testing examples by using hypothesis testing summary boxes and the seven hypothesis testing steps given in the previous section. However, to be concise, we will not formally number each hypothesis testing step. Rather, for each of the first six steps, we will set out in boldface font a key phrase that indicates that the step is being carried out. Then, we will highlight the seventh step—the business improvement conclusion—as we highlight all business improvement conclusions in this book. After Chapter 10, we will continue to use hypothesis testing summary boxes, and we will more informally use the seven steps. As illustrated in the following example, we will often first use a critical value rule to test the hypotheses under consideration at a fixed value of α and then use a p-value to assess the weight of evidence against the null hypothesis.
EXAMPLE 9.4
In 1991 the average interest rate charged by U.S. credit card issuers was 18.8 percent. Since that time, there has been a proliferation of new credit cards affiliated with retail stores, oil companies, alumni associations, professional sports teams, and so on. A financial officer wishes to study whether the increased competition in the credit card business has reduced interest rates. To do this, the officer will test a hypothesis about the current mean interest rate, μ, charged by U.S. credit card issuers. The null hypothesis to be tested is H0: μ = 18.8%, and the alternative hypothesis is Ha: μ < 18.8%. If H0 can be rejected in favor of Ha at the .05 level of significance, the officer will conclude that the current mean interest rate is less than the 18.8% mean interest rate charged in 1991. To perform the hypothesis test, suppose that we randomly select n = 15 credit cards and determine their current interest rates. The interest rates for the 15 sampled cards are given in Table 9.3. A stem-and-leaf display and MINITAB box plot are given in Figure 9.7.
The stem-and-leaf display looks reasonably mound-shaped, and both the stem-and-leaf display and the box plot look reasonably symmetrical. It follows that it is appropriate to calculate the value of the test statistic t in the summary box. Furthermore, since Ha: μ < 18.8% is of the form Ha: μ < μ0, we should reject H0: μ = 18.8% if the value of t is less than the critical value −tα = −t.05 = −1.761. Here, −t.05 = −1.761 is based on n − 1 = 15 − 1 = 14 degrees of freedom and this critical value is illustrated in Figure 9.8(a). The mean and the standard deviation of the n = 15 interest rates in Table 9.3 are x̄ = 16.827 and s = 1.538. This implies that the value of the test statistic is t = (x̄ − 18.8)/(s/√n) = (16.827 − 18.8)/(1.538/√15) = −4.97.
Figure 9.7: Stem-and-Leaf Display and Box Plot of the Interest Rates
Figure 9.8: Testing H0: μ = 18.8% versus Ha: μ < 18.8% by Using a Critical Value and a p-Value
Table 9.3: Interest Rates Charged by 15 Randomly Selected Credit Cards CreditCd
Since t = −4.97 is less than −t.05 = −1.761, we reject H0: μ = 18.8% in favor of Ha: μ < 18.8%. That is, we conclude (at an α of .05) that the current mean credit card interest rate is lower than 18.8%, the mean interest rate in 1991. Furthermore, the sample mean says that we estimate the mean interest rate is 18.8% − 16.827% = 1.973% lower than it was in 1991. The p-value for testing H0: μ = 18.8% versus Ha: μ < 18.8% is the area under the curve of the t distribution having 14 degrees of freedom to the left of t = −4.97. Tables of t points (such as Table A.4, page 864) are not complete enough to give such areas for most t statistic values, so we use computer software packages to calculate p-values that are based on the t distribution. For example, the MINITAB output in Figure 9.9(a) and the MegaStat output in Figure 9.10 tell us that the p-value for testing H0: μ = 18.8% versus Ha: μ < 18.8% is .0001. Notice that both MINITAB and MegaStat round p-values to three or four decimal places. The Excel output in Figure 9.9(b) gives the slightly more accurate value of 0.000103 for the p-value. Because this p-value is less than .05, .01, and .001, we can reject H0 at the .05, .01, and .001 levels of significance. Also note that the p-value of .0001 on the MegaStat output is shaded dark yellow. This indicates that we can reject H0 at the .01 level of significance (light yellow shading would indicate significance at the .05 level, but not at the .01 level). As a probability, the p-value of .0001 says that if we are to believe that H0: μ = 18.8% is true, we must believe that we have observed a t statistic value (t = −4.97) that can be described as a 1 in 10,000 chance. In summary, we have extremely strong evidence that H0: μ = 18.8% is false and Ha: μ < 18.8% is true. That is, we have extremely strong evidence that the current mean credit card interest rate is less than 18.8%.
Figure 9.9: The MINITAB and Excel Outputs for Testing H0: μ = 18.8% versus Ha: μ < 18.8%
Figure 9.10: The MegaStat Output for Testing H0: μ = 18.8% versus Ha: μ < 18.8%
Recall that in three cases discussed in Section 9.2 we tested hypotheses by assuming that the population standard deviation σ is known and by using z tests. If σ is actually not known in these cases (which would probably be true), we should test the hypotheses under consideration by using t tests. Furthermore, recall that in each case the sample size is large (at least 30). In general, it can be shown that if the sample size is large, the t test is approximately valid even if the sampled population is not normally distributed (or mound shaped).
Therefore, consider the Valentine’s Day chocolate case and testing H0: μ = 330 versus Ha: μ ≠ 330 at the .05 level of significance. To perform the hypothesis test, assume that we will randomly select n = 100 large retail stores and use their anticipated order quantities to calculate the value of the test statistic t in the summary box. Then, since the alternative hypothesis Ha: μ ≠ 330 is of the form Ha: μ ≠ μ0, we will reject H0: μ = 330 if the absolute value of t is greater than tα/2 = t.025 = 1.984 (based on n − 1 = 99 degrees of freedom). Suppose that when the sample is randomly selected, the mean and the standard deviation of the n = 100 reported order quantities are calculated to be and s = 39.1. The value of the test statistic is Since | t | = 1.023 is less than t.025 = 1.984, we cannot reject H0: μ = 330 by setting α equal to .05. It follows that we cannot conclude (at an α of .05) that this year’s mean order quantity of the valentine box by large retail stores will differ from 330 boxes. Therefore, the candy company will base its production of valentine boxes on the ten percent projected sales increase. The p-value for the hypothesis test is twice the area under the t distribution curve having 99 degrees of freedom to the right of | t | = 1.023. Using a computer, we find that this p-value is .3088, which provides little evidence against H0: μ = 330 and in favor of Ha: μ ≠ 330. As another example, consider the trash bag case and note that the sample of n = 40 trash bag breaking strengths has mean and standard deviation s = 1.6438. The p-value for testing H0: μ = 50 versus Ha: μ > 50 is the area under the t distribution curve having n − 1 = 39 degrees of freedom to the right of
t = (x̄ − 50)/(s/√n) = (x̄ − 50)/(1.6438/√40)
Using a computer, we find that this p-value is .0164, which provides strong evidence against H0: μ = 50 and in favor of Ha: μ > 50. In particular, recall that most television networks would evaluate the claim that the new trash bag has a mean breaking strength that exceeds 50 pounds by choosing an α value between .025 and .10. It follows, since the p-value of .0164 is less than all these α values, that most networks would allow the trash bag claim to be advertised.
As a third example, consider the payment time case and note that the sample of n = 65 payment times has mean and standard deviation s = 3.9612. The p-value for testing H0: μ = 19.5 versus Ha: μ < 19.5 is the area under the t distribution curve having n − 1 = 64 degrees of freedom to the left of Using a computer, we find that this p-value is .0031, which is less than the management consulting firm’s α value of .01. It follows that the consulting firm will claim that the new electronic billing system has reduced the Hamilton, Ohio, trucking company’s mean bill payment time by more than 50 percent. To conclude this section, note that if the sample size is small (<30) and the sampled population is not mound-shaped, or if the sampled population is highly skewed, then it might be appropriate to use a nonparametric test about the population median. Such a test is discussed in Chapter 18. Exercises for Section 9.3 CONCEPTS 9.49 What assumptions must be met in order to carry out the test about a population mean based on the t distribution? 9.50 How do we decide whether to use a z test or a t test when testing a hypothesis about a population mean? METHODS AND APPLICATIONS 9.51 Suppose that a random sample of 16 measurements from a normally distributed population gives a sample mean of and a sample standard deviation of s = 6. Use critical values to test H0: μ ≤ 10 versus Ha: μ > 10 using levels of significance α = .10, α = .05, α = .01, and α = .001. What do you conclude at each value of α?
9.52 Suppose that a random sample of nine measurements from a normally distributed population gives a sample mean of and a sample standard deviation of s = .3. Use critical values to test H0: μ = 3 versus Ha: μ ≠ 3 using levels of significance α = .10, α = .05, α = .01, and α = .001. What do you conclude at each value of α?
9.53 THE AIR TRAFFIC CONTROL CASE AlertTime
Recall that it is hoped that the mean alert time, μ, using the new display panel is less than eight seconds. Formulate the null hypothesis H0 and the alternative hypothesis Ha that would be used to attempt to provide evidence that μ is less than eight seconds. The mean and the standard deviation of the sample of n = 15 alert times are and s = 1.0261. Perform a t test of H0 versus Ha by setting α equal to .05 and using a critical value. Interpret the results of the test.
9.54 THE AIR TRAFFIC CONTROL CASE AlertTime
The p-value for the hypothesis test of Exercise 9.53 can be computer calculated to be .0200. How much evidence is there that μ is less than eight seconds?
9.55 The bad debt ratio for a financial institution is defined to be the dollar value of loans defaulted divided by the total dollar value of all loans made. Suppose that a random sample of seven Ohio banks is selected and that the bad debt ratios (written as percentages) for these banks are 7%, 4%, 6%, 7%, 5%, 4%, and 9%. BadDebt
a Banking officials claim that the mean bad debt ratio for all Midwestern banks is 3.5 percent and that the mean bad debt ratio for Ohio banks is higher. Set up the null and alternative hypotheses needed to attempt to provide evidence supporting the claim that the mean bad debt ratio for Ohio banks exceeds 3.5 percent.
b Assuming that bad debt ratios for Ohio banks are approximately normally distributed, use critical values and the given sample information to test the hypotheses you set up in part a by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean bad debt ratio for Ohio banks exceeds 3.5 percent? What does this say about the banking official’s claim?
c Are you qualified to decide whether we have a practically important result? Who would be? How might practical importance be defined in this situation?
d The p-value for the hypothesis test of part (b) can be computer calculated to be .006. What does this p-value say about whether the mean bad debt ratio for Ohio banks exceeds 3.5 percent?
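The bad debt ratios in Exercise 9.55 are listed in full, so the t statistic can be checked directly. A sketch in Python (it should reproduce, up to rounding, the .006 p-value quoted in part d):

import numpy as np
from scipy import stats

ratios = np.array([7, 4, 6, 7, 5, 4, 9], dtype=float)   # bad debt ratios (%) for the seven Ohio banks
mu0 = 3.5                                                # claimed mean bad debt ratio

n, x_bar, s = len(ratios), ratios.mean(), ratios.std(ddof=1)
t = (x_bar - mu0) / (s / np.sqrt(n))                     # t statistic with n - 1 = 6 degrees of freedom
p_value = stats.t.sf(t, df=n - 1)                        # right-tail area for Ha: mu > 3.5

print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, t = {t:.3f}, p-value = {p_value:.4f}")
for alpha in (.10, .05, .01, .001):
    t_alpha = stats.t.ppf(1 - alpha, df=n - 1)
    print(f"alpha = {alpha}: critical value = {t_alpha:.3f}, reject H0: {t > t_alpha}")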
9.56 In the book Business Research Methods, Donald R. Cooper and C. William Emory (1995) discuss using hypothesis testing to study receivables outstanding. To quote Cooper and Emory:
…the controller of a large retail chain may be concerned about a possible slowdown in payments by the company’s customers. She measures the rate of payment in terms of the average number of days receivables outstanding. Generally, the company has maintained an average of about 50 days with a standard deviation of 10 days. Since it would be too expensive to analyze all of a company’s receivables frequently, we normally resort to sampling.
a Set up the null and alternative hypotheses needed to attempt to show that there has been a slowdown in payments by the company’s customers (there has been a slowdown if the average days outstanding exceeds 50).
b Assume approximate normality and suppose that a random sample of 25 accounts gives an average days outstanding of with a standard deviation of s = 8. Use critical values to test the hypotheses you set up in part a at levels of significance α = .10, α = .05, α = .01, and α = .001. How much evidence is there of a slowdown in payments?
c Are you qualified to decide whether this result has practical importance? Who would be?
9.57 Consider a chemical company that wishes to determine whether a new catalyst, catalyst XA-100, changes the mean hourly yield of its chemical process from the historical process mean of 750 pounds per hour. When five trial runs are made using the new catalyst, the following yields (in pounds per hour) are recorded: 801, 814, 784, 836, and 820. ChemYield
a Let μ be the mean of all possible yields using the new catalyst. Assuming that chemical yields are approximately normally distributed, the MegaStat output of the test statistic and p-value, and the Excel output of the p-value, for testing H0: μ = 750 versus Ha: μ ≠ 750 are as follows:

(Here we had Excel calculate twice the area under the t distribution curve having 4 degrees of freedom to the right of 6.942585.) Use the sample data to verify that the values of x̄, s, and t given on the output are correct.
b Use the test statistic and critical values to test H0 versus Ha by setting α equal to .10, .05, .01, and .001.
9.58 Consider Exercise 9.57. Use the p-value to test H0: μ = 750 versus Ha: μ ≠ 750 by setting α equal to .10, .05, .01, and .001. How much evidence is there that the new catalyst changes the mean hourly yield?
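The five trial-run yields in Exercise 9.57 are given, so the values on the MegaStat and Excel output can be verified with scipy's one-sample t test, whose default two-sided p-value is the same "twice the right-tail area" that Excel computes. A sketch:

import numpy as np
from scipy import stats

yields = np.array([801, 814, 784, 836, 820], dtype=float)   # hourly yields with catalyst XA-100
mu0 = 750                                                    # historical mean hourly yield

t, p_two_sided = stats.ttest_1samp(yields, popmean=mu0)      # two-sided test of H0: mu = 750
print(f"x_bar = {yields.mean():.1f}, s = {yields.std(ddof=1):.4f}")
print(f"t = {t:.6f}, two-sided p-value = {p_two_sided:.6f}")

for alpha in (.10, .05, .01, .001):
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(yields) - 1)  # critical value t_{alpha/2} with 4 df
    print(f"alpha = {alpha}: reject H0: {abs(t) > t_crit}")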
9.59 Whole Foods is an all-natural grocery chain that has 50,000 square foot stores, up from the industry average of 34,000 square feet. Sales per square foot of supermarkets average just under $400 per square foot, as reported by USA Today in an article on “A whole new ballgame in grocery shopping.” Suppose that sales per square foot in the most recent fiscal year are recorded for a random sample of 10 Whole Foods supermarkets. The data (sales dollars per square foot) are as follows: 854, 858, 801, 892, 849, 807, 894, 863, 829, 815. Let μ denote the mean sales dollars per square foot for all Whole Foods supermarkets during the most recent fiscal year, and note that the historical mean sales dollars per square foot for Whole Foods supermarkets in previous years has been $800. Below we present the MINITAB output obtained by using the sample data to test H0: μ = 800 versus Ha: μ > 800. WholeFoods

a Use the p-value to test H0 versus Ha by setting α equal to .10, .05, and .01.
b How much evidence is there that μ exceeds $800?
9.60 Consider Exercise 9.59. Do you think that the difference between the sample mean of $846.20 and the historical average of $800 has practical importance?
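Because the MINITAB output referred to in Exercise 9.59 is not shown above, the following sketch regenerates the same one-sided test from the ten sales figures (an illustration, not the MINITAB output itself):

import numpy as np
from scipy import stats

sales = np.array([854, 858, 801, 892, 849, 807, 894, 863, 829, 815], dtype=float)
mu0 = 800                                    # historical mean sales dollars per square foot

n, x_bar, s = len(sales), sales.mean(), sales.std(ddof=1)
t = (x_bar - mu0) / (s / np.sqrt(n))
p_value = stats.t.sf(t, df=n - 1)            # right-tail area for Ha: mu > 800

print(f"n = {n}, x_bar = {x_bar:.2f}, s = {s:.2f}")
print(f"t = {t:.2f}, one-sided p-value = {p_value:.4f}")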
9.61 THE VIDEO GAME SATISFACTION RATING CASE VideoGame
The mean and the standard deviation of the sample of n = 65 customer satisfaction ratings are and s = 2.6424. Let μ denote the mean of all possible customer satisfaction ratings for the XYZ-Box video game system, and consider testing H0: μ = 42 versus Ha: μ > 42. Perform a t test of these hypotheses by setting α equal to .05 and using a critical value. Also, interpret the p-value of .0025 for the hypothesis test.
9.62 THE BANK CUSTOMER WAITING TIME CASE WaitTime
The mean and the standard deviation of the sample of 100 bank customer waiting times are and s = 2.475. Let μ denote the mean of all possible bank customer waiting times using the new system and consider testing H0: μ = 6 versus Ha: μ < 6. Perform a t test of these hypotheses by setting α equal to .05 and using a critical value. Also, interpret the p-value of .0158 for the hypothesis test. 9.4: z Tests about a Population Proportion In this section we study a large sample hypothesis test about a population proportion (that is, about the fraction of population units that possess some characteristic). We begin with an example. EXAMPLE 9.5: The Cheese Spread Case Recall that the soft cheese spread producer has decided that replacing the current spout with the new spout is profitable only if p, the true proportion of all current purchasers who would stop buying the cheese spread if the new spout were used, is less than .10. The producer feels that it is unwise to change the spout unless it has very strong evidence that p is less than .10. Therefore, the spout will be changed if and only if the null hypothesis H0: p = .10 can be rejected in favor of the alternative hypothesis Ha: p < .10 at the .01 level of significance. In order to see how to test this kind of hypothesis, remember that when n is large, the sampling distribution of is approximately a standard normal distribution. Let p0 denote a specified value between 0 and 1 (its exact value will depend on the problem), and consider testing the null hypothesis H0: p = p0. We then have the following result: A Large Sample Test about a Population Proportion Define the test statistic If the sample size n is large, we can test H0: p = p0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value. Here n should be considered large if both np0 and n(1 − p0) are at least 5.3 EXAMPLE 9.6: The Cheese Spread Case We have seen that the cheese spread producer wishes to test H0: p = .10 versus Ha: p < .10, where p is the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used. The producer will use the new spout if H0 can be rejected in favor of Ha at the .01 level of significance. To perform the hypothesis test, we will randomly select n = 1,000 current purchasers of the cheese spread, find the proportion of these purchasers who would stop buying the cheese spread if the new spout were used, and calculate the value of the test statistic z in the summary box. Then, since the alternative hypothesis Ha: p < .10 is of the form Ha: p < p0, we will reject H0: p = .10 if the value of z is less than − z α = − z.01= −2.33. (Note that using this procedure is valid because np0 = 1,000(.10) = 100 and n(1 − p0) = 1,000(1 − .10) = 900 are both at least 5.) Suppose that when the sample is randomly selected, we find that 63 of the 1,000 current purchasers say they would stop buying the cheese spread if the new spout were used. Since , the value of the test statistic is Because z = −3.90 is less than −z.01 = −2.33, we reject H0: p = .10 in favor of Ha: p < .10. That is, we conclude (at an α of .01) that the proportion of current purchasers who would stop buying the cheese spread if the new spout were used is less than .10. It follows that the company will use the new spout. Furthermore, the point estimate says we estimate that 6.3 percent of all current customers would stop buying the cheese spread if the new spout were used. 
Although the cheese spread producer has made its decision by setting α equal to a single, prechosen value (.01), it would probably also wish to know the weight of evidence against H0 and in favor of Ha. The p-value is the area under the standard normal curve to the left of z = − 3.90. Table A.3 (page 862) tells us that this area is .00005. Because this p-value is less than .001, we have extremely strong evidence that Ha: p < .10 is true. That is, we have extremely strong evidence that fewer than 10 percent of current purchasers would stop buying the cheese spread if the new spout were used. EXAMPLE 9.7 Recent medical research has sought to develop drugs that lessen the severity and duration of viral infections. Virol, a relatively new drug, has been shown to provide relief for 70 percent of all patients suffering from viral upper respiratory infections. A major drug company is developing a competing drug called Phantol. The drug company wishes to investigate whether Phantol is more effective than Virol. To do this, the drug company will test a hypothesis about the true proportion, p, of all patients whose symptoms would be relieved by Phantol. The null hypothesis to be tested is H0: p = .70, and the alternative hypothesis is Ha: p > .70. If H0 can be rejected in favor of Ha at the .05 level of significance, the drug company will conclude that Phantol helps more than the 70 percent of patients helped by Virol. To perform the hypothesis test, we will randomly select n = 300 patients having viral upper respiratory infections, find the proportion of these patients whose symptoms are relieved by Phantol and calculate the value of the test statistic z in the summary box. Then, since the alternative hypothesis Ha: p > .70 is of the form Ha: p> p0, we will reject H0: p = .70 if the value of z is greater than zα = z.05 = 1.645. (Note that using this procedure is valid because np0 = 300(.70) = 210 and n(1 − p0) = 300(1 − .70) = 90 are both at least 5.) Suppose that when the sample is randomly selected, we find that Phantol provides relief for 231 of the 300 patients. Since the value of the test statistic is
z = (p̂ − p0)/√(p0(1 − p0)/n) = (231/300 − .70)/√(.70(.30)/300) = (.77 − .70)/.0265 = 2.65
Because z = 2.65 is greater than z.05 = 1.645, we reject H0: p = .70 in favor of Ha: p > .70. That is, we conclude (at an α of .05) that Phantol will provide relief for more than 70 percent of all patients suffering from viral upper respiratory infections. More specifically, the point estimate of p says that we estimate that Phantol will provide relief for 77 percent of all such patients. Comparing this estimate to the 70 percent of patients whose symptoms are relieved by Virol, we conclude that Phantol is somewhat more effective.
The p-value for testing H0: p = .70 versus Ha: p > .70 is the area under the standard normal curve to the right of z = 2.65. This p-value is (1.0 − .9960) = .004 (see Table A.3, page 863), and it provides very strong evidence against H0: p = .70 and in favor of Ha: p > .70. That is, we have very strong evidence that Phantol will provide relief for more than 70 percent of all patients suffering from viral upper respiratory infections.
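The Phantol calculation in Example 9.7 uses only the counts stated in the example, so it can be reproduced in a few lines of Python; a sketch:

from math import sqrt
from scipy.stats import norm

p0, n, relieved = 0.70, 300, 231              # proportion claimed under H0; patients relieved by Phantol
p_hat = relieved / n                          # 0.77

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # large-sample test statistic
p_value = norm.sf(z)                          # right-tail area for Ha: p > .70

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0 at alpha = .05:", z > norm.ppf(.95))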
EXAMPLE 9.8: The Electronic Article Surveillance Case
Suppose that a company selling electronic article surveillance devices claims that the proportion, p, of all consumers who would never shop in a store again if the store subjected them to a false alarm is no more than .05. A store considering installing such a device is concerned that p is greater than .05 and wishes to test
H0: p = .05 versus Ha: p > .05. To perform the hypothesis test, the store will calculate a p-value and use it to measure the weight of evidence against H0 and in favor of Ha. In an actual systematic sample, 40 out of 250 consumers said they would never shop in a store again if the store subjected them to a false alarm. Therefore, the sample proportion of lost consumers is p̂ = 40/250 = .16. Since np0 = 250(.05) = 12.5 and n(1 − p0) = 250(1 − .05) = 237.5 are both at least 5, we can use the test statistic z in the summary box. The value of the test statistic is
z = (p̂ − p0)/√(p0(1 − p0)/n) = (.16 − .05)/√(.05(.95)/250) = 7.98
Noting that Ha: p > .05 is of the form Ha: p > p0, the p-value is the area under the standard normal curve to the right of z = 7.98. The normal table tells us that the area under the standard normal curve to the right of 3.99 is (1.0 − .99997) = .00003. Therefore, the p-value is less than .00003 and provides extremely strong evidence against H0: p = .05 and in favor of Ha: p > .05. That is, we have extremely strong evidence that the proportion of all consumers who say they would never shop in a store again if the store subjected them to a false alarm is greater than .05. Furthermore, the point estimate p̂ = .16 says we estimate that the percentage of such consumers is 11 percent more than the 5 percent maximum claimed by the company selling the electronic article surveillance devices. A 95 percent confidence interval for p is
[p̂ ± z.025 √(p̂(1 − p̂)/n)] = [.16 ± 1.96 √(.16(.84)/250)] = [.16 ± .0454] = [.1146, .2054]
This interval says we are 95 percent confident that the percentage of consumers who would never shop in a store again if the store subjected them to a false alarm is between 6.46 percent and 15.54 percent more than the 5 percent maximum claimed by the company selling the electronic article surveillance devices. The rather large increases over the claimed 5 percent maximum implied by the point estimate and the confidence interval would mean substantially more lost customers and thus are practically important. Figure 9.11 gives the MegaStat output for testing H0: p = .05 versus Ha: p > .05. Note that this output includes a 95 percent confidence interval for p. Also notice that MegaStat expresses the p-value for this test in scientific notation. In general, when a p-value is less than .0001, MegaStat (and also Excel) express the p-value in scientific notation. Here the p-value of 7.77 E-16 says that we must move the decimal point 16 places to the left to obtain the decimal equivalent. That is, the p-value is .000000000000000777.
Figure 9.11: The MegaStat Output for Testing H0: p = .05 versus Ha: p > .05
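A sketch of the calculation summarized on the MegaStat output in Figure 9.11, including the 95 percent confidence interval discussed above (an illustration, not the MegaStat output itself). Note the design point: the test statistic uses p0 in its standard error, while the confidence interval uses p̂.

from math import sqrt
from scipy.stats import norm

p0, n, lost = 0.05, 250, 40
p_hat = lost / n                                              # 0.16

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)                    # standard error based on p0, as in the test
p_value = norm.sf(z)                                          # right-tail area for Ha: p > .05

half_width = norm.ppf(.975) * sqrt(p_hat * (1 - p_hat) / n)   # standard error based on p_hat, as in the CI
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
print(f"95% CI for p: ({p_hat - half_width:.4f}, {p_hat + half_width:.4f})")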

Exercises for Section 9.4
CONCEPTS

9.63 If we test a hypothesis to provide evidence supporting the claim that a majority of voters prefer a political candidate, explain the difference between p and p̂.
9.64 If we test a hypothesis to provide evidence supporting the claim that more than 30 percent of all consumers prefer a particular brand of beer, explain the difference between p and p̂.
9.65 If we test a hypothesis to provide evidence supporting the claim that fewer than 5 percent of the units produced by a process are defective, explain the difference between p and p̂.
9.66 What condition must be satisfied in order to appropriately use the methods of this section?
METHODS AND APPLICATIONS
9.67 For each of the following sample sizes and hypothesized values of the population proportion p, determine whether the sample size is large enough to use the large sample test about p given in this section:
a n = 400 and p0 = .5.
b n = 100 and p0 = .01.
c n = 10,000 and p0 = .01.
d n = 100 and p0 = .2.
e n = 256 and p0 = .7.
f n = 200 and p0 = .98.
g n = 1,000 and p0 = .98.
h n = 25 and p0 = .4.
9.68 Suppose we wish to test H0: p ≤ .8 versus Ha: p > .8 and that a random sample of n = 400 gives a sample proportion .
a Test H0 versus Ha at the .05 level of significance by using a critical value. What do you conclude?
b Find the p-value for this test.
c Use the p-value to test H0 versus Ha by setting α equal to .10, .05, .01, and .001. What do you conclude at each value of α?
9.69 Suppose we test H0: p = .3 versus Ha: p ≠ .3 and that a random sample of n = 100 gives a sample proportion .
a Test H0 versus Ha at the .01 level of significance by using a critical value. What do you conclude?
b Find the p-value for this test.
c Use the p-value to test H0 versus Ha by setting α equal to .10, .05, .01, and .001. What do you conclude at each value of α?
9.70 Suppose we are testing H0: p ≤ .5 versus Ha: p > .5, where p is the proportion of all beer drinkers who have tried at least one brand of “cold-filtered beer.” If a random sample of 500 beer drinkers has been taken and if p̂ equals .57, how many beer drinkers in the sample have tried at least one brand of “cold-filtered beer”?
9.71 THE MARKETING ETHICS CASE: CONFLICT OF INTEREST
Recall that a conflict of interest scenario was presented to a sample of 205 marketing researchers and that 111 of these researchers disapproved of the actions taken.
a Let p be the proportion of all marketing researchers who disapprove of the actions taken in the conflict of interest scenario. Set up the null and alternative hypotheses needed to attempt to provide evidence supporting the claim that a majority (more than 50 percent) of all marketing researchers disapprove of the actions taken.
b Assuming that the sample of 205 marketing researchers has been randomly selected, use critical values and the previously given sample information to test the hypotheses you set up in part a at the .10, .05, .01, and .001 levels of significance. How much evidence is there that a majority of all marketing researchers disapprove of the actions taken?
c Suppose a random sample of 1,000 marketing researchers reveals that 540 of the researchers disapprove of the actions taken in the conflict of interest scenario. Use critical values to determine how much evidence there is that a majority of all marketing researchers disapprove of the actions taken.
d Note that in parts b and c the sample proportion is (essentially) the same. Explain why the results of the hypothesis tests in parts b and c differ.
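For part d of Exercise 9.71, a quick numerical comparison makes the point: the two samples give essentially the same sample proportion, but the larger sample yields a larger z statistic and therefore a smaller p-value. A sketch:

from math import sqrt
from scipy.stats import norm

p0 = 0.50
for n, disapprove in [(205, 111), (1000, 540)]:            # parts b and c of Exercise 9.71
    p_hat = disapprove / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    print(f"n = {n}: p_hat = {p_hat:.3f}, z = {z:.2f}, p-value = {norm.sf(z):.4f}")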
9.72 Last year, television station WXYZ’s share of the 11 p.m. news audience was approximately equal to, but no greater than, 25 percent. The station’s management believes that the current audience share is higher than last year’s 25 percent share. In an attempt to substantiate this belief, the station surveyed a random sample of 400 11 p.m. news viewers and found that 146 watched WXYZ.
a Let p be the current proportion of all 11 p.m. news viewers who watch WXYZ. Set up the null and alternative hypotheses needed to attempt to provide evidence supporting the claim that the current audience share for WXYZ is higher than last year’s 25 percent share.
b Use critical values and the following MINITAB output to test the hypotheses you set up in part a at the .10, .05, .01, and .001 levels of significance. How much evidence is there that the current audience share is higher than last year’s 25 percent share?

c Find the p-value for the hypothesis test in part b. Use the p-value to carry out the test by setting α equal to .10, .05, .01, and .001. Interpret your results.
d Do you think that the result of the station’s survey has practical importance? Why or why not?
9.73 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil H. Firtle discuss a marketing research proposal to study day-after recall for a brand of mouthwash. To quote the authors:
The ad agency has developed a TV ad for the introduction of the mouthwash. The objective of the ad is to create awareness of the brand. The objective of this research is to evaluate the awareness generated by the ad measured by aided- and unaided-recall scores.
A minimum of 200 respondents who claim to have watched the TV show in which the ad was aired the night before will be contacted by telephone in 20 cities.
The study will provide information on the incidence of unaided and aided recall.
Suppose a random sample of 200 respondents shows that 46 of the people interviewed were able to recall the commercial without any prompting (unaided recall).
a In order for the ad to be considered successful, the percentage of unaided recall must be above the category norm for a TV commercial for the product class. If this norm is 18 percent, set up the null and alternative hypotheses needed to attempt to provide evidence that the ad is successful.
b Use the previously given sample information to compute the p-value for the hypothesis test you set up in part a. Use the p-value to carry out the test by setting α equal to .10, .05, .01, and .001. How much evidence is there that the TV commercial is successful?
c Do you think the result of the ad agency’s survey has practical importance? Explain your opinion.
9.74 Quality Progress, February 2005, reports on the results achieved by Bank of America in improving customer satisfaction and customer loyalty by listening to the ‘voice of the customer’. A key measure of customer satisfaction is the response on a scale from 1 to 10 to the question: “Considering all the business you do with Bank of America, what is your overall satisfaction with Bank of America?”4 Suppose that a random sample of 350 current customers results in 195 customers with a response of 9 or 10 representing ‘customer delight.’
a Let p denote the true proportion of all current Bank of America customers who would respond with a 9 or 10, and note that the historical proportion of customer delight for Bank of America has been .48. Calculate the p-value for testing H0: p = .48 versus Ha: p > .48. How much evidence is there that p exceeds .48?
b Bank of America has a base of nearly 30 million customers. Do you think that the sample results have practical importance? Explain your opinion.
9.75 The manufacturer of the ColorSmart-5000 television set claims that 95 percent of its sets last at least five years without needing a single repair. In order to test this claim, a consumer group randomly selects 400 consumers who have owned a ColorSmart-5000 television set for five years. Of these 400 consumers, 316 say that their ColorSmart-5000 television sets did not need repair, while 84 say that their ColorSmart-5000 television sets did need at least one repair.
a Letting p be the proportion of ColorSmart-5000 television sets that last five years without a single repair, set up the null and alternative hypotheses that the consumer group should use to attempt to show that the manufacturer’s claim is false.
b Use critical values and the previously given sample information to test the hypotheses you set up in part a by setting α equal to .10, .05, .01, and .001. How much evidence is there that the manufacturer’s claim is false?
c Do you think the results of the consumer group’s survey have practical importance? Explain your opinion.
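Exercise 9.75 involves a “less than” alternative, so the p-value is a left-tail area; a sketch using the counts given above:

from math import sqrt
from scipy.stats import norm

p0, n, no_repair = 0.95, 400, 316
p_hat = no_repair / n                         # 0.79

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # a large negative z discredits H0 in favor of Ha: p < .95
p_value = norm.cdf(z)                         # left-tail area

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.3g}")
for alpha in (.10, .05, .01, .001):
    print(f"alpha = {alpha}: reject H0: {z < norm.ppf(alpha)}")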
9.5: Type II Error Probabilities and Sample Size Determination (Optional)


As we have seen, we usually take action (for example, advertise a claim) on the basis of having rejected the null hypothesis. In this case, we know the chances that the action has been taken erroneously because we have prespecified α, the probability of rejecting a true null hypothesis. However, sometimes we must act (for example, use a day’s production of camshafts to make V6 engines) on the basis of not rejecting the null hypothesis. If we must do this, it is best to know the probability of not rejecting a false null hypothesis (a Type II error). If this probability is not small enough, we may change the hypothesis testing procedure. In order to discuss this further, we must first see how to compute the probability of a Type II error.
As an example, the Federal Trade Commission (FTC) often tests claims that companies make about their products. Suppose coffee is being sold in cans that are labeled as containing three pounds, and also suppose that the FTC wishes to determine if the mean amount of coffee μ in all such cans is at least three pounds. To do this, the FTC tests H0: μ ≥ 3 (or μ = 3) versus Ha: μ < 3 by setting α = .05. Suppose that a sample of 35 coffee cans yields x̄ = 2.9973. Assuming that σ equals .0147, we see that because z = (2.9973 − 3)/(.0147/√35) = −1.09 is not less than −z.05 = −1.645, we cannot reject H0: μ ≥ 3 by setting α = .05. Since we cannot reject H0, we cannot have committed a Type I error, which is the error of rejecting a true H0. However, we might have committed a Type II error, which is the error of not rejecting a false H0. Therefore, before we make a final conclusion about μ, we should calculate the probability of a Type II error.
A Type II error is not rejecting H0: μ ≥ 3 when H0 is false. Because any value of μ that is less than 3 makes H0 false, there is a different Type II error (and, therefore, a different Type II error probability) associated with each value of μ that is less than 3. In order to demonstrate how to calculate these probabilities, we will calculate the probability of not rejecting H0: μ ≥ 3 when in fact μ equals 2.995. This is the probability of failing to detect an average underfill of .005 pounds. For a fixed sample size (for example, n = 35 coffee can fills), the value of β, the probability of a Type II error, depends upon how we set α, the probability of a Type I error. Since we have set α = .05, we reject H0 if z = (x̄ − 3)/(.0147/√35) is less than −1.645 or, equivalently, if x̄ is less than 3 − 1.645(.0147/√35) = 2.99591. Therefore, we do not reject H0 if x̄ ≥ 2.99591. It follows that β, the probability of not rejecting H0: μ ≥ 3 when μ equals 2.995, is
β = P(x̄ ≥ 2.99591 when μ = 2.995) = P(Z ≥ (2.99591 − 2.995)/(.0147/√35)) = P(Z ≥ .37) = 1 − .6443 = .3557
This calculation is illustrated in Figure 9.12. Similarly, it follows that β, the probability of not rejecting H0: μ ≥ 3 when μ equals 2.99, is
β = P(x̄ ≥ 2.99591 when μ = 2.99) = P(Z ≥ (2.99591 − 2.99)/(.0147/√35)) = P(Z ≥ 2.38) = 1 − .9913 = .0087
Figure 9.12: Calculating β When μ Equals 2.995
It also follows that β, the probability of not rejecting H0: μ ≥ 3 when μ equals 2.985, is
β = P(x̄ ≥ 2.99591 when μ = 2.985) = P(Z ≥ (2.99591 − 2.985)/(.0147/√35)) = P(Z ≥ 4.39)
This probability is less than .00003 (because z is greater than 3.99). In Figure 9.13 we illustrate the values of β that we have calculated. Notice that the closer an alternative value of μ is to 3 (the value specified by H0: μ = 3), the larger is the associated value of β. Although alternative values of μ that are closer to 3 have larger associated probabilities of Type II errors, these values of μ have associated Type II errors with less serious consequences. For example, we are more likely to not reject H0: μ = 3 when μ = 2.995 (β = .3557) than we are to not reject H0: μ = 3 when μ = 2.99 (β = .0087). However, not rejecting H0: μ = 3 when μ = 2.995, which means that we are failing to detect an average underfill of .005 pounds, is less serious than not rejecting H0: μ = 3 when μ = 2.99, which means that we are failing to detect a larger average underfill of .01 pounds. In order to decide whether a particular hypothesis test adequately controls the probability of a Type II error, we must determine which Type II errors are serious, and then we must decide whether the probabilities of these errors are small enough. For example, suppose that the FTC and the coffee producer agree that failing to reject H0: μ = 3 when μ equals 2.99 is a serious error, but that failing to reject H0: μ = 3 when μ equals 2.995 is not a particularly serious error.
Then, since the probability of not rejecting H0: μ = 3 when μ equals 2.99, which is .0087, is quite small, we might decide that the hypothesis test adequately controls the probability of a Type II error. To understand the implication of this, recall that the sample of 35 coffee cans, which has , does not provide enough evidence to reject H0: μ ≥ 3 by setting α = .05. We have just shown that the probability that we have failed to detect a serious underfill is quite small (.0087), so the FTC might decide that no action should be taken against the coffee producer. Of course, this decision should also be based on the variability of the fills of the individual cans. Because and σ = .0147, we estimate that 99.73 percent of all individual coffee can fills are contained in the interval If the FTC believes it is reasonable to accept fills as low as (but no lower than) 2.9532 pounds, this evidence also suggests that no action against the coffee producer is needed. Figure 9.13: How β Changes as the Alternative Value of μ Changes Suppose, instead, that the FTC and the coffee producer had agreed that failing to reject H0: μ ≥ 3 when μ equals 2.995 is a serious mistake. The probability of this Type II error, which is .3557, is large. Therefore, we might conclude that the hypothesis test is not adequately controlling the probability of a serious Type II error. In this case, we have two possible courses of action. First, we have previously said that, for a fixed sample size, the lower we set α, the higher is β, and the higher we set α, the lower is β. Therefore, if we keep the sample size fixed at n = 35 coffee cans, we can reduce β by increasing α. To demonstrate this, suppose we increase α to .10. In this case we reject H0 if or, equivalently, if Therefore, we do not reject H0 if It follows that β, the probability of not rejecting H0: μ ≥ 3  when  μ equals 2.995, is We thus see that increasing α from .05 to .10 reduces β from .3557 to .2327. However, β is still too large, and, besides, we might not be comfortable making α larger than .05. Therefore, if we wish to decrease β and maintain α at .05, we must increase the sample size. We will soon present a formula we can use to find the sample size needed to make both α and β as small as we wish. Once we have computed β, we can calculate what we call the power of the test. The power of a statistical test is the probability of rejecting the null hypothesis when it is false. Just as β depends upon the alternative value of μ, so does the power of a test. In general, the power associated with a particular alternative value of μ equals 1 − β, where β is the probability of a Type II error associated with the same alternative value of μ. For example, we have seen that, when we set α = .05, the probability of not rejecting H0: μ ≥ 3 when μ equals 2.99 is .0087. Therefore, the power of the test associated with the alternative value 2.99 (that is, the probability of rejecting H0: μ ≥ 3 when μ equals 2.99) is 1 − .0087 = .9913. Thus far we have demonstrated how to calculate β when testing a less than alternative hypothesis. In the following box we present (without proof) a method for calculating the probability of a Type II error when testing a less than, a greater than, or a not equal to alternative hypothesis: Calculating the Probability of a Type II Error Assume that the sampled population is normally distributed, or that a large sample will be taken. Consider testing H0: μ = μ0 versus one of Ha: μ > μ0, Ha: μ < μ0, or Ha: μ ≠ μ0. 
Then, if we set the probability of a Type I error equal to α and randomly select a sample of size n, the probability, β, of a Type II error corresponding to the alternative value μa of μ is (exactly or approximately) equal to the area under the standard normal curve to the left of z* − |μ0 − μa|/(σ/√n). Here z* equals zα if the alternative hypothesis is one-sided (μ > μ0 or μ < μ0), in which case the method for calculating β is exact. Furthermore, z* equals zα/2 if the alternative hypothesis is two-sided (μ ≠ μ0), in which case the method for calculating β is approximate.
EXAMPLE 9.9: The Valentine’s Day Chocolate Case
In the Valentine’s Day chocolate case we are testing H0: μ = 330 versus Ha: μ ≠ 330 by setting α = .05. We have seen that the mean of the reported order quantities of a random sample of n = 100 large retail stores is . Assuming that σ equals 40, it follows that because is between −z.025 = −1.96 and z.025 = 1.96, we cannot reject H0: μ = 330 by setting α = .05. Since we cannot reject H0, we might have committed a Type II error. Suppose that the candy company decides that failing to reject H0: μ = 330 when μ differs from 330 by as many as 15 valentine boxes (that is, when μ is 315 or 345) is a serious Type II error. Because we have set α equal to .05, β for the alternative value μa = 315 (that is, the probability of not rejecting H0: μ = 330 when μ equals 315) is the area under the standard normal curve to the left of z* − |μ0 − μa|/(σ/√n) = 1.96 − |330 − 315|/(40/√100) = 1.96 − 3.75 = −1.79. Here z* = zα/2 = z.05/2 = z.025 since the alternative hypothesis (μ ≠ 330) is two-sided. The area under the standard normal curve to the left of −1.79 is 1 − .9633 = .0377. Therefore, β for the alternative value μa = 315 is .0377. Similarly, it can be verified that β for the alternative value μa = 345 is .0377. It follows, because we cannot reject H0: μ = 330 by setting α = .05, and because we have just shown that there is a reasonably small (.0377) probability that we have failed to detect a serious (that is, a 15 valentine box) deviation of μ from 330, that it is reasonable for the candy company to base this year’s production of valentine boxes on the projected mean order quantity of 330 boxes per large retail store.
In the following box we present (without proof) a formula that tells us the sample size needed to make both the probability of a Type I error and the probability of a Type II error as small as we wish:
Calculating the Sample Size Needed to Achieve Specified Values of α and β
Assume that the sampled population is normally distributed, or that a large sample will be taken. Consider testing H0: μ = μ0 versus one of Ha: μ > μ0, Ha: μ < μ0, or Ha: μ ≠ μ0. Then, in order to make the probability of a Type I error equal to α and the probability of a Type II error corresponding to the alternative value μa of μ equal to β, we should take a sample of size n = (z* + zβ)²σ²/(μ0 − μa)². Here z* equals zα if the alternative hypothesis is one-sided (μ > μ0 or μ < μ0), and z* equals zα/2 if the alternative hypothesis is two-sided (μ ≠ μ0). Also, zβ is the point on the scale of the standard normal curve that gives a right-hand tail area equal to β.
EXAMPLE 9.10
Again consider the coffee fill example and suppose we wish to test H0: μ ≥ 3 (or μ = 3) versus Ha: μ < 3. If we wish α to be .05 and β for the alternative value μa = 2.995 of μ to be .05, we should take a sample of size n = (z* + zβ)²σ²/(μ0 − μa)² = (1.645 + 1.645)²(.0147)²/(3 − 2.995)² = 93.56, which we round up to 94 cans. Here, z* = zα = z.05 = 1.645 because the alternative hypothesis (μ < 3) is one-sided, and zβ = z.05 = 1.645. Although we have set both α and β equal to the same value in the coffee fill situation, it is not necessary for α and β to be equal.
As an example, again consider the Valentine’s Day chocolate case, in which we are testing H0: μ = 330 versus Ha: μ ≠ 330. Suppose that the candy company decides that failing to reject H0: μ = 330 when μ differs from 330 by as many as 15 valentine boxes (that is, when μ is 315 or 345) is a serious Type II error. Furthermore, suppose that it is also decided that this Type II error is more serious than a Type I error. Therefore, α will be set equal to .05 and β for the alternative value μa = 315 (or μa = 345) of μ will be set equal to .01. It follows that the candy company should take a sample of size

n = ((z* + zβ)² σ²)/(μ0 − μa)² = ((1.96 + 2.326)²(40)²)/(330 − 315)² = 130.63

which we round up to a sample of 131 large retail stores. Here, z* = zα/2 = z.05/2 = z.025 = 1.96 because the alternative hypothesis (μ ≠ 330) is two-sided, and zβ = z.01 = 2.326 (see the bottom row of the t table on page 865).

To conclude this section, we point out that the methods we have presented for calculating the probability of a Type II error and determining sample size can be extended to other hypothesis tests that utilize the normal distribution. We will not, however, present the extensions in this book.

Exercises for Section 9.5

CONCEPTS

9.76 We usually take action on the basis of having rejected the null hypothesis. When we do this, we know the chances that the action has been taken erroneously because we have prespecified α, the probability of rejecting a true null hypothesis. Here, it is obviously important to know (prespecify) α, the probability of a Type I error. When is it important to know the probability of a Type II error? Explain why.

9.77 Explain why we are able to compute many different values of β, the probability of a Type II error, for a single hypothesis test.

9.78 Explain what is meant by
a A serious Type II error.
b The power of a statistical test.

9.79 In general, do we want the power corresponding to a serious Type II error to be near 0 or near 1? Explain.

METHODS AND APPLICATIONS

9.80 Again consider the Consolidated Power waste water situation. Remember that the power plant will be shut down and corrective action will be taken on the cooling system if the null hypothesis H0: μ ≤ 60 is rejected in favor of Ha: μ > 60. In this exercise we calculate probabilities of various Type II errors in the context of this situation.
a Recall that Consolidated Power’s hypothesis test is based on a sample of n = 100 temperature readings and assume that σ equals 2. If the power company sets α = .025, calculate the probability of a Type II error for each of the following alternative values of μ: 60.1, 60.2, 60.3, 60.4, 60.5, 60.6, 60.7, 60.8, 60.9, 61.
b If we want the probability of making a Type II error when μ equals 60.5 to be very small, is Consolidated Power’s hypothesis test adequate? Explain why or why not. If not, and if we wish to maintain the value of α at .025, what must be done?
c The power curve for a statistical test is a plot of the power = 1 − β on the vertical axis versus values of μ that make the null hypothesis false on the horizontal axis. Plot the power curve for Consolidated Power’s test of H0: μ ≤ 60 versus Ha: μ > 60 by plotting power = 1 − β for each of the alternative values of μ in part a. What happens to the power of the test as the alternative value of μ moves away from 60?
9.81 Again consider the automobile parts supplier situation. Remember that a problem-solving team will be assigned to rectify the process producing the cylindrical engine parts if the null hypothesis H0: μ = 3 is rejected in favor of Ha: μ ≠ 3. In this exercise we calculate probabilities of various Type II errors in the context of this situation.
a Suppose that the parts supplier’s hypothesis test is based on a sample of n = 100 diameters and that σ equals .023. If the parts supplier sets α = .05, calculate the probability of a Type II error for each of the following alternative values of μ: 2.990, 2.995, 3.005, 3.010.
b If we want the probabilities of making a Type II error when μ equals 2.995 and when μ equals 3.005 to both be very small, is the parts supplier’s hypothesis test adequate? Explain why or why not. If not, and if we wish to maintain the value of α at .05, what must be done?
c Plot the power of the test versus the alternative values of μ in part a. What happens to the power of the test as the alternative value of μ moves away from 3?
9.82 In the Consolidated Power hypothesis test of H0: μ ≤ 60 versus Ha: μ > 60 (as discussed in Exercise 9.80) find the sample size needed to make the probability of a Type I error equal to .025 and the probability of a Type II error corresponding to the alternative value μa = 60.5 equal to .025. Here, assume σ equals 2.
9.83 In the automobile parts supplier’s hypothesis test of H0: μ = 3 versus Ha: μ ≠ 3 (as discussed in Exercise 9.81) find the sample size needed to make the probability of a Type I error equal to .05 and the probability of a Type II error corresponding to the alternative value μa = 3.005 equal to .05. Here, assume σ equals .023.
9.6: The Chi-Square Distribution (Optional)
Sometimes we can make statistical inferences by using the
chi-square distribution. The probability curve of the χ2 (pronounced chi-square) distribution is skewed to the right. Moreover, the exact shape of this probability curve depends on a parameter that is called the number of degrees of freedom (denoted df). Figure 9.14 illustrates chi-square distributions having 2, 5, and 10 degrees of freedom.
Figure 9.14: Chi-Square Distributions with 2, 5, and 10 Degrees of Freedom


In order to use the chi-square distribution, we employ a chi-square point, which is denoted χ²α. As illustrated in the upper portion of Figure 9.15, χ²α is the point on the horizontal axis under the curve of the chi-square distribution that gives a right-hand tail area equal to α. The value of χ²α in a particular situation depends on the right-hand tail area α and the number of degrees of freedom (df) of the chi-square distribution. Values of χ²α are tabulated in a chi-square table. Such a table is given in Table A.17 of Appendix A (pages 877–878); a portion of this table is reproduced as Table 9.4. Looking at the chi-square table, the rows correspond to the appropriate number of degrees of freedom (values of which are listed down the right side of the table), while the columns designate the right-hand tail area α. For example, suppose we wish to find the chi-square point that gives a right-hand tail area of .05 under a chi-square curve having 5 degrees of freedom. To do this, we look in Table 9.4 at the row labeled 5 and the column labeled χ².05. We find that this point is χ².05 = 11.0705 (see the lower portion of Figure 9.15).
Figure 9.15: Chi-Square Points

Table 9.4: A Portion of the Chi-Square Table
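A statistics package will return the same chi-square points that appear in Table 9.4. As a minimal illustration (assuming the SciPy library is available), the sketch below looks up the point that gives a right-hand tail area of .05 under a chi-square curve having 5 degrees of freedom.

from scipy.stats import chi2

df = 5
alpha = 0.05
# ppf takes a left-tail (cumulative) probability, so a right-hand tail
# area of alpha corresponds to a cumulative probability of 1 - alpha.
point = chi2.ppf(1 - alpha, df)
print(round(point, 4))   # 11.0705, matching Table 9.4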

9.7: Statistical Inference for a Population Variance (Optional)


A vital part of a V6 automobile engine is the engine camshaft. As the camshaft turns, parts of the camshaft make repeated contact with engine lifters and thus must have the appropriate hardness to wear properly. To harden the camshaft, a heat treatment process is used, and a hardened layer is produced on the surface of the camshaft. The depth of the layer is called the hardness depth of the camshaft. Suppose that an automaker knows that the mean and the variance of the camshaft hardness depths produced by its current heat treatment process are, respectively, 4.5 mm and .2209 mm². To reduce the variance of the camshaft hardness depths, a new heat treatment process is designed, and a random sample of n = 30 camshaft hardness depths produced by using the new process has a variance of s2 = .0885. In order to attempt to show that the variance, σ2, of the population of all camshaft hardness depths that would be produced by using the new process is less than .2209, we can use the following result:
Statistical Inference for a Population Variance
Suppose that s2 is the variance of a sample of n measurements randomly selected from a normally distributed population having variance σ2. The sampling distribution of the statistic (n − 1)s2/σ2 is a chi-square distribution having n − 1 degrees of freedom. This implies that

1 A 100(1 − α) percent confidence interval for σ2 is

[(n − 1)s2/χ²α/2 , (n − 1)s2/χ²1−(α/2)]

Here χ²α/2 and χ²1−(α/2) are the points under the curve of the chi-square distribution having n − 1 degrees of freedom that give right-hand tail areas of, respectively, α/2 and 1 − (α/2).
2 We can test H0: σ2 = σ₀² (where σ₀² is a specified value) by using the test statistic

χ² = (n − 1)s2/σ₀²

Specifically, if we set the probability of a Type I error equal to α, then we can reject H0 in favor of

a Ha: σ2 > σ₀² if χ² > χ²α

b Ha: σ2 < σ₀² if χ² < χ²1−α

c Ha: σ2 ≠ σ₀² if χ² > χ²α/2 or if χ² < χ²1−(α/2)

Here χ²α, χ²1−α, χ²α/2, and χ²1−(α/2) are based on n − 1 degrees of freedom.
The assumption that the sampled population is normally distributed must hold fairly closely for the statistical inferences just given about σ2 to be valid. When we check this assumption in the camshaft situation, we find that a histogram (not given here) of the sample of n = 30 hardness depths is bell-shaped and symmetrical. In order to compute a 95 percent confidence interval for σ2, we note that Table A.17 (pages 877 and 878) tells us that the points χ².025 and χ².975, based on n − 1 = 29 degrees of freedom, are 45.7222 and 16.0471 (see Figure 9.16). It follows that a 95 percent confidence interval for σ2 is

[(n − 1)s2/χ².025, (n − 1)s2/χ².975] = [(29)(.0885)/45.7222, (29)(.0885)/16.0471] = [.0561, .1599]
Figure 9.16: The Chi-Square Points χ².025 and χ².975

This interval provides strong evidence that σ2 is less than .2209.
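The interval just computed, and the chi-square test carried out next, can be reproduced with a few lines of code. The sketch below is illustrative only (it assumes the SciPy library is available) and uses the camshaft sample results n = 30 and s2 = .0885.

from scipy.stats import chi2

n, s2 = 30, 0.0885
df = n - 1
alpha = 0.05

# 95 percent confidence interval for the population variance
lower = df * s2 / chi2.ppf(1 - alpha / 2, df)   # divides by the chi-square .025 point, 45.7222
upper = df * s2 / chi2.ppf(alpha / 2, df)       # divides by the chi-square .975 point, 16.0471
print(round(lower, 4), round(upper, 4))          # about .0561 and .1599

# Test H0: sigma^2 = .2209 versus Ha: sigma^2 < .2209 at alpha = .05
sigma0_sq = 0.2209
chi_sq = df * s2 / sigma0_sq                     # 11.6184
critical = chi2.ppf(alpha, df)                   # 17.7084; reject H0 if chi_sq < critical
print(round(chi_sq, 4), round(critical, 4), chi_sq < critical)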
If we wish to use a hypothesis test, we test the null hypothesis H0: σ2 = .2209 versus the alternative hypothesis Ha: σ2 < .2209. If H0 can be rejected in favor of Ha at the .05 level of significance, we will conclude that the new process has reduced the variance of the camshaft hardness depths. Since the histogram of the sample of n = 30 hardness depths is bell-shaped and symmetrical, the appropriate test statistic is given in the summary box. Furthermore, since Ha: σ2 < .2209 is of the form Ha: σ2 < σ₀², we should reject H0: σ2 = .2209 if the value of χ² is less than the critical value χ²1−α = χ².95 = 17.7084. Here χ².95 is based on n − 1 = 30 − 1 = 29 degrees of freedom, and this critical value is illustrated in Figure 9.17. Since the sample variance is s2 = .0885, the value of the test statistic is

χ² = (n − 1)s2/σ₀² = (29)(.0885)/.2209 = 11.6184

Figure 9.17: Testing H0: σ2 = .2209 versus Ha: σ2 < .2209 by Setting α = .05

Since χ² = 11.6184 is less than χ².95 = 17.7084, we reject H0: σ2 = .2209 in favor of Ha: σ2 < .2209. That is, we conclude (at an α of .05) that the new process has reduced the variance of the camshaft hardness depths.

Exercises for Sections 9.6 and 9.7

CONCEPTS

9.84 What assumption must hold to use the chi-square distribution to make statistical inferences about a population variance?

9.85 Define the meaning of the chi-square points χ²α/2 and χ²1−(α/2). Hint: Draw a picture.

9.86 Give an example of a situation in which we might wish to compute a confidence interval for σ2.

METHODS AND APPLICATIONS

Exercises 9.87 through 9.90 relate to the following situation: Consider an engine parts supplier and suppose the supplier has determined that the variance of the population of all cylindrical engine part outside diameters produced by the current machine is approximately equal to, but no less than, .0005. To reduce this variance, a new machine is designed, and a random sample of n = 25 outside diameters produced by this new machine has a variance of s2 = .00014. Assume the population of all cylindrical engine part outside diameters that would be produced by the new machine is normally distributed, and let σ2 denote the variance of this population.

9.87 Find a 95 percent confidence interval for σ2.

9.88 Test H0: σ2 = .0005 versus Ha: σ2 < .0005 by setting α = .05.

9.89 Find a 99 percent confidence interval for σ2.

9.90 Test H0: σ2 = .0005 versus Ha: σ2 ≠ .0005 by setting α = .01.

Chapter Summary

We began this chapter by learning about the two hypotheses that make up the structure of a hypothesis test. The null hypothesis is the statement being tested. Usually it represents the status quo, and it is not rejected unless there is convincing sample evidence that it is false. The alternative, or research, hypothesis is a statement that is accepted only if there is convincing sample evidence that it is true and that the null hypothesis is false. In some situations, the alternative hypothesis is a condition for which we need to attempt to find supportive evidence. We also learned that two types of errors can be made in a hypothesis test. A Type I error occurs when we reject a true null hypothesis, and a Type II error occurs when we do not reject a false null hypothesis.

We studied two commonly used ways to conduct a hypothesis test. The first involves comparing the value of a test statistic with what is called a critical value, and the second employs what is called a p-value. The p-value measures the weight of evidence against the null hypothesis. The smaller the p-value, the more we doubt the null hypothesis.
We learned that, if we can reject the null hypothesis with the probability of a Type I error equal to α, then we say that the test result has statistical significance at the α level. However, we also learned that, even if the result of a hypothesis test tells us that statistical significance exists, we must carefully assess whether the result is practically important. One good way to do this is to use a point estimate and confidence interval for the parameter of interest.

The specific hypothesis tests we covered in this chapter all dealt with a hypothesis about one population parameter. First, we studied a test about a population mean that is based on the assumption that the population standard deviation σ is known. This test employs the normal distribution. Second, we studied a test about a population mean that assumes that σ is unknown. We learned that this test is based on the t distribution. Figure 9.18 presents a flowchart summarizing how to select an appropriate test statistic to test a hypothesis about a population mean. Then we presented a test about a population proportion that is based on the normal distribution. Next (in optional Section 9.5) we studied Type II error probabilities, and we showed how we can find the sample size needed to make both the probability of a Type I error and the probability of a serious Type II error as small as we wish. We concluded this chapter by discussing (in optional Sections 9.6 and 9.7) the chi-square distribution and its use in making statistical inferences about a population variance.

Figure 9.18: Selecting an Appropriate Test Statistic to Test a Hypothesis about a Population Mean

Glossary of Terms

alternative (research) hypothesis: A statement that will be accepted only if there is convincing sample evidence that it is true. Sometimes it is a condition for which we need to attempt to find supportive evidence. (page 347)

chi-square distribution: A useful continuous probability distribution. Its probability curve is skewed to the right, and the exact shape of the probability curve depends on the number of degrees of freedom associated with the curve. (page 382)

critical value: The value of the test statistic is compared with a critical value in order to decide whether the null hypothesis can be rejected. (pages 354, 358, 360)

greater than alternative: An alternative hypothesis that is stated as a greater than ( > ) inequality. (page 349)
less than alternative: An alternative hypothesis that is stated as a less than ( < ) inequality. (page 349)

not equal to alternative: An alternative hypothesis that is stated as a not equal to ( ≠ ) inequality. (page 349)

null hypothesis: The statement being tested in a hypothesis test. It usually represents the status quo and it is not rejected unless there is convincing sample evidence that it is false. (page 347)

one-sided alternative hypothesis: An alternative hypothesis that is stated as either a greater than ( > ) or a less than ( < ) inequality. (page 349)

power (of a statistical test): The probability of rejecting the null hypothesis when it is false. (page 379)

p-value (probability value): The probability, computed assuming that the null hypothesis is true, of observing a value of the test statistic that is at least as extreme as the value actually computed from the sample data. The p-value measures how much doubt is cast on the null hypothesis by the sample data. The smaller the p-value, the more we doubt the null hypothesis. (pages 355, 358, 360, 362)

statistical significance at the α level: When we can reject the null hypothesis by setting the probability of a Type I error equal to α. (page 354)

test statistic: A statistic computed from sample data in a hypothesis test. It is either compared with a critical value or used to compute a p-value. (page 349)

two-sided alternative hypothesis: An alternative hypothesis that is stated as a not equal to ( ≠ ) inequality. (page 349)

Type I error: Rejecting a true null hypothesis. (page 350)

Type II error: Failing to reject a false null hypothesis. (page 350)

Important Formulas and Tests

Hypothesis Testing steps: page 357
A hypothesis test about a population mean (σ known): page 361
A t test about a population mean (σ unknown): page 366
A large sample hypothesis test about a population proportion: page 371
Calculating the probability of a Type II error: page 379
Sample size determination to achieve specified values of α and β: page 380
Statistical inference about a population variance: page 383

Supplementary Exercises

9.91 The auditor for a large corporation routinely monitors cash disbursements. As part of this process, the auditor examines check request forms to determine whether they have been properly approved. Improper approval can occur in several ways. For instance, the check may have no approval, the check request might be missing, the approval might be written by an unauthorized person, or the dollar limit of the authorizing person might be exceeded.
a Last year the corporation experienced a 5 percent improper check request approval rate. Since this was considered unacceptable, efforts were made to reduce the rate of improper approvals. Letting p be the proportion of all checks that are now improperly approved, set up the null and alternative hypotheses needed to attempt to demonstrate that the current rate of improper approvals is lower than last year’s rate of 5 percent.
b Suppose that the auditor selects a random sample of 625 checks that have been approved in the last month. The auditor finds that 18 of these 625 checks have been improperly approved. Use critical values and this sample information to test the hypotheses you set up in part a at the .10, .05, .01, and .001 levels of significance. How much evidence is there that the rate of improper approvals has been reduced below last year’s 5 percent rate?
c Find the p-value for the test of part b. Use the p-value to carry out the test by setting α equal to .10, .05, .01, and .001.
Interpret your results.
d Suppose the corporation incurs a $10 cost to detect and correct an improperly approved check. If the corporation disburses at least 2 million checks per year, does the observed reduction of the rate of improper approvals seem to have practical importance? Explain your opinion.

9.92 THE CIGARETTE ADVERTISEMENT CASE ModelAge
Recall that the cigarette industry requires that models in cigarette ads must appear to be at least 25 years old. Also recall that a sample of 50 people is randomly selected at a shopping mall. Each person in the sample is shown a “typical cigarette ad” and is asked to estimate the age of the model in the ad.
a Let μ be the mean perceived age estimate for all viewers of the ad, and suppose we consider the industry requirement to be met if μ is at least 25. Set up the null and alternative hypotheses needed to attempt to show that the industry requirement is not being met.
b Suppose that a random sample of 50 perceived age estimates gives a mean of years and a standard deviation of s = 3.596 years. Use these sample data and critical values to test the hypotheses of part a at the .10, .05, .01, and .001 levels of significance.
c How much evidence do we have that the industry requirement is not being met?
d Do you think that this result has practical importance? Explain your opinion.

9.93 THE CIGARETTE ADVERTISEMENT CASE ModelAge
Consider the cigarette ad situation discussed in Exercise 9.92. Using the sample information given in that exercise, the p-value for testing H0 versus Ha can be calculated to be .0057.
a Determine whether H0 would be rejected at each of α = .10, α = .05, α = .01, and α = .001.
b Describe how much evidence we have that the industry requirement is not being met.

9.94 In an article in the Journal of Retailing, Kumar, Kerwin, and Pereira study factors affecting merger and acquisition activity in retailing. As part of the study, the authors compare the characteristics of “target firms” (firms targeted for acquisition) and “bidder firms” (firms attempting to make acquisitions). Among the variables studied in the comparison were earnings per share, debt-to-equity ratio, growth rate of sales, market share, and extent of diversification.
a Let μ be the mean growth rate of sales for all target firms (firms that have been targeted for acquisition in the last five years and that have not bid on other firms), and assume growth rates are approximately normally distributed. Furthermore, suppose a random sample of 25 target firms yields a sample mean sales growth rate of with a standard deviation of s = 0.12. Use critical values and this sample information to test H0: μ ≤ .10 versus Ha: μ > .10 by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean growth rate of sales for target firms exceeds .10 (that is, exceeds 10 percent)?
b Now let μ be the mean growth rate of sales for all firms that are bidders (firms that have bid to acquire at least one other firm in the last five years), and again assume growth rates are approximately normally distributed. Furthermore, suppose a random sample of 25 bidders yields a sample mean sales growth rate of with a standard deviation of s = 0.09. Use critical values and this sample information to test H0: μ ≤ .10 versus Ha: μ > .10 by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean growth rate of sales for bidders exceeds .10 (that is, exceeds 10 percent)?
9.95 A consumer electronics firm has developed a new type of remote control button that is designed to operate longer before becoming intermittent. A random sample of 35 of the new buttons is selected and each is tested in continuous operation until becoming intermittent. The resulting lifetimes are found to have a sample mean of hours and a sample standard deviation of s = 110.8.
a Independent tests reveal that the mean lifetime (in continuous operation) of the best remote control button on the market is 1,200 hours. Letting μ be the mean lifetime of the population of all new remote control buttons that will or could potentially be produced, set up the null and alternative hypotheses needed to attempt to provide evidence that the new button’s mean lifetime exceeds the mean lifetime of the best remote button currently on the market.
b Using the previously given sample results, use critical values to test the hypotheses you set up in part a by setting α equal to .10, .05, .01, and .001. What do you conclude for each value of α?
c Suppose that and s = 110.8 had been obtained by testing a sample of 100 buttons. Use critical values to test the hypotheses you set up in part a by setting α equal to .10, .05, .01, and .001. Which sample (the sample of 35 or the sample of 100) gives a more statistically significant result? That is, which sample provides stronger evidence that Ha is true?
d If we define practical importance to mean that μ exceeds 1,200 by an amount that would be clearly noticeable to most consumers, do you think that the result has practical importance? Explain why the samples of 35 and 100 both indicate the same degree of practical importance.
e Suppose that further research and development effort improves the new remote control button and that a random sample of 35 buttons gives hours and s = 102.8 hours. Test your hypotheses of part a by setting α equal to .10, .05, .01, and .001.
(1) Do we have a highly statistically significant result? Explain.
(2) Do you think we have a practically important result? Explain.
9.96 Again consider the remote control button lifetime situation discussed in Exercise 9.95. Using the sample information given in the introduction to Exercise 9.95, the p-value for testing H0 versus Ha can be calculated to be .0174.
a Determine whether H0 would be rejected at each of α = .10, α = .05, α = .01, and α = .001.
b Describe how much evidence we have that the new button’s mean lifetime exceeds the mean lifetime of the best remote button currently on the market.
9.97 Calculate and use an appropriate 95 percent confidence interval to help evaluate practical importance as it relates to the hypothesis test in each of the following situations discussed in previous review exercises. Explain what you think each confidence interval says about practical importance.
a The check approval situation of Exercise 9.91.
b The cigarette ad situation of Exercise 9.92.
c The remote control button situation of Exercise 9.95, parts a, c, and e.
9.98 Several industries located along the Ohio River discharge a toxic substance called carbon tetrachloride into the river. The state Environmental Protection Agency monitors the amount of carbon tetrachloride pollution in the river. Specifically, the agency requires that the carbon tetrachloride contamination must average no more than 10 parts per million. In order to monitor the carbon tetrachloride contamination in the river, the agency takes a daily sample of 100 pollution readings at a specified location. If the mean carbon tetrachloride reading for this sample casts substantial doubt on the hypothesis that the average amount of carbon tetrachloride contamination in the river is at most 10 parts per million, the agency must issue a shutdown order. In the event of such a shutdown order, industrial plants along the river must be closed until the carbon tetrachloride contamination is reduced to a more acceptable level. Assume that the state Environmental Protection Agency decides to issue a shutdown order if a sample of 100 pollution readings implies that H0: μ ≤ 10 can be rejected in favor of Ha: μ > 10 by setting α = .01. If σ equals 2, calculate the probability of a Type II error for each of the following alternative values of μ: 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, and 11.0.
9.99 THE INVESTMENT CASE InvestRet
Suppose that random samples of 50 returns for each of the following investment classes give the indicated sample mean and sample standard deviation:

a For each investment class, set up the null and alternative hypotheses needed to test whether the current mean return differs from the historical (1970 to 1994) mean return given in Table 3.11 (page 159).
b Test each hypothesis you set up in part a at the .05 level of significance. What do you conclude? For which investment classes does the current mean return differ from the historical mean?
9.100 THE UNITED KINGDOM INSURANCE CASE
Assume that the U.K. insurance survey is based on 1,000 randomly selected United Kingdom households and that 640 of these households spent money to buy life insurance in 1993.
a If p denotes the proportion of all U.K. households that spent money to buy life insurance in 1993, set up the null and alternative hypotheses needed to attempt to justify the claim that more than 60 percent of U.K. households spent money to buy life insurance in 1993.
b Test the hypotheses you set up in part a by setting α = .10, .05, .01, and .001. How much evidence is there that more than 60 percent of U.K. households spent money to buy life insurance in 1993?
9.101 How safe are child car seats? Consumer Reports (May 2005) tested the safety of child car seats in 30 mph crashes. They found “slim safety margins” for some child car seats. Suppose that Consumer Reports simulates the safety of the market-leading child car seat. Their test consists of placing the maximum claimed weight in the car seat and simulating crashes at higher and higher miles per hour until a problem occurs. The following data identify the speed at which a problem with the car seat first appeared; such as the strap breaking, seat shell cracked, strap adjuster broke, detached from the base, etc.: 31.0, 29.4, 30.4, 28.9, 29.7, 30.1, 32.3, 31.7, 35.4, 29.1, 31.2, 30.2. Let μ denote the true mean speed at which a problem with the car seat first appears. The following MINITAB output gives the results of using the sample data to test H0: μ = 30 versus Ha: μ > 30. CarSeat

How much evidence is there that μ exceeds 30 mph?
9.102 Consumer Reports (January 2005) indicates that profit margins on extended warranties are much greater than on the purchase of most products.5 In this exercise we consider a major electronics retailer that wishes to increase the proportion of customers who buy extended warranties on digital cameras. Historically, 20 percent of digital camera customers have purchased the retailer’s extended warranty. To increase this percentage, the retailer has decided to offer a new warranty that is less expensive and more comprehensive. Suppose that three months after starting to offer the new warranty, a random sample of 500 customer sales invoices shows that 152 out of 500 digital camera customers purchased the new warranty. Letting p denote the proportion of all digital camera customers who have purchased the new warranty, calculate the p-value for testing H0: p = .20 versus Ha: p > .20. How much evidence is there that p exceeds .20? Does the difference between the sample proportion (152/500 = .304) and .20 seem to be practically important? Explain your opinion.
9.103 Fortune magazine has periodically reported on the rise of fees and expenses charged by stock funds.
a Suppose that 10 years ago the average annual expense for stock funds was 1.19 percent. Let μ be the current mean annual expense for all stock funds, and assume that stock fund annual expenses are approximately normally distributed. If a random sample of 12 stock funds gives a sample mean annual expense of with a standard deviation of s = .31%, use critical values and this sample information to test H0: μ ≤ 1.19% versus Ha: μ > 1.19% by setting α equal to .10, .05, .01, and .001. How much evidence is there that the current mean annual expense for stock funds exceeds the average of 10 years ago?
b Do you think that the result in part a has practical importance? Explain your opinion.
9.104: Internet Exercise
Are American consumers comfortable using their credit cards to make purchases over the Internet? Suppose that a noted authority suggests that credit cards will be firmly established on the Internet once the 80 percent barrier is broken; that is, as soon as more than 80 percent of those who make purchases over the Internet are willing to use a credit card to pay for their transactions. A recent Gallup Poll (story, survey results, and analysis can be found at http://www.gallup.com/poll/releases/pr000223.asp) found that, out of n = 302 Internet purchasers surveyed, 267 have paid for Internet purchases using a credit card. Based on the results of the Gallup survey, is there sufficient evidence to conclude that the proportion of Internet purchasers willing to use a credit card now exceeds 0.80? Set up the appropriate null and alternative hypotheses, test at the 0.05 and 0.01 levels of significance, and calculate a p-value for your test.
Go to the Gallup Organization website (http://www.gallup.com) and find the index of recent poll results (http://www.gallup.com/poll/index.asp). Select an interesting current poll and prepare a brief written summary of the poll or some aspect thereof. Include a statistical test for the significance of a proportion (you may have to make up your own value for the hypothesized proportion p0) as part of your report. For example, you might select a political poll and test whether a particular candidate is preferred by a majority of voters (p > 0.50).
Appendix 9.1: One-Sample Hypothesis Testing Using MINITAB
The first instruction block in this section begins by describing the entry of data into the MINITAB data window. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of the instruction block. Please refer to Appendix 1.1 for further information about entering data, saving data, and printing results when using MINITAB.
Hypothesis test for a population mean in Figure 9.9(a) on page 368 (data file: CreditCd.MTW):
• In the Data window, enter the interest rate data from Table 9.3 (page 367) into a single column with variable name Rate.
• Select Stat: Basic Statistics : 1-Sample t.
• In the “1-Sample t (Test and Confidence Interval)” dialog box, select the “Samples in columns” option.
• Select the variable name Rate into the “Samples in columns” window.
• Place a checkmark in the “Perform hypothesis test” checkbox.
• Enter the hypothesized mean (here 18.8) into the “Hypothesized mean” window.
• Click the Options… button, select the desired alternative (in this case “less than”) from the Alternative drop-down menu, and click OK in the “1-Sample t-Options” dialog box.
• To produce a boxplot of the data with a graphical representation of the hypothesis test, click the Graphs… button in the “1-Sample t (Test and Confidence Interval)” dialog box, check the “Boxplot of data” checkbox, and click OK in the “1-Sample t—Graphs” dialog box.
• Click OK in the “1-Sample t (Test and Confidence Interval)” dialog box.
• The confidence interval is given in the Session window, and the boxplot is displayed in a graphics window.
A “1-Sample Z” test is also available in MINITAB under Basic Statistics. It requires a user-specified value of the population standard deviation, which is rarely known.
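For readers who prefer a scripting language, the same one-sample t test can be run with SciPy. The sketch below is an illustration only: the interest rates listed are hypothetical stand-ins, not the actual Table 9.3 data, and a reasonably recent SciPy version (1.6 or later) is assumed for the alternative argument.

from scipy import stats

# Hypothetical interest rates (percent); substitute the Table 9.3 values.
rates = [17.9, 18.1, 18.4, 17.5, 18.6, 18.2, 17.8, 18.0, 18.3, 17.7,
         18.5, 17.6, 18.2, 17.9, 18.1]

# One-sided test of H0: mu = 18.8 versus Ha: mu < 18.8
t_stat, p_value = stats.ttest_1samp(rates, popmean=18.8, alternative='less')
print(round(t_stat, 3), round(p_value, 4))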

Hypothesis test for a population proportion in Exercise 9.72 on page 375:
• Select Stat: Basic Statistics : 1 Proportion
• In the “1 Proportion (Test and Confidence Interval)” dialog box, select the “Summarized data” option.
• Enter the sample number of successes (here equal to 146) into the “Number of events” window.
• Enter the sample size (here equal to 400) into the “Number of trials” window.
• Place a checkmark in the “Perform hypothesis test” checkbox.
• Enter the hypothesized proportion (here equal to 0.25) into the “Hypothesized proportion” window.
• Click on the Options… button.
• In the “1 Proportion—Options” dialog box, select the desired alternative (in this case “greater than”) from the Alternative drop-down menu.
• Place a checkmark in the “Use test and interval based on normal distribution” checkbox.
• Click OK in the “1 Proportion—Options” dialog box and click OK in the “1 Proportion (Test and Confidence Interval)” dialog box.
• The hypothesis test results are given in the Session window.

Appendix 9.2: One-Sample Hypothesis Testing Using Excel
The instruction block in this section begins by describing the entry of data into an Excel spreadsheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of the instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results.

Hypothesis test for a population mean in Figure 9.9(b) on page 368 (data file: CreditCd.xlsx):
The Data Analysis ToolPak in Excel does not explicitly provide for one-sample tests of hypotheses. A one-sample test can be conducted using the Descriptive Statistics component of the Analysis ToolPak and a few additional computations using Excel.
Descriptive statistics:
• Enter the interest rate data from Table 9.3 (page 367) into cells A2.A16 with the label Rate in cell A1.
• Select Data: Data Analysis : Descriptive Statistics.
• Click OK in the Data Analysis dialog box.
• In the Descriptive Statistics dialog box, enter A1.A16 into the Input Range box.
• Place a checkmark in the “Labels in first row” check box.
• Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet.
• Place a checkmark in the Summary Statistics checkbox.
• Click OK in the Descriptive Statistics dialog box.
The resulting block of descriptive statistics is displayed in the Output worksheet and the entries needed to carry out the test computations have been entered into the range D3.E6.
Computation of the test statistic and p-value:
• In cell E7, use the formula = (E3 − E4)/(E5/SQRT(E6)) to compute the test statistic t (= −4.970).
• Click on cell E8 and then select the Insert Function button on the Excel toolbar.
• In the Insert Function dialog box, select Statistical from the “Or select a category:” menu, select TDIST from the “Select a function:” menu, and click OK in the Insert Function dialog box.
• In the TDIST Function Arguments dialog box, enter abs(E7) in the X window.
• Enter 14 in the Deg_freedom window.
• Enter 1 in the Tails window to select a one-tailed test.
• Click OK in the TDIST Function Arguments dialog box.
• The p-value related to the test will be placed in cell E8.
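The two computations above (the t statistic and the one-tailed p-value) can also be checked outside Excel. This sketch assumes the SciPy library; the sample mean and standard deviation shown are placeholders for the values produced by the Descriptive Statistics output, not the actual data summary.

from math import sqrt
from scipy.stats import t

xbar, mu0, s, n = 17.5, 18.8, 1.8, 15    # placeholder summary statistics
t_stat = (xbar - mu0) / (s / sqrt(n))     # mirrors =(E3 - E4)/(E5/SQRT(E6))
p_value = t.cdf(t_stat, n - 1)            # one-tailed p-value for Ha: mu < mu0
print(round(t_stat, 3), round(p_value, 4))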
Appendix 9.3: One-Sample Hypothesis Testing Using MegaStat
The instructions in this section begin by describing the entry of data into an Excel worksheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data and saving and printing results in Excel. Please refer to Appendix 1.3 for more information about using MegaStat.

Hypothesis test for a population mean in Figure 9.10 on page 368 (data file: CreditCd.xlsx):
• Enter the interest rate data from Table 9.3 (page 367) into cells A2.A16 with the label Rate in cell A1.
• Select Add-Ins: MegaStat: Hypothesis Tests : Mean vs. Hypothesized Value
• In the “Hypothesis Test: Mean vs. Hypothesized Value” dialog box, click on “data input” and use the autoexpand feature to enter the range A1.A16 into the Input Range window.
• Enter the hypothesized value (here equal to 18.8) into the Hypothesized Mean window.
• Select the desired alternative (here “less than”) from the drop-down menu in the Alternative box.
• Click on t-test and click OK in the “Hypothesis Test: Mean vs. Hypothesized Value” dialog box.
• A hypothesis test employing summary data can be carried out by clicking on “summary data,” and by entering a range into the Input Range window that contains the following—label; sample mean; sample standard deviation; sample size n.
A z test can be carried out (in the unlikely event that the population standard deviation is known) by clicking on “z-test.”
Hypothesis test for a population proportion shown in Figure 9.11 in the electronic article surveillance situation on pages 373 and 374:

• Select Add-Ins: MegaStat: Hypothesis Tests : Proportion vs. Hypothesized Value
• In the “Hypothesis Test: Proportion vs. Hypothesized Value” dialog box, enter the hypothesized value (here equal to 0.05) into the “Hypothesized p” window.
• Enter the observed sample proportion (here equal to 0.16) into the “Observed p” window.
• Enter the sample size (here equal to 250) into the “n” window.
• Select the desired alternative (here “greater than”) from the drop-down menu in the Alternative box.
• Check the “Display confidence interval” checkbox (if desired), and select or type the appropriate level of confidence.
• Click OK in the “Hypothesis Test: Proportion vs. Hypothesized Value” dialog box.
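A quick way to verify the output of this test is to compute the z statistic and p-value directly. The sketch below (an illustration, assuming the SciPy library is available) uses the values entered above: a sample proportion of .16, a hypothesized proportion of .05, and n = 250.

from math import sqrt
from scipy.stats import norm

p_hat, p0, n = 0.16, 0.05, 250
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # large-sample z statistic
p_value = 1 - norm.cdf(z)                    # right-tail area for Ha: p > .05
print(round(z, 2), p_value)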

Hypothesis test for a population variance in the camshaft situation of Section 9.7 on pages 383 and 384:
• Enter a label (in this case Depth) into cell A1, the sample variance (here equal to .0885) into cell A2, and the sample size (here equal to 30) into cell A3.
• Select Add-Ins: MegaStat: Hypothesis Tests : Chi-square Variance Test
• Click on “summary input.”
• Enter the range A1.A3 into the Input Range window—that is, enter the range containing the data label, the sample variance, and the sample size.
• Enter the hypothesized value (here equal to 0.2209) into the “Hypothesized variance” window.
• Select the desired alternative (in this case “less than”) from the drop-down menu in the Alternative box.
• Check the “Display confidence interval” checkbox (if desired) and select or type the appropriate level of confidence.
• Click OK in the “Chi-square Variance Test” dialog box.
• A chi-square variance test may be carried out using data input by entering the observed sample values into a column in the Excel worksheet, and by then using the autoexpand feature to enter the range containing the label and sample values into the Input Range window.
1 This case is based on conversations by the authors with several employees working for a leading producer of trash bags. For purposes of confidentiality, we have agreed to withhold the company’s name.
2 Thanks to Krogers of Oxford, Ohio, for helpful discussions concerning this case.
3 Some statisticians suggest using the more conservative rule that both np0 and n(1 − p0) must be at least 10.
4 Source: “Driving Organic Growth at Bank of America” Quality Progress (February 2005), pp. 23–27.
5 Consumer Reports, January 2005, page 51.
(Bowerman 346)
Bowerman, Bruce L. Business Statistics in Practice, 5th Edition. McGraw-Hill Learning Solutions, 2008.
CHAPTER 10: Statistical Inferences Based on Two Samples
Chapter Outline

10.1
Comparing Two Population Means by Using Independent Samples: Variances Known

10.2
Comparing Two Population Means by Using Independent Samples: Variances Unknown

10.3
Paired Difference Experiments

10.4
Comparing Two Population Proportions by Using Large, Independent Samples

10.5
Comparing Two Population Variances by Using Independent Samples
Business improvement often requires making comparisons. For example, to increase consumer awareness of a product or service, it might be necessary to compare different types of advertising campaigns. Or to offer more profitable investments to its customers, an investment firm might compare the profitability of different investment portfolios. As a third example, a manufacturer might compare different production methods in order to minimize or eliminate out-of-specification product.
In this chapter we discuss using confidence intervals and hypothesis tests to compare two populations. Specifically, we compare two population means, two population variances, and two population proportions. We make these comparisons by studying differences and ratios. For instance, to compare two population means, say μ1 and μ2, we consider the difference between these means, μ1 − μ2. If, for example, we use a confidence interval or hypothesis test to conclude that μ1 − μ2 is a positive number, then we conclude that μ1 is greater than μ2. On the other hand, if a confidence interval or hypothesis test shows that μ1 − μ2 is a negative number, then we conclude that μ1 is less than μ2. As another example, if we compare two population variances, say σ1² and σ2², we might consider the ratio σ1²/σ2². If a hypothesis test shows that this ratio exceeds 1, then we can conclude that σ1² is greater than σ2².
We explain many of this chapter’s methods in the context of three new cases:

The Catalyst Comparison Case: The production supervisor at a chemical plant uses confidence intervals and hypothesis tests for the difference between two population means to determine which of two catalysts maximizes the hourly yield of a chemical process. By maximizing yield, the plant increases its productivity and improves its profitability.
The Repair Cost Comparison Case: In order to reduce the costs of automobile accident claims, an insurance company uses confidence intervals and hypothesis tests for the difference between two population means to compare repair cost estimates for damaged cars at two different garages.
The Advertising Media Case: An advertising agency is test marketing a new product by using one advertising campaign in Des Moines, Iowa, and a different campaign in Toledo, Ohio. The agency uses confidence intervals and hypothesis tests for the difference between two population proportions to compare the effectiveness of the two advertising campaigns.
10.1: Comparing Two Population Means by Using Independent Samples: Variances Known


A bank manager has developed a new system to reduce the time customers spend waiting to be served by tellers during peak business hours. We let μ1 denote the mean customer waiting time during peak business hours under the current system. To estimate μ1, the manager randomly selects n1 = 100 customers and records the length of time each customer spends waiting for service. The manager finds that the sample mean waiting time for these 100 customers is minutes. We let μ2 denote the mean customer waiting time during peak business hours for the new system. During a trial run, the manager finds that the mean waiting time for a random sample of n2 = 100 customers is minutes.
In order to compare μ1 and μ2, the manager estimates μ1 − μ2, the difference between μ1 and μ2. Intuitively, a logical point estimate of μ1 − μ2 is the difference between the sample means

x̄1 − x̄2 = 3.65 minutes
This says we estimate that the current mean waiting time is 3.65 minutes longer than the mean waiting time under the new system. That is, we estimate that the new system reduces the mean waiting time by 3.65 minutes.

To compute a confidence interval for μ1 − μ2 (or to test a hypothesis about μ1 − μ2), we need to know the properties of the sampling distribution of x̄1 − x̄2. To understand this sampling distribution, consider randomly selecting a sample1 of n1 measurements from a population having mean μ1 and variance σ1². Let x̄1 be the mean of this sample. Also consider randomly selecting a sample of n2 measurements from another population having mean μ2 and variance σ2². Let x̄2 be the mean of this sample. Different samples from the first population would give different values of x̄1, and different samples from the second population would give different values of x̄2, so different pairs of samples from the two populations would give different values of x̄1 − x̄2. In the following box we describe the sampling distribution of x̄1 − x̄2, which is the probability distribution of all possible values of x̄1 − x̄2:
The Sampling Distribution of x̄1 − x̄2

If the randomly selected samples are independent of each other,2 then the population of all possible values of x̄1 − x̄2

1 Has a normal distribution if each sampled population has a normal distribution, or has approximately a normal distribution if the sampled populations are not normally distributed and each of the sample sizes n1 and n2 is large.

2 Has mean μ1 − μ2.

3 Has standard deviation √(σ1²/n1 + σ2²/n2).
Figure 10.1 illustrates the sampling distribution of x̄1 − x̄2. Using this sampling distribution, we can find a confidence interval for μ1 − μ2 and test a hypothesis about μ1 − μ2. Although the interval and test assume that the true values of the population variances σ1² and σ2² are known, we believe that they are worth presenting because they provide a simple introduction to the basic idea of comparing two population means. Readers who wish to proceed more quickly to the more practical t-based procedures of the next section may skip the rest of this section without loss of continuity.
Figure 10.1: The Sampling Distribution of x̄1 − x̄2 Has Mean μ1 − μ2 and Standard Deviation √(σ1²/n1 + σ2²/n2)

A z-Based Confidence Interval for the Difference between Two Population Means, when σ1 and σ2 are Known
Let x̄1 be the mean of a sample of size n1 that has been randomly selected from a population with mean μ1 and standard deviation σ1, and let x̄2 be the mean of a sample of size n2 that has been randomly selected from a population with mean μ2 and standard deviation σ2. Furthermore, suppose that each sampled population is normally distributed, or that each of the sample sizes n1 and n2 is large. Then, if the samples are independent of each other, a 100(1 − α) percent confidence interval for μ1 − μ2 is

[(x̄1 − x̄2) ± zα/2 √(σ1²/n1 + σ2²/n2)]
EXAMPLE 10.1: The Bank Customer Waiting Time Case
Suppose that the random sample of n1 = 100 waiting times observed under the current system and the random sample of n2 = 100 waiting times observed during the trial run of the new system give the sample means reported previously, so that x̄1 − x̄2 = 3.65 minutes. Assuming that σ1² is known to equal 4.7 and σ2² is known to equal 1.9, and noting that each sample is large, a 95 percent confidence interval for μ1 − μ2 is

[(x̄1 − x̄2) ± z.025 √(σ1²/n1 + σ2²/n2)] = [3.65 ± 1.96 √(4.7/100 + 1.9/100)] = [3.65 ± .50] = [3.15, 4.15]
This interval says we are 95 percent confident that the new system reduces the mean waiting time by between 3.15 minutes and 4.15 minutes.
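The endpoints of this interval are easy to verify. A minimal Python sketch (assuming the SciPy library is available) using the difference in sample means of 3.65 minutes, the known variances 4.7 and 1.9, and n1 = n2 = 100:

from math import sqrt
from scipy.stats import norm

diff = 3.65                      # difference in sample mean waiting times (minutes)
var1, var2, n1, n2 = 4.7, 1.9, 100, 100

se = sqrt(var1 / n1 + var2 / n2)
z = norm.ppf(0.975)              # 1.96 for 95 percent confidence
print(round(diff - z * se, 2), round(diff + z * se, 2))   # about 3.15 and 4.15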
Suppose we wish to test a hypothesis about μ1 − μ2. In the following box we describe how this can be done. Here we test the null hypothesis H0: μ1 − μ2 = D0, where D0 is a number whose value varies depending on the situation.
A z Test about the Difference between Two Population Means when σ1 and σ2 Are Known
Let all notation be as defined in the preceding box, and define the test statistic

z = ((x̄1 − x̄2) − D0) / √(σ1²/n1 + σ2²/n2)
Assume that each sampled population is normally distributed, or that each of the sample sizes n1 and n2 is large. Then, if the samples are independent of each other, we can test H0: μ1 − μ2 = D0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value.

Ha: μ1 − μ2 > D0: reject H0 if z > zα; the p-value is the area under the standard normal curve to the right of z.
Ha: μ1 − μ2 < D0: reject H0 if z < −zα; the p-value is the area under the standard normal curve to the left of z.
Ha: μ1 − μ2 ≠ D0: reject H0 if |z| > zα/2; the p-value is twice the area under the standard normal curve to the right of |z|.
Often D0 will be the number 0. In such a case, the null hypothesis H0: μ1 − μ2 = 0 says there is no difference between the population means μ1 and μ2. For example, in the bank customer waiting time situation, the null hypothesis H0: μ1 − μ2 = 0 says there is no difference between the mean customer waiting times under the current and new systems. When D0 is 0, each alternative hypothesis in the box implies that the population means μ1 and μ2 differ. For instance, in the bank waiting time situation, the alternative hypothesis Ha: μ1 − μ2 > 0 says that the current mean customer waiting time is longer than the new mean customer waiting time. That is, this alternative hypothesis says that the new system reduces the mean customer waiting time.
EXAMPLE 10.2: The Bank Customer Waiting Time Case
To attempt to provide evidence supporting the claim that the new system reduces the mean bank customer waiting time, we will test
H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 > 0 at the .05 level of significance. To perform the hypothesis test, we will use the sample information in Example 10.1 to calculate the value of the test statistic z in the summary box. Then, since Ha: μ1 − μ2 > 0 is of the form Ha: μ1 − μ2 > D0, we will reject H0: μ1 − μ2 = 0 if the value of z is greater than zα = z.05 = 1.645. Assuming that σ1² equals 4.7 and σ2² equals 1.9, the value of the test statistic is

z = ((x̄1 − x̄2) − 0) / √(σ1²/n1 + σ2²/n2) = 3.65 / √(4.7/100 + 1.9/100) = 3.65 / .2569 = 14.21
Because z = 14.21 is greater than z.05 = 1.645, we reject H0: μ1 − μ2 = 0 in favor of Ha: μ1 − μ2 > 0. We conclude (at an α of .05) that μ1 − μ2 is greater than 0 and, therefore, that the new system reduces the mean customer waiting time. Furthermore, the point estimate says we estimate that the new system reduces mean waiting time by 3.65 minutes. The p-value for the test is the area under the standard normal curve to the right of z = 14.21. Because this p-value is less than .00003, it provides extremely strong evidence that H0 is false and that Ha is true. That is, we have extremely strong evidence that μ1 − μ2 is greater than 0 and, therefore, that the new system reduces the mean customer waiting time.
Next, suppose that because of cost considerations, the bank manager wants to implement the new system only if it reduces mean waiting time by more than three minutes. In order to demonstrate that μ1 − μ2 is greater than 3, the manager (setting D0 equal to 3) will attempt to reject the null hypothesis H0: μ1 − μ2 = 3 in favor of the alternative hypothesis Ha: μ1 − μ2 > 3 at the .05 level of significance. To perform the hypothesis test, we compute

z = ((x̄1 − x̄2) − 3) / √(σ1²/n1 + σ2²/n2) = (3.65 − 3) / .2569 = 2.53
Because z = 2.53 is greater than z.05 = 1.645, we can reject H0: μ1 − μ2 = 3 in favor of Ha: μ1 − μ2 > 3. The p-value for the test is the area under the standard normal curve to the right of z = 2.53. Table A.3 (page 863) tells us that this area is 1 − .9943 = .0057. Therefore, we have very strong evidence against H0: μ1 − μ2 = 3 and in favor of Ha: μ1 − μ2 > 3. In other words, we have very strong evidence that the new system reduces mean waiting time by more than three minutes.
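Both tests in Example 10.2 follow the same pattern, so a small helper function makes the arithmetic easy to check. The following Python sketch is an illustration only (it assumes the SciPy library is available); it reproduces z = 14.21 for D0 = 0 and z = 2.53 for D0 = 3.

from math import sqrt
from scipy.stats import norm

def two_sample_z(diff, d0, var1, var2, n1, n2):
    # z statistic and right-tail p-value for H0: mu1 - mu2 = d0 versus Ha: mu1 - mu2 > d0
    z = (diff - d0) / sqrt(var1 / n1 + var2 / n2)
    return z, 1 - norm.cdf(z)

for d0 in (0, 3):
    z, p = two_sample_z(3.65, d0, 4.7, 1.9, 100, 100)
    print(d0, round(z, 2), round(p, 4))   # the p-value for d0 = 3 is about .0057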
Exercises for Section 10.1
CONCEPTS

10.1 Suppose we compare two population means, μ1 and μ2, and consider the difference μ1 − μ2. In each case, indicate how μ1 relates to μ2 (that is, is μ1 greater than, less than, equal to, or not equal to μ2)?
a μ1 − μ2 < 0 b μ1 − μ2 = 0 c μ1 − μ2 < −10 d μ1 − μ2 > 0
e μ1 − μ2 > 20
f μ1 − μ2 ≠ 0
10.2 Suppose we compute a 95 percent confidence interval for μ1 − μ2. If the interval is
a [3, 5], can we be 95 percent confident that μ1 is greater than μ2? Why or why not?
b [3, 5], can we be 95 percent confident that μ1 is not equal to μ2? Why or why not?
c [−20, − 10], can we be 95 percent confident that μ1 is not equal to μ2? Why or why not?
d [−20, − 10], can we be 95 percent confident that μ1 is greater than μ2? Why or why not?
e [−3, 2], can we be 95 percent confident that μ1 is not equal to μ2? Why or why not?
f [−10, 10], can we be 95 percent confident that μ1 is less than μ2? Why or why not?
g [−10, 10], can we be 95 percent confident that μ1 is greater than μ2? Why or why not?
10.3 In order to employ the formulas and tests of this section, the samples that have been randomly selected from the populations being compared must be independent of each other. In such a case, we say that we are performing an
independent samples experiment. In your own words, explain what it means when we say that samples are independent of each other.
10.4 Describe the assumptions that must be met in order to validly use the methods of Section 10.1.
METHODS AND APPLICATIONS
10.5 Suppose we randomly select two independent samples from populations having means μ1 and μ2. If , , σ1 = 3, σ2 = 4, n1 = 100, and n2 = 100:
a Calculate a 95 percent confidence interval for μ1 − μ2. Can we be 95 percent confident that μ1 is greater than μ2? Explain.
b Test the null hypothesis H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 > 0 by setting α = .05. What do you conclude about how μ1 compares to μ2?
c Find the p-value for testing H0: μ1 − μ2 = 4 versus Ha: μ1 − μ2 > 4. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001.
10.6 Suppose we select two independent random samples from populations having means μ1 and μ2. If , , σ1 = 6, σ2 = 8, n1 = 625, and n2 = 625:
a Calculate a 95 percent confidence interval for μ1 − μ2. Can we be 95 percent confident that μ2 is greater than μ1? By how much? Explain.
b Test the null hypothesis H0: μ1 − μ2 = −10 versus Ha: μ1 − μ2 < −10 by setting α = .05. What do you conclude?
c Test the null hypothesis H0: μ1 − μ2 = −10 versus Ha: μ1 − μ2 ≠ −10 by setting α equal to .01. What do you conclude?
d Find the p-value for testing H0: μ1 − μ2 = −10 versus Ha: μ1 − μ2 ≠ −10. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001.

10.7 In an article in Accounting and Business Research, Carslaw and Kaplan study the effect of control (owner versus manager control) on audit delay (the length of time from a company’s financial year-end to the date of the auditor’s report) for public companies in New Zealand. Suppose a random sample of 100 public owner-controlled companies in New Zealand gives a mean audit delay of days, while a random sample of 100 public manager-controlled companies in New Zealand gives a mean audit delay of days. Assuming the samples are independent and that σ1 = 32.83 and σ2 = 37.18:
a Let μ1 be the mean audit delay for all public owner-controlled companies in New Zealand, and let μ2 be the mean audit delay for all public manager-controlled companies in New Zealand. Calculate a 95 percent confidence interval for μ1 − μ2. Based on this interval, can we be 95 percent confident that the mean audit delay for all public owner-controlled companies in New Zealand is less than that for all public manager-controlled companies in New Zealand? If so, by how much?
b Consider testing the null hypothesis H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 < 0. Interpret (in writing) the meaning (in practical terms) of each of H0 and Ha.
c Use a critical value to test the null hypothesis H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 < 0 at the .05 level of significance. Based on this test, what do you conclude about how μ1 and μ2 compare? Write your conclusion in practical terms.
d Find the p-value for testing H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 < 0. Use the p-value to test H0 versus Ha by setting α equal to .10, .05, .025, .01, and .001. How much evidence is there that μ1 is less than μ2?

10.8 In an article in the Journal of Management, Wright and Bonett study the relationship between voluntary organizational turnover and such factors as work performance, work satisfaction, and company tenure. As part of the study, the authors compare work performance ratings for “stayers” (employees who stay in their organization) and “leavers” (employees who voluntarily quit their jobs). Suppose that a random sample of 175 stayers has a mean performance rating (on a 20-point scale) of , and that a random sample of 140 leavers has a mean performance rating of . Assuming these random samples are independent and that σ1 = 3.7 and σ2 = 4.5:
a Let μ1 be the mean performance rating for stayers, and let μ2 be the mean performance rating for leavers. Use the sample information to calculate a 99 percent confidence interval for μ1 − μ2. Based on this interval, can we be 99 percent confident that the mean performance rating for leavers is greater than the mean performance rating for stayers? What are the managerial implications of this result?
b Set up the null and alternative hypotheses needed to try to establish that the mean performance rating for leavers is higher than the mean performance rating for stayers.
c Use critical values to test the hypotheses you set up in part b by setting α equal to .10, .05, .01, and .001. How much evidence is there that leavers have a higher mean performance rating than do stayers?
10.9 An Ohio university wishes to demonstrate that car ownership is detrimental to academic achievement. A random sample of 100 students who do not own cars had a mean grade point average (GPA) of 2.68, while a random sample of 100 students who own cars had a mean GPA of 2.55. a Assuming that the independence assumption holds, and letting μ1 = the mean GPA for all students who do not own cars, and μ2 = the mean GPA for all students who own cars, use the above data to compute a 95 percent confidence interval for μ1 − μ2. Assume here that σ1 = .7 and σ2 = .6. b On the basis of the interval calculated in part a, can the university claim that car ownership is associated with decreased academic achievement? That is, can the university justify that μ1 is greater than μ2? Explain. c Set up the null and alternative hypotheses that should be used to attempt to justify that the mean GPA for non–car owners is higher than the mean GPA for car owners. d Test the hypotheses that you set up in part c with α = .05. Again assume that σ1 = .7 and σ2 = .6. Interpret the results of this test. That is, what do your results say about whether car ownership is associated with decreased academic achievement? 10.10 In the Journal of Marketing, Bayus studied differences between “early replacement buyers” and “late replacement buyers.” Suppose that a random sample of 800 early replacement buyers yields a mean number of dealers visited of , and that a random sample of 500 late replacement buyers yields a mean number of dealers visited of Assuming that these samples are independent: a Let μ1 be the mean number of dealers visited by early replacement buyers, and let μ2 be the mean number of dealers visited by late replacement buyers. Calculate a 95 percent confidence interval for μ2 − μ1. Assume here that σ1 = .71 and σ2 = .66. Based on this interval, can we be 95 percent confident that on average late replacement buyers visit more dealers than do early replacement buyers? b Set up the null and alternative hypotheses needed to attempt to show that the mean number of dealers visited by late replacement buyers exceeds the mean number of dealers visited by early replacement buyers by more than 1. c Test the hypotheses you set up in part b by using critical values and by setting α equal to .10, .05, .01, and .001. How much evidence is there that H0 should be rejected? d Find the p-value for testing the hypotheses you set up in part b. Use the p-value to test these hypotheses with α equal to .10, .05, .01, and .001. How much evidence is there that H0 should be rejected? Explain your conclusion in practical terms. e Do you think that the results of the hypothesis tests in parts c and d have practical significance? Explain and justify your answer. 10.11 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil H. Firtle discuss a corporate image study designed to find out whether perceptions of technical support services vary depending on the position of the respondent in the organization. The management of a company that supplies telephone cable to telephone companies commissioned a media campaign primarily designed to (1) increase awareness of the company and (2) create favorable perceptions of the company’s technical support. The campaign was targeted to purchasing managers and technical managers at independent telephone companies with greater than 10,000 trunk lines. Perceptual ratings were measured with a nine-point agree–disagree scale. 
Suppose the results of a telephone survey of 175 technical managers and 125 purchasing managers reveal that the mean perception score for technical managers is 7.3 and that the mean perception score for purchasing managers is 8.2. a Let μ1 be the mean perception score for all purchasing managers, and let μ2 be the mean perception score for all technical managers. Set up the null and alternative hypotheses needed to establish whether the mean perception scores for purchasing managers and technical managers differ. Hint: If μ1 and μ2 do not differ, what does μ1 − μ2 equal? b Assuming that the samples of 175 technical managers and 125 purchasing managers are independent random samples, test the hypotheses you set up in part a by using a critical value with α = .05. Assume here that σ1 = 1.6 and σ2 = 1.4. What do you conclude about whether the mean perception scores for purchasing managers and technical managers differ? c Find the p-value for testing the hypotheses you set up in part a. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean perception scores for purchasing managers and technical managers differ? d Calculate a 99 percent confidence interval for μ1 − μ2. Interpret this interval.
10.2: Comparing Two Population Means by Using Independent Samples: Variances Unknown
Suppose that (as is usually the case) the true values of the population variances σ1² and σ2² are not known. We then estimate σ1² and σ2² by using s1² and s2², the variances of the samples randomly selected from the populations being compared. There are two approaches to doing this. The first approach assumes that the population variances σ1² and σ2² are equal. Denoting the common value of these variances as σ², it follows that σ1² = σ2² = σ². Because we are assuming that σ1² = σ2², we do not need separate estimates of σ1² and σ2². Instead, we combine the results of the two independent random samples to compute a single estimate of σ². This estimate is called the pooled estimate of σ², and it is a weighted average of the two sample variances s1² and s2². Denoting the pooled estimate as sp², it is computed using the formula
sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
Using sp², the estimate of the standard deviation of x̄1 − x̄2 is √(sp²(1/n1 + 1/n2)), and we form the statistic
t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(sp²(1/n1 + 1/n2))
It can be shown that, if we have randomly selected independent samples from two normally distributed populations having equal variances, then the sampling distribution of this statistic is a t distribution having (n1 + n2 − 2) degrees of freedom. Therefore, we can obtain the following confidence interval for μ1 − μ2:
A t-Based Confidence Interval for the Difference between Two Population Means: Equal Variances
Suppose we have randomly selected independent samples from two normally distributed populations having equal variances. Then, a 100(1 − α) percent confidence interval for μ1 − μ2 is
[(x̄1 − x̄2) ± tα/2 √(sp²(1/n1 + 1/n2))]
where sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) and tα/2 is based on (n1 + n2 − 2) degrees of freedom.
EXAMPLE 10.3: The Catalyst Comparison Case
A production supervisor at a major chemical company must determine which of two catalysts, catalyst XA-100 or catalyst ZB-200, maximizes the hourly yield of a chemical process. In order to compare the mean hourly yields obtained by using the two catalysts, the supervisor runs the process using each catalyst for five one-hour periods. The resulting yields (in pounds per hour) for each catalyst, along with the means, variances, and box plots3 of the yields, are given in Table 10.1.
Assuming that all other factors affecting yields of the process have been held as constant as possible during the test runs, it seems reasonable to regard the five observed yields for each catalyst as a random sample from the population of all possible hourly yields for the catalyst. Furthermore, since the sample variances do not differ substantially (notice that s1 = 19.65 and s2 = 22.00 differ by even less), it might be reasonable to conclude that the population variances are approximately equal.4 It follows that the pooled estimate is a point estimate of the common variance σ2. We define μ1 as the mean hourly yield obtained by using catalyst XA-100, and we define μ2 as the mean hourly yield obtained by using catalyst ZB-200. If the populations of all possible hourly yields for the catalysts are normally distributed, then a 95 percent confidence interval for μ1 − μ2 is Here t.025 = 2.306 is based on n1 + n2 − 2 = 5 + 5 − 2 = 8 degrees of freedom. This interval tells us that we are 95 percent confident that the mean hourly yield obtained by using catalyst XA-100 is between 30.38 and 91.22 pounds higher than the mean hourly yield obtained by using catalyst ZB-200. Suppose we wish to test a hypothesis about μ1 − μ2. In the following box we describe how this can be done. Here we test the null hypothesis H0: μ1 − μ2 = D0, where D0 is a number whose value varies depending on the situation. Often D0 will be the number 0. In such a case, the null hypothesis H0: μ1 − μ2 = 0 says there is no difference between the population means μ1 and μ2. In this case, each alternative hypothesis in the box implies that the population means μ1 and μ2 differ in a particular way. A t Test about the Difference between Two Population Means: Equal Variances Define the test statistic and assume that the sampled populations are normally distributed with equal variances. Then, if the samples are independent of each other, we can test H0: μ1− μ2 = D0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value. Here tα, tα/2, and the p-values are based on n1 + n2 − 2 degrees of freedom. EXAMPLE 10.4: The Catalyst Comparison Case In order to compare the mean hourly yields obtained by using catalysts XA-100 and ZB-200, we will test H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 ≠ 0 at the .05 level of significance. To perform the hypothesis test, we will use the sample information in Table 10.1 to calculate the value of the test statistic t in the summary box. Then, because Ha: μ1 − μ2 ≠ 0 is of the form Ha: μ1 − μ2 ≠ D0, we will reject H0: μ1 − μ2 = 0 if the absolute value of t is greater than tα/2 = t.025 = 2.306. Here the tα/2 point is based on n1 + n2 − 2 = 5 + 5 − 2 = 8 degrees of freedom. Using the data in Table 10.1, the value of the test statistic is Table 10.1: Yields of a Chemical Process Obtained Using Two Catalysts Catalyst Because| t | = 4.6087 is greater than t.025 = 2.306, we can reject H0: μ1 − μ2 = 0 in favor of: Ha: μ1 − μ2 ≠ 0. We conclude (at an α of .05) that the mean hourly yields obtained by using the two catalysts differ. Furthermore, the point estimate says we estimate that the mean hourly yield obtained by using catalyst XA-100 is 60.8 pounds higher than the mean hourly yield obtained by using catalyst ZB-200. Figures 10.2(a) and (b) give the MegaStat and Excel outputs for testing H0 versus Ha. 
The outputs tell us that t = 4.61 and that the associated p-value is .001736 (rounded to .0017 on the MegaStat output). The very small p-value tells us that we have very strong evidence against H0: μ1 − μ2 = 0 and in favor of Ha: μ1 − μ2 ≠ 0. In other words, we have very strong evidence that the mean hourly yields obtained by using the two catalysts differ. Finally, notice that the MegaStat output gives the 95 percent confidence interval for μ1 − μ2, which is [30.378, 91.222].
When the sampled populations are normally distributed and the population variances σ1² and σ2² differ, the following can be shown.
t-Based Confidence Intervals for μ1 − μ2, and t Tests about μ1 − μ2: Unequal Variances
1 When the sample sizes n1 and n2 are equal, the "equal variances" t-based confidence interval and hypothesis test given in the preceding two boxes are approximately valid even if the population variances σ1² and σ2² differ substantially. As a rough rule of thumb, if the larger sample variance is not more than three times the smaller sample variance when the sample sizes are equal, we can use the equal variances interval and test.
2 Suppose that the larger sample variance is more than three times the smaller sample variance when the sample sizes are equal, or suppose that both the sample sizes and the sample variances differ substantially. Then, we can use an approximate procedure that is sometimes called an "unequal variances" procedure. This procedure says that an approximate 100(1 − α) percent confidence interval for μ1 − μ2 is
[(x̄1 − x̄2) ± tα/2 √(s1²/n1 + s2²/n2)]
Furthermore, we can test H0: μ1 − μ2 = D0 by using the test statistic
t = [(x̄1 − x̄2) − D0] / √(s1²/n1 + s2²/n2)
and by using the previously given critical value and p-value conditions. For both the interval and the test, the degrees of freedom are equal to
df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]
Here, if df is not a whole number, we can round df down to the next smallest whole number.
In general, both the "equal variances" and the "unequal variances" procedures have been shown to be approximately valid when the sampled populations are only approximately normally distributed (say, if they are mound-shaped). Furthermore, although the above summary box might seem to imply that we should use the unequal variances procedure only if we cannot use the equal variances procedure, this is not necessarily true. In fact, since the unequal variances procedure can be shown to be a very accurate approximation whether or not the population variances are equal and for most sample sizes (here, both n1 and n2 should be at least 5), many statisticians believe that it is best to use the unequal variances procedure in almost every situation. If each of n1 and n2 is large (at least 30), both the equal variances procedure and the unequal variances procedure are approximately valid, no matter what probability distributions describe the sampled populations.
To illustrate the unequal variances procedure, consider the bank customer waiting time situation, and recall that μ1 − μ2 is the difference between the mean customer waiting time under the current system and the mean customer waiting time under the new system. Because of cost considerations, the bank manager wants to implement the new system only if it reduces the mean waiting time by more than three minutes. Therefore, the manager will test the null hypothesis H0: μ1 − μ2 = 3 versus the alternative hypothesis Ha: μ1 − μ2 > 3. If H0 can be rejected in favor of Ha at the .05 level of significance, the manager will implement the new system.
Suppose that a random sample of n1 = 100 waiting times observed under the current system gives a sample mean and a sample variance, and that a random sample of n2 = 100 waiting times observed during the trial run of the new system yields a sample mean and a sample variance. Since each sample is large, we can use the unequal variances test statistic t in the summary box. The degrees of freedom for this statistic are
df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]
which we will round down to 163. Therefore, because Ha: μ1 − μ2 > 3 is of the form Ha: μ1 − μ2 > D0, we will reject H0: μ1 − μ2 = 3 if the value of the test statistic t is greater than tα = t.05 = 1.65 (which is based on 163 degrees of freedom and has been found using a computer). Using the sample data, the value of the test statistic is

Because t = 2.53 is greater than t.05 = 1.65, we reject H0: μ1 − μ2 = 3 in favor of Ha: μ1 − μ2 > 3. We conclude (at an α of .05) that μ1 − μ2 is greater than 3 and, therefore, that the new system reduces the mean customer waiting time by more than 3 minutes. Therefore, the bank manager will implement the new system. Furthermore, the point estimate x̄1 − x̄2 = 3.65 says that we estimate that the new system reduces mean waiting time by 3.65 minutes.
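As a rough computational check of the unequal variances procedure just used, the following Python sketch computes the Welch-style test statistic and degrees of freedom from summary statistics. Because the sample means and variances for the waiting time example are not reproduced in the text above, the numbers assigned to xbar1, xbar2, var1, and var2 below are placeholder values chosen only for illustration; with the actual summary statistics in place, the computation should reproduce the t = 2.53 and df = 163 reported above.

```python
import math
from scipy import stats

# Placeholder summary statistics (hypothetical values for illustration only)
n1, xbar1, var1 = 100, 8.8, 4.8   # current system: size, mean, sample variance
n2, xbar2, var2 = 100, 5.2, 1.8   # new system: size, mean, sample variance
D0 = 3.0                          # hypothesized difference under H0: mu1 - mu2 = 3

se = math.sqrt(var1 / n1 + var2 / n2)          # estimated standard error
t = (xbar1 - xbar2 - D0) / se                  # unequal variances test statistic

# Welch-Satterthwaite degrees of freedom, rounded down as in the text
df = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)
df = math.floor(df)

p_value = stats.t.sf(t, df)                    # one-sided p-value for Ha: mu1 - mu2 > 3
print(t, df, p_value)
```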
Figure 10.3 gives the MegaStat output of using the unequal variances procedure to test H0: μ1 − μ2 = 3 versus Ha: μ1 − μ2 > 3. The output tells us that t = 2.53 and that the associated p-value is .0062. The very small p-value tells us that we have very strong evidence against H0: μ1 − μ2 = 3 and in favor of Ha: μ1 − μ2 > 3. That is, we have very strong evidence that μ1 − μ2 is greater than 3 and, therefore, that the new system reduces the mean customer waiting time by more than 3 minutes. To find a 95 percent confidence interval for μ1 − μ2, note that we can use a computer to find that t.025 based on 163 degrees of freedom is 1.97. It follows that the 95 percent confidence interval for μ1 − μ2 is
(x̄1 − x̄2) ± 1.97 √(s1²/n1 + s2²/n2) = [3.14, 4.16]
Figure 10.3: MegaStat Output of the Unequal Variances Procedure for the Bank Customer Waiting Time Situation
This interval is given on the MegaStat output and says that we are 95 percent confident that the new system reduces the mean customer waiting time by between 3.14 minutes and 4.16 minutes.
In general, the degrees of freedom for the unequal variances procedure will always be less than or equal to n1 + n2 − 2, the degrees of freedom for the equal variances procedure. For example, if we use the unequal variances procedure to analyze the catalyst comparison data in Table 10.1, we can calculate df to be 7.9. This is slightly less than n1 + n2 − 2 = 5 + 5 − 2 = 8, the degrees of freedom for the equal variances procedure. Figure 10.4 gives the MINITAB output of the unequal variances analysis of the catalyst comparison data. Note that MINITAB rounds df down to 7 and finds that a 95 percent confidence interval for μ1 − μ2 is [29.6049, 91.9951]. MINITAB also finds that the test statistic for testing H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 ≠ 0 is t = 4.61 and that the associated p-value is .002. These results do not differ by much from the results given by the equal variances procedure (see Figure 10.2).
Figure 10.2: MegaStat and Excel Outputs for Testing the Equality of Means in the Catalyst Comparison Case Assuming Equal Variances

Figure 10.4: MINITAB Output of the Unequal Variances Procedure for the Catalyst Comparison Case

To conclude this section, it is important to point out that if the sample sizes n1 and n2 are not large (at least 30), and if we fear that the sampled populations might be far from normally distributed, we can use a nonparametric method. One nonparametric method for comparing populations when using independent samples is the Wilcoxon rank sum test. This test is discussed in Section 18.2 (pages 806–810).
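Before turning to the exercises, here is a brief Python sketch showing one way both procedures of this section can be run in software rather than by hand. The two small samples below are invented for illustration only; they are not the catalyst or waiting time data discussed above.

```python
import numpy as np
from scipy import stats

# Invented example data (not the textbook data sets)
sample1 = np.array([23.1, 25.4, 22.8, 26.0, 24.3])
sample2 = np.array([20.9, 22.5, 21.7, 19.8, 23.0])

# Equal variances (pooled) t test and unequal variances (Welch) t test
t_pooled, p_pooled = stats.ttest_ind(sample1, sample2, equal_var=True)
t_welch, p_welch = stats.ttest_ind(sample1, sample2, equal_var=False)

# Pooled-variance 95 percent confidence interval for mu1 - mu2
n1, n2 = len(sample1), len(sample2)
sp2 = ((n1 - 1) * sample1.var(ddof=1) + (n2 - 1) * sample2.var(ddof=1)) / (n1 + n2 - 2)
half_width = stats.t.ppf(0.975, n1 + n2 - 2) * np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = sample1.mean() - sample2.mean()

print(t_pooled, p_pooled)                      # equal variances results
print(t_welch, p_welch)                        # unequal variances results
print((diff - half_width, diff + half_width))  # pooled 95 percent CI
```

Because the two-sided p-values returned by scipy correspond to the "not equal to" alternative, they would be halved (with attention to the sign of t) for a one-sided test such as those in this section.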
Exercises for Section 10.2
CONCEPTS

For each of the formulas described below, list all of the assumptions that must be satisfied in order to validly use the formula.
10.12 The confidence interval formula in the formula box on page 401.
10.13 The confidence interval formula in the formula box on page 404.
10.14 The hypothesis test described in the formula box on page 403.
10.15 The hypothesis test described in the formula box on page 404.
METHODS AND APPLICATIONS
Suppose we have taken independent, random samples of sizes n1 = 7 and n2 = 7 from two normally distributed populations having means μ1 and μ2, and suppose we obtain x̄1 = , x̄2 = , s1 = 5, and s2 = 6. Using the equal variances procedure, do Exercises 10.16, 10.17, and 10.18 (a computational sketch follows Exercise 10.19 below).
10.16 Calculate a 95 percent confidence interval for μ1 − μ2. Can we be 95 percent confident that μ1 − μ2 is greater than 20? Explain why we can use the equal variances procedure here.
10.17 Use critical values to test the null hypothesis H0: μ1 − μ2 ≤ 20 versus the alternative hypothesis Ha: μ1 − μ2 > 20 by setting α equal to .10, .05, .01, and .001. How much evidence is there that the difference between μ1 and μ2 exceeds 20?
10.18 Use critical values to test the null hypothesis H0: μ1 − μ2 = 20 versus the alternative hypothesis Ha: μ1 − μ2 ≠ 20 by setting α equal to .10, .05, .01, and .001. How much evidence is there that the difference between μ1 and μ2 is not equal to 20?
10.19 Repeat Exercises 10.16 through 10.18 using the unequal variances procedure. Compare your results to those obtained using the equal variances procedure.
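As noted in the setup before Exercise 10.16, the following Python sketch shows how the equal variances calculations can be organized from summary statistics alone. The sample means xbar1 and xbar2 are left as placeholders because their values are not reproduced in the text above; s1 = 5, s2 = 6, and n1 = n2 = 7 come from the exercise setup.

```python
import math
from scipy import stats

n1, n2 = 7, 7
s1, s2 = 5.0, 6.0
xbar1, xbar2 = 0.0, 0.0   # placeholders: substitute the sample means from the exercise setup
D0 = 20.0                 # hypothesized difference for Exercises 10.17 and 10.18

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance estimate
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

t = (xbar1 - xbar2 - D0) / se                  # test statistic for H0: mu1 - mu2 = 20
ci_half = stats.t.ppf(0.975, df) * se          # 95 percent CI half-width
print(t, (xbar1 - xbar2 - ci_half, xbar1 - xbar2 + ci_half))
```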
10.20 The October 7, 1991, issue of Fortune magazine reported on the rapid rise of fees and expenses charged by mutual funds. Assuming that stock fund expenses and municipal bond fund expenses are each approximately normally distributed, suppose a random sample of 12 stock funds gives a mean annual expense of 1.63 percent with a standard deviation of .31 percent, and an independent random sample of 12 municipal bond funds gives a mean annual expense of 0.89 percent with a standard deviation of .23 percent. Let μ1 be the mean annual expense for stock funds, and let μ2 be the mean annual expense for municipal bond funds. Do parts (a), (b), and (c) by using the equal variances procedure. Then repeat (a), (b), and (c) using the unequal variances procedure. Compare your results.
a Set up the null and alternative hypotheses needed to attempt to establish that the mean annual expense for stock funds is larger than the mean annual expense for municipal bond funds. Test these hypotheses at the .05 level of significance. What do you conclude?
b Set up the null and alternative hypotheses needed to attempt to establish that the mean annual expense for stock funds exceeds the mean annual expense for municipal bond funds by more than .5 percent. Test these hypotheses at the .05 level of significance. What do you conclude?
c Calculate a 95 percent confidence interval for the difference between the mean annual expenses for stock funds and municipal bond funds. Can we be 95 percent confident that the mean annual expense for stock funds exceeds that for municipal bond funds by more than .5 percent? Explain.
10.21 In the book Business Research Methods, Donald R. Cooper and C. William Emory (1995) discuss a manager who wishes to compare the effectiveness of two methods for training new salespeople. The authors describe the situation as follows:
The company selects 22 sales trainees who are randomly divided into two experimental groups—one receives type A and the other type B training. The salespeople are then assigned and managed without regard to the training they have received. At the year’s end, the manager reviews the performances of salespeople in these groups and finds the following results:

a Set up the null and alternative hypotheses needed to attempt to establish that type A training results in higher mean weekly sales than does type B training.
b Because different sales trainees are assigned to the two experimental groups, it is reasonable to believe that the two samples are independent. Assuming that the normality assumption holds, and using the equal variances procedure, test the hypotheses you set up in part a at levels of significance .10, .05, .01, and .001. How much evidence is there that type A training produces results that are superior to those of type B?
c Use the equal variances procedure to calculate a 95 percent confidence interval for the difference between the mean weekly sales obtained when type A training is used and the mean weekly sales obtained when type B training is used. Interpret this interval.
10.22 A marketing research firm wishes to compare the prices charged by two supermarket chains—Miller’s and Albert’s. The research firm, using a standardized one-week shopping plan (grocery list), makes identical purchases at 10 of each chain’s stores. The stores for each chain are randomly selected, and all purchases are made during a single week.
The shopping expenses obtained at the two chains, along with box plots of the expenses, are as follows: ShopExp

Because the stores in each sample are different stores in different chains, it is reasonable to assume that the samples are independent, and we assume that weekly expenses at each chain are normally distributed.
a Letting μM be the mean weekly expense for the shopping plan at Miller’s, and letting μA be the mean weekly expense for the shopping plan at Albert’s, Figure 10.5 gives the MINITAB output of the test of H0: μM − μA = 0 (that is, there is no difference between μM and μA) versus Ha: μM − μA≠ 0 (that is, μM and μA differ). Note that MINITAB has employed the equal variances procedure. Use the sample data to show that , sM = 1.40, , sA = 1.84, and t = 9.73.
Figure 10.5: MINITAB Output of Testing the Equality of Mean Weekly Expenses at Miller’s and Albert’s Supermarket Chains (for Exercise 10.22)

b Using the t statistic given on the output and critical values, test H0 versus Ha by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean weekly expenses at Miller’s and Albert’s differ?
c Figure 10.5 gives the p-value for testing H0: μM − μA = 0 versus Ha: μM − μA ≠ 0. Use the p-value to test H0 versus Ha by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean weekly expenses at Miller’s and Albert’s differ?
d Figure 10.5 gives a 95 percent confidence interval for μM − μA. Use this confidence interval to describe the size of the difference between the mean weekly expenses at Miller’s and Albert’s. Do you think that these means differ in a practically important way?
e Set up the null and alternative hypotheses needed to attempt to establish that the mean weekly expense for the shopping plan at Miller’s exceeds the mean weekly expense at Albert’s by more than $5. Test the hypotheses at the .10, .05, .01, and .001 levels of significance. How much evidence is there that the mean weekly expense at Miller’s exceeds that at Albert’s by more than $5?
10.23 A large discount chain compares the performance of its credit managers in Ohio and Illinois by comparing the mean dollar amounts owed by customers with delinquent charge accounts in these two states. Here a small mean dollar amount owed is desirable because it indicates that bad credit risks are not being extended large amounts of credit. Two independent, random samples of delinquent accounts are selected from the populations of delinquent accounts in Ohio and Illinois, respectively. The first sample, which consists of 10 randomly selected delinquent accounts in Ohio, gives a mean dollar amount of $524 with a standard deviation of $68. The second sample, which consists of 20 randomly selected delinquent accounts in Illinois, gives a mean dollar amount of $473 with a standard deviation of $22.
a Set up the null and alternative hypotheses needed to test whether there is a difference between the population mean dollar amounts owed by customers with delinquent charge accounts in Ohio and Illinois.
b Figure 10.6 gives the MegaStat output of using the unequal variances procedure to test the equality of mean dollar amounts owed by customers with delinquent charge accounts in Ohio and Illinois. Assuming that the normality assumption holds, test the hypotheses you set up in part a by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean dollar amounts owed in Ohio and Illinois differ?
Figure 10.6: MegaStat Output of Testing the Equality of Mean Dollar Amounts Owed for Ohio and Illinois (for Exercise 10.23)

c Assuming that the normality assumption holds, calculate a 95 percent confidence interval for the difference between the mean dollar amounts owed in Ohio and Illinois. Based on this interval, do you think that these mean dollar amounts differ in a practically important way?
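As a rough guide to how the unequal variances calculations in Exercise 10.23 can be carried out, the sketch below uses the summary statistics stated in the exercise (n = 10, mean $524, s = $68 for Ohio; n = 20, mean $473, s = $22 for Illinois). It is only an illustrative computation, not a substitute for the MegaStat output in Figure 10.6.

```python
import math
from scipy import stats

# Summary statistics from Exercise 10.23
n1, xbar1, s1 = 10, 524.0, 68.0   # Ohio delinquent accounts
n2, xbar2, s2 = 20, 473.0, 22.0   # Illinois delinquent accounts

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (xbar1 - xbar2) / se                       # test statistic for H0: mu1 - mu2 = 0

# Welch-Satterthwaite degrees of freedom, rounded down
df = (s1**2 / n1 + s2**2 / n2) ** 2 / (
    (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
)
df = math.floor(df)

p_two_sided = 2 * stats.t.sf(abs(t), df)
ci_half = stats.t.ppf(0.975, df) * se          # 95 percent CI half-width
print(t, df, p_two_sided, (xbar1 - xbar2 - ci_half, xbar1 - xbar2 + ci_half))
```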
10.24 A loan officer compares the interest rates for 48-month fixed-rate auto loans and 48-month variable-rate auto loans. Two independent, random samples of auto loan rates are selected. A sample of eight 48-month fixed-rate auto loans had the following loan rates: AutoLoan

while a sample of five 48-month variable-rate auto loans had loan rates as follows:

a Set up the null and alternative hypotheses needed to determine whether the mean rates for 48-month fixed-rate and variable-rate auto loans differ.
b Figure 10.7 gives the MegaStat output of using the equal variances procedure to test the hypotheses you set up in part a. Assuming that the normality and equal variances assumptions hold, use the MegaStat output and critical values to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean rates for 48-month fixed- and variable-rate auto loans differ?
Figure 10.7: MegaStat Output of Testing the Equality of Mean Loan Rates for Fixed and Variable 48-Month Auto Loans (for Exercise 10.24)

c Figure 10.7 gives the p-value for testing the hypotheses you set up in part a. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean rates for 48-month fixed- and variable-rate auto loans differ?
d Calculate a 95 percent confidence interval for the difference between the mean rates for fixed- and variable-rate 48-month auto loans. Can we be 95 percent confident that the difference between these means is .4 percent or more? Explain.
e Use a hypothesis test to establish that the difference between the mean rates for fixed- and variable-rate 48-month auto loans exceeds .4 percent. Use α equal to .05.
10.3: Paired Difference Experiments
EXAMPLE 10.5: The Repair Cost Comparison Case
Home State Casualty, specializing in automobile insurance, wishes to compare the repair costs of moderately damaged cars (repair costs between $700 and $1,400) at two garages. One way to study these costs would be to take two independent samples (here we arbitrarily assume that each sample is of size n = 7). First we would randomly select seven moderately damaged cars that have recently been in accidents. Each of these cars would be taken to the first garage (garage 1), and repair cost estimates would be obtained. Then we would randomly select seven different moderately damaged cars, and repair cost estimates for these cars would be obtained at the second garage (garage 2). This sampling procedure would give us independent samples because the cars taken to garage 1 differ from those taken to garage 2. However, because the repair costs for moderately damaged cars can range from $700 to $1,400, there can be substantial differences in damages to moderately damaged cars. These differences might tend to conceal any real differences between repair costs at the two garages. For example, suppose the repair cost estimates for the cars taken to garage 1 are higher than those for the cars taken to garage 2. This difference might exist because garage 1 charges customers more for repair work than does garage 2. However, the difference could also arise because the cars taken to garage 1 are more severely damaged than the cars taken to garage 2.
To overcome this difficulty, we can perform a paired difference experiment. Here we could randomly select one sample of n = 7 moderately damaged cars. The cars in this sample would be taken to both garages, and a repair cost estimate for each car would be obtained at each garage. The advantage of the paired difference experiment is that the repair cost estimates at the two garages are obtained for the same cars. Thus, any true differences in the repair cost estimates would not be concealed by possible differences in the severity of damages to the cars.

Suppose that when we perform the paired difference experiment, we obtain the repair cost estimates in Table 10.2 (these estimates are given in units of $100). To analyze these data, we calculate the difference between the repair cost estimates at the two garages for each car. The resulting paired differences are given in the last column of Table 10.2. The mean of the sample of n = 7 paired differences is
Table 10.2: A Sample of n = 7 Paired Differences of the Repair Cost Estimates at Garages 1 and 2 (Cost Estimates in Hundreds of Dollars) Repair

which equals the difference between the sample means of the repair cost estimates at the two garages

Furthermore, d̄ = −.8 (that is, −$80) is the point estimate of
μd = μ1 − μ2
the mean of the population of all possible paired differences of the repair cost estimates (for all possible moderately damaged cars) at garages 1 and 2—which is equivalent to μ1, the mean of all possible repair cost estimates at garage 1, minus μ2, the mean of all possible repair cost estimates at garage 2. This says we estimate that the mean of all possible repair cost estimates at garage 1 is $80 less than the mean of all possible repair cost estimates at garage 2.
In addition, the variance sd² and standard deviation sd of the sample of n = 7 paired differences are the point estimates of σd² and σd, the variance and standard deviation of the population of all possible paired differences.
In general, suppose we wish to compare two population means, μ1 and μ2. Also suppose that we have obtained two different measurements (for example, repair cost estimates) on the same n units (for example, cars), and suppose we have calculated the n paired differences between these measurements. Let d̄ and sd be the mean and the standard deviation of these n paired differences. If it is reasonable to assume that the paired differences have been randomly selected from a normally distributed (or at least mound-shaped) population of paired differences with mean μd and standard deviation σd, then the sampling distribution of
t = (d̄ − μd) / (sd/√n)
is a t distribution having n − 1 degrees of freedom. This implies that we have the following confidence interval for μd:
A Confidence Interval for the Mean, μd, of a Population of Paired Differences
Let μd be the mean of a normally distributed population of paired differences, and let d̄ and sd be the mean and standard deviation of a sample of n paired differences that have been randomly selected from the population. Then, a 100(1 − α) percent confidence interval for μd = μ1 − μ2 is
[d̄ ± tα/2 (sd/√n)]
Here tα/2 is based on (n − 1) degrees of freedom.
EXAMPLE 10.6: The Repair Cost Comparison Case
Using the data in Table 10.2, and assuming that the population of paired repair cost differences is normally distributed, a 95 percent confidence interval for μd = μ1 − μ2 is
[d̄ ± t.025 (sd/√n)] = [−1.2654, −.3346]
Here t.025 = 2.447 is based on n − 1 = 7 − 1 = 6 degrees of freedom. This interval says that Home State Casualty can be 95 percent confident that μd, the mean of all possible paired differences of the repair cost estimates at garages 1 and 2, is between −$126.54 and −$33.46. That is, we are 95 percent confident that μ1, the mean of all possible repair cost estimates at garage 1, is between $126.54 and $33.46 less than μ2, the mean of all possible repair cost estimates at garage 2.
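A small Python sketch of this paired difference confidence interval calculation is given below. The sample mean difference d̄ = −.8 and the critical value t.025 = 2.447 come from the text; the sample standard deviation of the differences is left as a placeholder because its value appears only in Table 10.2, which is not reproduced above.

```python
from scipy import stats

n = 7
d_bar = -0.8      # mean paired difference (hundreds of dollars), from the text
sd = 0.5          # placeholder: substitute the sample standard deviation from Table 10.2

t_crit = stats.t.ppf(0.975, n - 1)                 # 2.447 for 6 degrees of freedom
half_width = t_crit * sd / n ** 0.5
print((d_bar - half_width, d_bar + half_width))    # 95 percent CI for mu_d
```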
We can also test a hypothesis about μd, the mean of a population of paired differences. We show how to test the null hypothesis
H0: μd = D0
in the following box. Here the value of the constant D0 depends on the particular problem. Often D0 equals 0, and the null hypothesis H0: μd = 0 says that μ1 and μ2 do not differ.
Testing a Hypothesis about the Mean, μd, of a Population of Paired Differences
Let μd, d̄, and sd be defined as in the preceding box. Also, assume that the population of paired differences is normally distributed, and consider testing
H0: μd = D0
by using the test statistic
t = (d̄ − D0) / (sd/√n)
We can test H0: μd = D0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value.
Here tα, tα/2, and the p-values are based on n − 1 degrees of freedom.
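If the raw paired observations are available, the test in the box above can also be run directly in Python with scipy, as in the following sketch. The cost estimates here are invented stand-ins (in hundreds of dollars), not the values in Table 10.2.

```python
import numpy as np
from scipy import stats

# Invented paired data: repair estimates for the same cars at two garages
garage1 = np.array([7.1, 9.0, 11.0, 8.9, 9.9, 9.1, 10.3])
garage2 = np.array([7.9, 10.1, 11.4, 9.8, 10.4, 9.8, 11.0])

d = garage1 - garage2
t_stat, p_two_sided = stats.ttest_rel(garage1, garage2)   # equivalent to a one-sample t test on d

# Convert the two-sided p-value to a one-sided p-value for Ha: mu_d < 0
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(d.mean(), t_stat, p_one_sided)
```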
EXAMPLE 10.7: The Repair Cost Comparison Case
Home State Casualty currently contracts to have moderately damaged cars repaired at garage 2. However, a local insurance agent suggests that garage 1 provides less expensive repair service that is of equal quality. Because it has done business with garage 2 for years, Home State has decided to give some of its repair business to garage 1 only if it has very strong evidence that μ1, the mean repair cost estimate at garage 1, is smaller than μ2, the mean repair cost estimate at garage 2—that is, if μd = μ1 − μ2 is less than zero. Therefore, we will test
H0 : μd = 0 or, equivalently, H0: μ1 − μ2 = 0, versus Ha: μd < 0 or, equivalently, Ha: μ1 − μ2 < 0, at the .01 level of significance. To perform the hypothesis test, we will use the sample data in Table 10.2 to calculate the value of the test statistic t in the summary box. Because Ha: μd < 0 is of the form Ha: μd < D0, we will reject H0: μd = 0 if the value of t is less than − tα = −t.01 = −3.143. Here the tα point is based on n − 1 = 7 − 1 = 6 degrees of freedom. Using the data in Table 10.2, the value of the test statistic is Because t = −4.2053 is less than −t.01 = −3.143, we can reject H0: μd = 0 in favor of Ha: μd < 0. We conclude (at an α of .01) that μ1, the mean repair cost estimate at garage 1, is less than μ2, the mean repair cost estimate at garage 2. As a result, Home State will give some of its repair business to garage 1. Furthermore, Figure 10.8(a), which gives the MINITAB output of this hypothesis test, shows us that the p-value for the test is .003. Since this p-value is very small, we have very strong evidence that H0 should be rejected and that μ1 is less than μ2. To demonstrate testing a “not equal to” alternative hypothesis, Figure 10.8(b) gives the MegaStat output of testing H0: μd = 0 versus Ha: μd ≠ 0. The output shows thatthe p-valuefor this two-tailed test is .0057. MegaStat will, of course, also perform the test of H0: μd = 0 versus Ha: μd < 0. For this test (output not shown), MegaStat finds that the p-value is .0028 (or .003 rounded). Finally, Figure 10.9 gives the Excel output of both the one- and two-tailed tests. The small p-value related to the one-tailed test tells us that Home State has very strong evidence that the mean repair cost at garage 1 is less than the mean repair cost at garage 2. Figure 10.8: MINITAB and MegaStat Outputs of Testing H0: μd = 0 Figure 10.9: Excel Output of Testing H0: μd = 0 In general, an experiment in which we have obtained two different measurements on the same n units is called a paired difference experiment. The idea of this type of experiment is to remove the variability due to the variable (for example, the amount of damage to a car) on which the observations are paired. In many situations, a paired difference experiment will provide more information than an independent samples experiment. As another example, suppose that we wish to assess which of two different machines produces a higher hourly output. If we randomly select 10 machine operators and randomly assign 5 of these operators to test machine 1 and the others to test machine 2, we would be performing an independent samples experiment. This is because different machine operators test machines 1 and 2. However, any difference in machine outputs could be obscured by differences in the abilities of the machine operators. For instance, if the observed hourly outputs are higher for machine 1 than for machine 2, we might not be able to tell whether this is due to (1) the superiority of machine 1 or (2) the possible higher skill level of the operators who tested machine 1. Because of this, it might be better to randomly select five machine operators, thoroughly train each operator to use both machines, and have each operator test both machines. We would then be pairing on the machine operator, and this would remove the variability due to the differing abilities of the operators. The formulas we have given for analyzing a paired difference experiment are based on the t distribution. 
These formulas assume that the population of all possible paired differences is normally distributed (or at least mound-shaped). If the sample size is large (say, at least 30), the t based interval and tests of this section are approximately valid no matter what the shape of the population of all possible paired differences. If the sample size is small, and if we fear that the population of all paired differences might be far from normally distributed, we can use a nonparametric method. One nonparametric method for comparing two populations when using a paired difference experiment is the Wilcoxon signed ranks test, discussed in Section 18.3. Exercises for Section 10.3 CONCEPTS 10.25 Explain how a paired difference experiment differs from an independent samples experiment in terms of how the data for these experiments are collected. 10.26 Why is a paired difference experiment sometimes more informative than an independent samples experiment? Give an example of a situation in which a paired difference experiment might be advantageous. 10.27 What assumptions must be satisfied to appropriately carry out a paired difference experiment? When can we carry out a paired difference experiment no matter what the shape of the population of all paired differences might be? 10.28 Suppose a company wishes to compare the hourly output of its employees before and after vacations. Explain how you would collect data for a paired difference experiment to make this comparison. METHODS AND APPLICATIONS 10.29 Suppose a sample of 11 paired differences that has been randomly selected from a normally distributed population of paired differences yields a sample mean of = 103.5 and a samplestandard deviation of sd = 5. a Calculate 95 percent and 99 percent confidence intervals for μd = μ1 − μ2. Can we be 95 per cent confident that the difference between μ1 and μ2 exceeds 100? Can we be 99 percent confident? b Test the null hypothesis H0: μd ≤ 100 versus Ha: μd > 100 by setting α equal to .05 and .01. How much evidence is there that μd = μ1 − μ2 exceeds 100?
c Test the null hypothesis H0: μd ≥ 110 versus Ha: μd < 110 by setting α equal to .05 and .01. How much evidence is there that μd = μ1 − μ2 is less than 110? 10.30 Suppose a sample of 49 paired differences that have been randomly selected from a normally distributed population of paired differences yields a sample mean of = 5 and a sample standard deviation of sd = 7. a Calculate a 95 percent confidence interval for μd = μ1 − μ2. Can we be 95 percent confident that the difference between μ1 and μ2 is greater than 0? b Test the null hypothesis H0: μd = 0 versus the alternative hypothesis Ha: μd ≠ 0 by setting α equal to .10, .05, .01, and .001. How much evidence is there that μd differs from 0? What does this say about how μ1 and μ2 compare? c The p-value for testing H0: μd ≤ 3 versus Ha: μd > 3 equals .0256. Use the p-value to test these hypotheses with α equal to .10, .05, .01, and .001. How much evidence is there that μd exceeds 3? What does this say about the size of the difference between μ1 and μ2?
10.31 On its website, the Statesman Journal newspaper (Salem, Oregon, 1999) reports mortgage loan interest rates for 30-year and 15-year fixed-rate mortgage loans for a number of Willamette Valley lending institutions. Of interest is whether there is any systematic difference between 30-year rates and 15-year rates (expressed as annual percentage rate or APR) and, if there is, what is the size of that difference. Table 10.3 displays mortgage loan rates and the difference between 30-year and 15-year rates for nine randomly selected lending institutions. Assuming that the population of paired differences is normally distributed: Mortgage99
Table 10.3: 1999 Mortgage Loan Interest Rates for Nine Randomly Selected Willamette Valley Lending Institutions Mortgage99

a Set up the null and alternative hypotheses needed to determine whether there is a difference between mean 30-year rates and mean 15-year rates.
b Figure 10.10 gives the MINITAB output for testing the hypotheses that you set up in part a. Use the output and critical values to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that mean mortgage loan rates for 30-year and 15-year terms differ?
Figure 10.10: MINITAB Paired Difference t Test of the Mortgage Loan Rate Data (for Exercise 10.31)

c Figure 10.10 gives the p-value for testing the hypotheses that you set up in part a. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that mean mortgage loan rates for 30-year and 15-year terms differ?
d Calculate a 95 percent confidence interval for the difference between mean mortgage loan rates for 30-year rates versus 15-year rates. Interpret this interval.
10.32 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil H. Firtle (1993) present preexposure and postexposure attitude scores from an advertising study involving 10 respondents. The data for the experiment are given in Table 10.4. Assuming that the differences between pairs of postexposure and preexposure scores are normally distributed: AdStudy
Table 10.4: Preexposure and Postexposure Attitude Scores (for Exercise 10.32) AdStudy

a Set up the null and alternative hypotheses needed to attempt to establish that the advertisement increases the mean attitude score (that is, that the mean postexposure attitude score is higher than the mean preexposure attitude score).
b Test the hypotheses you set up in part a at the .10, .05, .01, and .001 levels of significance. How much evidence is there that the advertisement increases the mean attitude score?
c Estimate the minimum difference between the mean postexposure attitude score and the mean preexposure attitude score. Justify your answer.
10.33 National Paper Company must purchase a new machine for producing cardboard boxes. The company must choose between two machines. The machines produce boxes of equal quality, so the company will choose the machine that produces (on average) the most boxes. It is known that there are substantial differences in the abilities of the company’s machine operators. Therefore National Paper has decided to compare the machines using a paired difference experiment. Suppose that eight randomly selected machine operators produce boxes for one hour using machine 1 and for one hour using machine 2, with the following results: BoxYield

a Assuming normality, perform a hypothesis test to determine whether there is a difference between the mean hourly outputs of the two machines. Use α = .05.
b Estimate the minimum and maximum differences between the mean outputs of the two machines. Justify your answer.
10.34 During 2004 a company implemented a number of policies aimed at reducing the ages of its customers’ accounts. In order to assess the effectiveness of these measures, the company randomly selects 10 customer accounts. The average age of each account is determined for the years 2003 and 2004. These data are given in Table 10.5. Assuming that the population of paired differences between the average ages in 2004 and 2003 is normally distributed: AcctAge
Table 10.5: Average Account Ages in 2003 and 2004 for 10 Randomly Selected Accounts (for Exercise 10.34) AcctAge

a Set up the null and alternative hypotheses needed to establish that the mean average account age has been reduced by the company’s new policies.
b Figure 10.11 gives the MegaStat and Excel outputs needed to test the hypotheses of part a. Use critical values to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean average account age has been reduced?
Figure 10.11: MegaStat and Excel Outputs of a Paired Difference Analysis of the Account Age Data (for Exercise 10.34)

c Figure 10.11 gives the p-value for testing the hypotheses of part a. Use the p-value to test these hypotheses by setting α equal to .10, .05, .01, and .001. How much evidence is there that the mean average account age has been reduced?
d Calculate a 95 percent confidence interval for the mean difference in the average account ages between 2004 and 2003. Estimate the minimum reduction in the mean average account ages from 2003 to 2004.
10.35 Do students reduce study time in classes where they achieve a higher midterm score? In a Journal of Economic Education article (Winter 2005), Gregory Krohn and Catherine O’Connor studied student effort and performance in a class over a semester. In an intermediate macroeconomics course, they found that “students respond to higher midterm scores by reducing the number of hours they subsequently allocate to studying for the course.”5 Suppose that a random sample of n = 8 students who performed well on the midterm exam was taken and weekly study time before and after the exam were compared. The resulting data are given in Table 10.6. Assume that the population of all possible paired differences is normally distributed.
Table 10.6: Weekly Study Time Data for Students Who Perform Well on the MidTerm StudyTime

a Set up the null and alternative hypotheses to test whether there is a difference in the true mean study time before and after the midterm exam.
b Below we present the MINITAB output for the paired differences test. Use the output and critical values to test the hypotheses at the .10, .05 and .01 levels of significance. Has the true mean study time changed?

c Use the p-value to test the hypotheses at the .10, .05, and .01 levels of significance. How much evidence is there against the null hypothesis?
10.4: Comparing Two Population Proportions by Using Large, Independent Samples
EXAMPLE 10.8: The Advertising Media Case
Suppose a new product was test marketed in the Des Moines, Iowa, and Toledo, Ohio, metropolitan areas. Equal amounts of money were spent on advertising in the two areas. However, different advertising media were employed in the two areas. Advertising in the Des Moines area was done entirely on television, while advertising in the Toledo area consisted of a mixture of television, radio, newspaper, and magazine ads. Two months after the advertising campaigns commenced, surveys are taken to estimate consumer awareness of the product. In the Des Moines area, 631 out of 1,000 randomly selected consumers are aware of the product, whereas in the Toledo area 798 out of 1,000 randomly selected consumers are aware of the product. We define p1 to be the true proportion of consumers in the Des Moines area who are aware of the product and p2 to be the true proportion of consumers in the Toledo area who are aware of the product. It follows that, since the sample proportions of consumers who are aware of the product in the Des Moines and Toledo areas are
p̂1 = 631/1,000 = .631
and
p̂2 = 798/1,000 = .798
then a point estimate of p1 − p2 is
p̂1 − p̂2 = .631 − .798 = −.167
This says we estimate that p1 is .167 less than p2. That is, we estimate that the percentage of consumers who are aware of the product in the Toledo area is 16.7 percentage points higher than the percentage in the Des Moines area.
In order to find a confidence interval for p1 − p2 and to carry out a hypothesis test about p1 − p2, we need to know the properties of the sampling distribution of p̂1 − p̂2. In general, therefore, consider randomly selecting n1 units from a population, and assume that a proportion p1 of all the units in the population fall into a particular category. Let p̂1 denote the proportion of units in the sample that fall into the category. Also, consider randomly selecting a sample of n2 units from a second population, and assume that a proportion p2 of all the units in this population fall into the particular category. Let p̂2 denote the proportion of units in the second sample that fall into the category.
The Sampling Distribution of p̂1 − p̂2
If the randomly selected samples are independent of each other, then the population of all possible values of p̂1 − p̂2:
1 Approximately has a normal distribution if each of the sample sizes n1 and n2 is large. Here n1 and n2 are large enough if n1p1, n1(1 − p1), n2p2, and n2(1 − p2) are all at least 5.
2 Has mean p1 − p2
3 Has standard deviation √(p1(1 − p1)/n1 + p2(1 − p2)/n2)
If we estimate p1 by p̂1 and p2 by p̂2 in the expression for this standard deviation, then the sampling distribution of p̂1 − p̂2 implies the following 100(1 − α) percent confidence interval for p1 − p2.
A Large Sample Confidence Interval for the Difference between Two Population Proportions6
Suppose we randomly select a sample of size n1 from a population, and let p̂1 denote the proportion of units in this sample that fall into a category of interest. Also suppose we randomly select a sample of size n2 from another population, and let p̂2 denote the proportion of units in this second sample that fall into the category of interest. Then, if each of the sample sizes n1 and n2 is large (n1p̂1, n1(1 − p̂1), n2p̂2, and n2(1 − p̂2) must all be at least 5), and if the random samples are independent of each other, a 100(1 − α) percent confidence interval for p1 − p2 is
[(p̂1 − p̂2) ± zα/2 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)]
EXAMPLE 10.9: The Advertising Media Case
Recall that in the advertising media situation described at the beginning of this section, 631 of 1,000 randomly selected consumers in Des Moines are aware of the new product, while 798 of 1,000 randomly selected consumers in Toledo are aware of the new product. Also recall that p̂1 = 631/1,000 = .631 and p̂2 = 798/1,000 = .798. Because n1p̂1 = 631, n1(1 − p̂1) = 369, n2p̂2 = 798, and n2(1 − p̂2) = 202 are all at least 5, both n1 and n2 can be considered large. It follows that a 95 percent confidence interval for p1 − p2 is
(p̂1 − p̂2) ± z.025 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) = (.631 − .798) ± 1.96 √(.631(.369)/1,000 + .798(.202)/1,000) = −.167 ± .0389 = [−.2059, −.1281]
This interval says we are 95 percent confident that p1, the proportion of all consumers in the Des Moines area who are aware of the product, is between .2059 and .1281 less than p2, the proportion of all consumers in the Toledo area who are aware of the product. Thus, we have substantial evidence that advertising the new product by using a mixture of television, radio, newspaper, and magazine ads (as in Toledo) is more effective than spending an equal amount of money on television commercials only.
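The confidence interval just given can be verified with a few lines of Python; the sketch below uses the sample counts from the example (631 of 1,000 in Des Moines, 798 of 1,000 in Toledo) and the large-sample formula from the box above.

```python
import math

n1, x1 = 1000, 631     # Des Moines: sample size, number aware of the product
n2, x2 = 1000, 798     # Toledo: sample size, number aware of the product

p1_hat, p2_hat = x1 / n1, x2 / n2
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z = 1.96               # z.025 for a 95 percent interval

diff = p1_hat - p2_hat
print((round(diff - z * se, 4), round(diff + z * se, 4)))   # approximately [-.2059, -.1281]
```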

An alternative formula for a 100(1 − α) percent confidence interval for p1 − p2 can also be used. Because both n1 and n2 are large, there is little difference between the interval obtained by using that alternative formula and the interval obtained by using the formula in the box above.
To test the null hypothesis H0: p1 − p2 = D0, we use the test statistic
z = [(p̂1 − p̂2) − D0] / (estimated standard deviation of p̂1 − p̂2)
A commonly employed special case of this hypothesis test is obtained by setting D0 equal to 0. In this case, the null hypothesis H0: p1 − p2 = 0 says there is no difference between the population proportions p1 and p2. When D0 = 0, the best estimate of the common population proportion p = p1 = p2 is obtained by computing
p̂ = (n1p̂1 + n2p̂2) / (n1 + n2)
Therefore, the point estimate of the standard deviation of p̂1 − p̂2 is
√(p̂(1 − p̂)(1/n1 + 1/n2))
For the case where D0 ≠ 0, the point estimate of the standard deviation of p̂1 − p̂2 is obtained by estimating p1 by p̂1 and p2 by p̂2. With these facts in mind, we present the following procedure for testing H0: p1 − p2 = D0:
A Hypothesis Test about the Difference between Two Population Proportions
Let p̂ be as just defined, and let p̂1, p̂2, n1, and n2 be as defined in the preceding box. Furthermore, define the test statistic
z = [(p̂1 − p̂2) − D0] / (estimated standard deviation of p̂1 − p̂2)
and assume that each of the sample sizes n1 and n2 is large. Then, if the samples are independent of each other, we can test H0: p1 − p2 = D0 versus a particular alternative hypothesis at level of significance α by using the appropriate critical value rule, or, equivalently, the corresponding p-value.
Note:
1 If D0 = 0, we estimate the standard deviation of p̂1 − p̂2 by
√(p̂(1 − p̂)(1/n1 + 1/n2))
2 If D0 ≠ 0, we estimate the standard deviation of p̂1 − p̂2 by
√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
EXAMPLE 10.10: The Advertising Media Case
Recall that p1 is the proportion of all consumers in the Des Moines area who are aware of the new product and that p2 is the proportion of all consumers in the Toledo area who are aware of the new product. To test for the equality of these proportions, we will test
H0: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0 at the .05 level of significance. Because both the Des Moines and Toledo samples are large (see Example 10.9), we will calculate the value of the test statistic z in the summary box (where D0 = 0). Since Ha: p1 − p2 ≠ 0 is of the form Ha: p1 − p2 ≠ D0, we will reject H0: p1 − p2 = 0 if the absolute value of z is greater than zα/2 = z.025 = 1.96. Because 631 out of 1,000 randomly selected Des Moines residents were aware of the product and 798 out of 1,000 randomly selected Toledo residents were aware of the product, the estimate of p = p1 = p2 is
p̂ = (631 + 798) / (1,000 + 1,000) = 1,429/2,000 = .7145
and the value of the test statistic is
z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)) = (.631 − .798) / √(.7145(.2855)(1/1,000 + 1/1,000)) = −8.2673
Because | z | = 8.2673 is greater than 1.96, we can reject H0: p1 − p2 = 0 in favor of Ha: p1 − p2 ≠ 0. We conclude (at an α of .05) that the proportions of consumers who are aware of the product in Des Moines and Toledo differ. Furthermore, the point estimate p̂1 − p̂2 = .631 − .798 = −.167 says we estimate that the percentage of consumers who are aware of the product in Toledo is 16.7 percentage points higher than the percentage of consumers who are aware of the product in Des Moines. The p-value for this test is twice the area under the standard normal curve to the right of | z | = 8.2673. Since the area under the standard normal curve to the right of 3.99 is .00003, the p-value for testing H0 is less than 2(.00003) = .00006. It follows that we have extremely strong evidence that H0: p1 − p2 = 0 should be rejected in favor of Ha: p1 − p2 ≠ 0. That is, this small p-value provides extremely strong evidence that p1 and p2 differ. Figure 10.12 presents the MegaStat output of the hypothesis test of H0: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0 and of a 95 percent confidence interval for p1 − p2. A MINITAB output of the test and confidence interval is given in Appendix 10.1 on pages 435–436.
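To reproduce the z statistic and p-value just described, a short Python sketch using the same counts is shown below; apart from rounding, it should agree with the MegaStat output in Figure 10.12.

```python
import math
from scipy import stats

n1, x1 = 1000, 631     # Des Moines
n2, x2 = 1000, 798     # Toledo

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)                            # .7145
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se                                  # about -8.27
p_two_sided = 2 * stats.norm.sf(abs(z))
print(z, p_two_sided)
```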
Figure 10.12: MegaStat Output of Statistical Inference in the Advertising Media Case

Exercises for Section 10.4
CONCEPTS

10.36 Explain what population is described by the sampling distribution of p̂1 − p̂2.
10.37 What assumptions must be satisfied in order to use the methods presented in this section?
METHODS AND APPLICATIONS
In Exercises 10.38 through 10.40 we assume that we have selected two independent random samples from populations having proportions p1 and p2 and that p̂1 = 800/1,000 = .8 and p̂2 = 950/1,000 = .95.
10.38 Calculate a 95 percent confidence interval for p1 − p2. Interpret this interval. Can we be 95 percent confident that p1 − p2 is less than 0? That is, can we be 95 percent confident that p1 is less than p2? Explain.
10.39 Test H0: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0 by using critical values and by setting α equal to .10, .05, .01, and .001. How much evidence is there that p1 and p2 differ? Explain. Hint: z.0005 = 3.29.
10.40 Test H0: p1 − p2 ≥ −.12 versus Ha: p1 − p2 < −.12 by using a p-value and by setting α equal to .10, .05, .01, and .001. How much evidence is there that p2 exceeds p1 by more than .12? Explain. 10.41 In an article in the Journal of Advertising, Weinberger and Spotts compare the use of humor in television ads in the United States and in the United Kingdom. Suppose that independent random samples of television ads are taken in the two countries. A random sample of 400 television ads in the United Kingdom reveals that 142 use humor, while a random sample of 500 television ads in the United States reveals that 122 use humor. a Set up the null and alternative hypotheses needed to determine whether the proportion of ads using humor in the United Kingdom differs from the proportion of ads using humor in the United States. b Test the hypotheses you set up in part a by using critical values and by setting α equal to .10, .05, .01, and .001. How much evidence is there that the proportions of U.K. and U.S. ads using humor are different? c Set up the hypotheses needed to attempt to establish that the difference between the proportions of U.K. and U.S. ads using humor is more than .05 (five percentage points). Test these hypotheses by using a p-value and by setting α equal to .10, .05, .01, and .001. How much evidence is there that the difference between the proportions exceeds .05? d Calculate a 95 percent confidence interval for the difference between the proportion of U.K. ads using humor and the proportion of U.S. ads using humor. Interpret this interval. Can we be 95 percent confident that the proportion of U.K. ads using humor is greater than the proportion of U.S. ads using humor? 10.42 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil H. Firtle discuss a research proposal in which a telephone company wants to determine whether the appeal of a new security system varies between homeowners and renters. Independent samples of 140 homeowners and 60 renters are randomly selected. Each respondent views a TV pilot in which a test ad for the new security system is embedded twice. Afterward, each respondent is interviewed to find out whether he or she would purchase the security system. Results show that 25 out of the 140 homeowners definitely would buy the security system, while 9 out of the 60 renters definitely would buy the system. a Letting p1 be the proportion of homeowners who would buy the security system, and letting p2 be the proportion of renters who would buy the security system, set up the null and alternative hypotheses needed to determine whether the proportion of homeowners who would buy the security system differs from the proportion of renters who would buy the security system. b Find the test statistic z and the p-value for testing the hypotheses of part a. Use the p-value to test the hypotheses with α equal to .10, .05, .01, and .001. How much evidence is there that the proportions of homeowners and renters differ? c Calculate a 95 percent confidence interval for the difference between the proportions of homeowners and renters who would buy the security system. On the basis of this interval, can we be 95 percent confident that these proportions differ? Explain. Note: A MegaStat output of the hypothesis test and confidence interval in parts b and c is given in Appendix 10.3 on page 439. 
10.43 In the book Cases in Finance, Nunnally and Plath (1995) present a case in which the estimated percentage of uncollectible accounts varies with the age of the account. Here the age of an unpaid account is the number of days elapsed since the invoice date. An accountant believes that the percentage of accounts that will be uncollectible increases as the ages of the accounts increase. To test this theory, the accountant randomly selects independent samples of 500 accounts with ages between 31 and 60 days and 500 accounts with ages between 61 and 90 days from the accounts receivable ledger dated one year ago. When the sampled accounts are examined, it is found that 10 of the 500 accounts with ages between 31 and 60 days were eventually classified as uncollectible, while 27 of the 500 accounts with ages between 61 and 90 days were eventually classified as uncollectible. Let p1 be the proportion of accounts with ages between 31 and 60 days that will be uncollectible, and let p2 be the proportion of accounts with ages between 61 and 90 days that will be uncollectible. Use the MINITAB output below to determine how much evidence there is that we should reject H0: p1 − p2 = 0 in favor of Ha: p1 − p2 ≠ 0. Also, identify a 95 percent confidence interval for p1 − p2, and estimate the smallest that the difference between p1 and p2 might be. 10.44 On January 7, 2000, the Gallup Organization released the results of a poll comparing the lifestyles of today with yesteryear. The survey results were based on telephone interviews with a randomly selected national sample of 1,031 adults, 18 years and older, conducted December 20–21, 1999. The poll asked several questions and compared the 1999 responses with the responses given in polls taken in previous years. Below we summarize some of the poll’s results.7 Percentage of respondents who Assuming that each poll was based on a randomly selected national sample of 1,031 adults and that the samples in different years are independent: a Let p1 be the December 1999 population proportion of U.S. adults who had taken a vacation lasting six days or more within the last 12 months, and let p2 be the December 1968 population proportion who had taken such a vacation. Calculate a 99 percent confidence interval for the difference between p1 and p2. Interpret what this interval says about how these population proportions differ. b Let p1 be the December 1999 population proportion of U.S. adults who took part in some sort of daily activity to keep physically fit, and let p2 be the September 1977 population proportion who did the same. Carry out a hypothesis test to attempt to justify that the proportion who took part in such daily activity increased from September 1977 to December 1999. Use α = .05 and explain your result. c Let p1 be the December 1999 population proportion of U.S. adults who watched TV more than four hours on an average weekday, and let p2 be the April 1981 population proportion who did the same. Carry out a hypothesis test to determine whether these population proportions differ. Use α = .05 and interpret the result of your test. d Let p1 be the December 1999 population proportion of U.S. adults who drove a car or truck to work, and let p2 be the April 1971 population proportion who did the same. Calculate a 95 percent confidence interval for the difference between p1 and p2. On the basis of this interval, can it be concluded that the 1999 and 1971 population proportions differ? 10.45 In the book International Marketing, Philip R. 
Cateora reports the results of an MTV-commissioned study of the lifestyles and spending habits of the 14–34 age group in six countries. The survey results are given in Table 10.7. PurchPct
Table 10.7: Results of an MTV-Commissioned Survey of the Lifestyles and Spending Habits of the 14–34 Age Group in Six Countries PurchPct
a As shown in Table 10.7, 10 percent of the 14- to 34-year-olds surveyed in the United States had purchased soft drinks in the last three months, while 90 percent of the 14- to 34-year-olds surveyed in Australia had done the same. Assuming that these results were obtained from independent random samples of 500 respondents in each country, carry out a hypothesis test that tests the equality of the population proportions of 14- to 34-year-olds in the United States and in Australia who have purchased soft drinks in the last three months. Also, calculate a 95 percent confidence interval for the difference between these two population proportions, and use this interval to estimate the largest and smallest values that the difference between these proportions might be. Based on your confidence interval, do you feel that this result has practical importance?
b Again as shown in Table 10.7, percent of the 14- to 34-year-olds surveyed in Australia had purchased athletic footwear in the last three months, while 54 percent of the 14- to 34-year-olds surveyed in Brazil had done the same. Assuming that these results were obtained from independent random samples of 500 respondents in each country, carry out a hypothesis test that tests the equality of the population proportions of 14- to 34-year-olds in Australia and in Brazil who have purchased athletic footwear in the last three months. Also, calculate a 95 percent confidence interval for the difference between these two population proportions, and use this interval to estimate the largest and smallest values that the difference between these proportions might be. Based on your confidence interval, do you feel that this result has practical importance?
10.5: Comparing Two Population Variances by Using Independent Samples
We have seen (in Sections 10.1 and 10.2) that we often wish to compare two population means. In addition, it is often useful to compare two population variances. For example, in the bank waiting time situation of Example 10.1, we might compare the variance of the waiting times experienced under the current and new systems. Or, as another example, we might wish to compare the variance of the chemical yields obtained when using Catalyst XA-100 with that obtained when using Catalyst ZB-200. Here the catalyst that produces yields with the smaller variance is giving more consistent (or predictable) results. If σ1² and σ2² are the population variances that we wish to compare, one approach is to test the null hypothesis H0: σ1² = σ2². We might test H0 versus an alternative hypothesis of, for instance, Ha: σ1² > σ2². Dividing by σ2², we see that testing these hypotheses is equivalent to testing H0: σ1²/σ2² = 1 versus Ha: σ1²/σ2² > 1. Intuitively, we would reject H0 in favor of Ha if s1²/s2² is significantly larger than 1. Here s1² is the variance of a random sample of n1 observations from the population with variance σ1², and s2² is the variance of a random sample of n2 observations from the population with variance σ2². To decide exactly how large s1²/s2² must be in order to reject H0, we need to consider the sampling distribution of s1²/s2².8 It can be shown that, if the null hypothesis H0: σ1² = σ2² is true, then the population of all possible values of s1²/s2² is described by what is called an F distribution.
In general, as illustrated in Figure 10.13, the curve of the F distribution is skewed to the right. Moreover, the exact shape of this curve depends on two parameters that are called the numerator degrees of freedom (denoted df1) and the denominator degrees of freedom (denoted df2). The values of df1 and df2 that describe the sampling distribution of s1²/s2² are given in the following result:
Figure 10.13: F Distribution Curves and F Points
The Sampling Distribution of s1²/s2²
Suppose we randomly select independent samples from two normally distributed populations having variances σ1² and σ2². Then, if the null hypothesis H0: σ1² = σ2² is true, the population of all possible values of s1²/s2² has an F distribution with df1 = (n1 − 1) numerator degrees of freedom and with df2 = (n2 − 1) denominator degrees of freedom.
In order to use the F distribution, we employ an F point, which is denoted Fα. As illustrated in Figure 10.13(a), Fα is the point on the horizontal axis under the curve of the F distribution that gives a right-hand tail area equal to α. The value of Fα in a particular situation depends on the size of the right-hand tail area (the size of α) and on the numerator degrees of freedom (df1) and the denominator degrees of freedom (df2). Values of Fα are given in an F table. Tables A.5, A.6, A.7, and A.8 (pages 866–869) give values of F.10, F.05, F.025, and F.01, respectively. Each table tabulates values of Fα according to the appropriate numerator degrees of freedom (values listed across the top of the table) and the appropriate denominator degrees of freedom (values listed down the left side of the table). A portion of Table A.6, which gives values of F.05, is reproduced in this chapter as Table 10.8. For instance, suppose we wish to find the F point that gives a right-hand tail area of .05 under the curve of the F distribution having 4 numerator and 7 denominator degrees of freedom. To do this, we scan across the top of Table 10.8 until we find the column corresponding to 4 numerator degrees of freedom, and we scan down the left side of the table until we find the row corresponding to 7 denominator degrees of freedom. The table entry in this column and row is the desired F point. We find that the F.05 point is 4.12 [see Figure 10.13(b)].
Table 10.8: A Portion of an F Table: Values of F.05
We now present the procedure for testing the equality of two population variances when the alternative hypothesis is one-tailed.
Testing the Equality of Population Variances versus a One-Tailed Alternative Hypothesis
Suppose we randomly select independent samples from two normally distributed populations—populations 1 and 2. Let s1² be the variance of the random sample of n1 observations from population 1, and let s2² be the variance of the random sample of n2 observations from population 2.
1 In order to test H0: σ1² = σ2² versus Ha: σ1² > σ2², define the test statistic F = s1²/s2² and define the corresponding p-value to be the area to the right of F under the curve of the F distribution having df1 = n1 − 1 numerator degrees of freedom and df2 = n2 − 1 denominator degrees of freedom. We can reject H0 at level of significance α if and only if a F > Fα or, equivalently,
b p-value < α. Here Fα is based on df1 = n1 − 1 and df2 = n2 − 1 degrees of freedom.
2 In order to test H0: σ1² = σ2² versus Ha: σ1² < σ2² (that is, Ha: σ2² > σ1²), define the test statistic F = s2²/s1² and define the corresponding p-value to be the area to the right of F under the curve of the F distribution having df1 = n2 − 1 numerator degrees of freedom and df2 = n1 − 1 denominator degrees of freedom. We can reject H0 at level of significance α if and only if a F > Fα or, equivalently,
b p-value < α. Here Fα is based on df1 = n2 − 1 and df2 = n1 − 1 degrees of freedom.
EXAMPLE 10.11: The Catalyst Comparison Case
Again consider the catalyst comparison situation of Example 10.3, and suppose the production supervisor wishes to use the sample data in Table 10.1 to determine whether σ1², the variance of the chemical yields obtained by using Catalyst XA-100, is smaller than σ2², the variance of the chemical yields obtained by using Catalyst ZB-200. To do this, the supervisor will test the null hypothesis H0: σ1² = σ2², which says the catalysts produce yields having the same amount of variability, versus the alternative hypothesis Ha: σ1² < σ2², which says Catalyst XA-100 produces yields that are less variable (that is, more consistent) than the yields produced by Catalyst ZB-200. Recall from Table 10.1 that n1 = n2 = 5, s1² = 386, and s2² = 484.2. In order to test H0 versus Ha, we compute the test statistic F = s2²/s1² = 484.2/386 = 1.2544, and we compare this value with Fα based on df1 = n2 − 1 = 5 − 1 = 4 numerator degrees of freedom and df2 = n1 − 1 = 5 − 1 = 4 denominator degrees of freedom. If we test H0 versus Ha at the .05 level of significance, then Table 10.8 tells us that when df1 = 4 and df2 = 4, we have F.05 = 6.39. Because F = 1.2544 is not greater than F.05 = 6.39, we cannot reject H0 at the .05 level of significance. That is, at the .05 level of significance we cannot conclude that σ1² is less than σ2². This says that there is little evidence that Catalyst XA-100 produces yields that are more consistent than the yields produced by Catalyst ZB-200. The p-value for testing H0 versus Ha is the area to the right of F = 1.2544 under the curve of the F distribution having 4 numerator degrees of freedom and 4 denominator degrees of freedom. The Excel output in Figure 10.14(a) tells us that this p-value equals 0.415724. Since this p-value is large, we have little evidence to support rejecting H0 in favor of Ha. That is, there is little evidence that Catalyst XA-100 produces yields that are more consistent than the yields produced by Catalyst ZB-200.
Figure 10.14: Excel and MINITAB Outputs for Testing H0: σ1² = σ2² in the Catalyst Comparison Case
Again considering the catalyst comparison case, suppose we wish to test H0: σ1² = σ2² versus Ha: σ1² ≠ σ2². One way to carry out this test is to compute F = s1²/s2² = 386/484.2 = .797. As illustrated in Figure 10.15, if we set α = .10, we compare F with the rejection points F.95 and F.05 under the curve of the F distribution having n1 − 1 = 4 numerator and n2 − 1 = 4 denominator degrees of freedom. We see that we can easily find the appropriate upper-tail rejection point to be F.05 = 6.39. In order to find the lower-tail rejection point, F.95, we use the following relationship:
Figure 10.15: Rejection Points for Testing H0: σ1² = σ2² versus Ha: σ1² ≠ σ2²
F(1 − α) based on df1 numerator and df2 denominator degrees of freedom equals the reciprocal of Fα based on df2 numerator and df1 denominator degrees of freedom.
This says that for the F curve with 4 numerator and 4 denominator degrees of freedom, F(1 − .05) = F.95 = 1/F.05 = 1/6.39 = .1565. Therefore, because F = .797 is not greater than F.05 = 6.39 and since F = .797 is not less than F.95 = .1565, we cannot reject H0 in favor of Ha at the .10 level of significance. Although we can calculate the lower-tail rejection point for this hypothesis test as just illustrated, it is common practice to compute the test statistic F so that its value is always greater than 1. This means that we will always compare F with the upper-tail rejection point when carrying out the test. This can be done by always calculating F to be the larger of s1² and s2² divided by the smaller of s1² and s2².
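The F point and p-value quoted above can be reproduced with a few lines of Python. The sketch below is an illustrative check using scipy's F distribution; it is not the Excel or MINITAB output shown in Figure 10.14.

from scipy.stats import f

# F.05 point with 4 numerator and 7 denominator degrees of freedom (Table 10.8 gives 4.12)
print(f.ppf(0.95, dfn=4, dfd=7))

# Example 10.11: test H0: sigma1^2 = sigma2^2 versus Ha: sigma1^2 < sigma2^2
s1_sq, s2_sq = 386.0, 484.2                    # sample variances for XA-100 and ZB-200
n1, n2 = 5, 5
F = s2_sq / s1_sq                              # 484.2/386 = 1.2544
F_crit = f.ppf(0.95, dfn=n2 - 1, dfd=n1 - 1)   # F.05 with df1 = 4 and df2 = 4 (6.39)
p_value = f.sf(F, dfn=n2 - 1, dfd=n1 - 1)      # right-tail area; about 0.416

print(F, F_crit, p_value)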
We obtain the following result:
Testing the Equality of Population Variances (Two-Tailed Alternative)
Suppose we randomly select independent samples from two normally distributed populations and define all notation as in the previous box. Then, in order to test H0: σ1² = σ2² versus Ha: σ1² ≠ σ2², define the test statistic F = (the larger of s1² and s2²) divided by (the smaller of s1² and s2²), and let df1 equal the size of the sample having the larger variance minus 1 and df2 equal the size of the sample having the smaller variance minus 1. Also, define the corresponding p-value to be twice the area to the right of F under the curve of the F distribution having df1 numerator degrees of freedom and df2 denominator degrees of freedom. We can reject H0 at level of significance α if and only if 1 F > Fα/2 or, equivalently,
2 p-value < α. Here Fα/2 is based on df1 and df2 degrees of freedom.
EXAMPLE 10.12: The Catalyst Comparison Case
In the catalyst comparison situation, we can reject H0: σ1² = σ2² in favor of Ha: σ1² ≠ σ2² at the .05 level of significance if F = 484.2/386 = 1.2544 is greater than Fα/2 = F.05/2 = F.025. Here the degrees of freedom are df1 = 4 and df2 = 4, and Table A.7 (page 868) tells us that the appropriate F.025 point equals 9.60. Because F = 1.2544 is not greater than 9.60, we cannot reject H0 at the .05 level of significance. Furthermore, the MegaStat output of Figure 10.2(a) (page 404) and the MINITAB output of Figure 10.14(b) tell us that the p-value for this hypothesis test is 0.831. Notice that the MegaStat output gives the F statistic as defined in the preceding box—the larger of s1² and s2² divided by the smaller of s1² and s2²—whereas the MINITAB output gives the reciprocal of this value (as we calculated on page 427). Since the p-value is large, we have little evidence that the consistencies of the yields produced by Catalysts XA-100 and ZB-200 differ.
It has been suggested that the F test of H0: σ1² = σ2² be used to choose between the equal variances and unequal variances t-based procedures when comparing two means (as described in Section 10.2). Certainly the F test is one approach to making this choice. However, studies have shown that the validity of the F test is very sensitive to violations of the normality assumption—much more sensitive, in fact, than the equal variances procedure is to violations of the equal variances assumption. While opinions vary, some statisticians believe that this is a serious problem and that the F test should never be used to choose between the equal variances and unequal variances procedures. Others feel that performing the test for this purpose is reasonable if the test’s limitations are kept in mind. As an example for those who believe that using the F test is reasonable, we found in Example 10.12 that we do not reject H0: σ1² = σ2² at the .05 level of significance in the context of the catalyst comparison situation. Further, the p-value related to the F test, which equals 0.831, tells us that there is little evidence to suggest that the population variances differ. It follows that it might be reasonable to compare the mean yields of the catalysts by using the equal variances procedures (as we have done in Examples 10.3 and 10.4).
Exercises for Section 10.5
CONCEPTS
10.46 Explain what population is described by the sampling distribution of s1²/s2².
10.47 Intuitively explain why a value of s1²/s2² that is substantially greater than 1 provides evidence that σ1² is not equal to σ2².
METHODS AND APPLICATIONS
10.48 Use Table 10.8 to find the F.05 point for each of the following: a df1 = 3 numerator degrees of freedom and df2 = 14 denominator degrees of freedom. b df1 = 6 and df2 = 10. c df1 = 2 and df2 = 22. d df1 = 7 and df2 = 5.
10.49 Use Tables A.5, A.6, A.7, and A.8 (pages 866–869) to find the following Fα points: a F.10 with df1 = 4 numerator degrees of freedom and df2 = 7 denominator degrees of freedom. b F.01 with df1 = 3 and df2 = 25. c F.025 with df1 = 7 and df2 = 17. d F.05 with df1 = 9 and df2 = 3.
10.50 Suppose two independent random samples of sizes n1 = 9 and n2 = 7 that have been taken from two normally distributed populations having variances σ1² and σ2² give sample variances of s1² = 100 and s2² = 20. a Test versus with α = .05. What do you conclude? b Test versus with α = .05. What do you conclude?
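The two-tailed computations in Example 10.12, as well as the F points asked for in Exercises 10.48 and 10.49, can be checked with the following Python sketch; it is an illustrative cross-check rather than the MegaStat or MINITAB output cited in the example.

from scipy.stats import f

# Example 10.12: two-tailed test of H0: sigma1^2 = sigma2^2 versus Ha: sigma1^2 != sigma2^2
s1_sq, s2_sq = 386.0, 484.2
n1, n2 = 5, 5

F = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)         # larger sample variance over the smaller
df1 = (n2 if s2_sq > s1_sq else n1) - 1           # sample having the larger variance
df2 = (n1 if s2_sq > s1_sq else n2) - 1           # sample having the smaller variance

F_crit = f.ppf(1 - 0.05 / 2, dfn=df1, dfd=df2)    # F.025 point; about 9.60 when df1 = df2 = 4
p_value = 2 * f.sf(F, dfn=df1, dfd=df2)           # about 0.83, as reported in Example 10.12

print(F, F_crit, p_value)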
10.51 Suppose two independent random samples of sizes n1 = 5 and n2 = 16 that have been taken from two normally distributed populations having variances σ1² and σ2² give sample standard deviations of s1 = 5 and s2 = 9. a Test versus with α = .05. What do you conclude? b Test versus with α = .01. What do you conclude?
10.52 Consider the situation of Exercise 10.23 (page 408). Use the sample information to test H0: σ1² = σ2² versus Ha: σ1² ≠ σ2² with α = .05. Based on this test, does it make sense to believe that the unequal variances procedure is appropriate? Explain.
10.53 Consider the situation of Exercise 10.24 (page 408). AutoLoan a Use the MegaStat output in Figure 10.7 (page 409) and a critical value to test H0: σ1² = σ2² versus Ha: σ1² ≠ σ2² with α = .05. What do you conclude? b Use a p-value on the MegaStat output in Figure 10.7 to test H0: σ1² = σ2² versus Ha: σ1² ≠ σ2² with α = .05. What do you conclude? c Does it make sense to use the equal variances procedure in this situation? d Hand calculate the value of the F statistic for testing H0: σ1² = σ2². Show that your result turns out to be the same as the F statistic given in Figure 10.7.
Chapter Summary
This chapter has explained how to compare two populations by using confidence intervals and hypothesis tests. First we discussed how to compare two population means by using independent samples. Here the measurements in one sample are not related to the measurements in the other sample. We saw that in the unlikely event that the population variances are known, a z-based inference can be made. When these variances are unknown, t-based inferences are appropriate if the populations are normally distributed or the sample sizes are large. Both equal variances and unequal variances t-based procedures exist. We learned that, because it can be difficult to compare the population variances, many statisticians believe that it is almost always best to use the unequal variances procedure. Sometimes samples are not independent. We learned that one such case is what is called a paired difference experiment. Here we obtain two different measurements on the same sample units, and we can compare two population means by using a confidence interval or by conducting a hypothesis test that employs the differences between the pairs of measurements. We next explained how to compare two population proportions by using large, independent samples. Finally, we concluded this chapter by discussing how to compare two population variances by using independent samples, and we learned that this comparison is done by using a test based on the F distribution.
Glossary of Terms
F distribution: A continuous probability curve having a shape that depends on two parameters—the numerator degrees of freedom, df1, and the denominator degrees of freedom, df2. (pages 424–425)
independent samples experiment: An experiment in which there is no relationship between the measurements in the different samples. (page 396)
paired difference experiment: An experiment in which two different measurements are taken on the same units and inferences are made using the differences between the pairs of measurements. (page 413)
sampling distribution of p̂1 − p̂2: The probability distribution that describes the population of all possible values of p̂1 − p̂2, where p̂1 is the sample proportion for a random sample taken from one population and p̂2 is the sample proportion for a random sample taken from a second population.
(page 417)
sampling distribution of s1²/s2²: The probability distribution that describes the population of all possible values of s1²/s2², where s1² is the sample variance of a random sample taken from one population and s2² is the sample variance of a random sample taken from a second population. (page 424)
sampling distribution of x̄1 − x̄2: The probability distribution that describes the population of all possible values of x̄1 − x̄2, where x̄1 is the sample mean of a random sample taken from one population and x̄2 is the sample mean of a random sample taken from a second population. (page 396)
Important Formulas and Tests
Sampling distribution of x̄1 − x̄2 (independent random samples): page 396
z-based confidence interval for μ1 − μ2: page 396
z test about μ1 − μ2: page 397
t-based confidence interval for μ1 − μ2 when σ1² = σ2²: page 401
t-based confidence interval for μ1 − μ2 when σ1² ≠ σ2²: page 404
t test about μ1 − μ2 when σ1² = σ2²: page 403
t test about μ1 − μ2 when σ1² ≠ σ2²: page 404
Confidence interval for μd: page 411
A hypothesis test about μd: page 411
Sampling distribution of p̂1 − p̂2 (independent random samples): page 417
Large sample confidence interval for p1 − p2: page 418
Large sample hypothesis test about p1 − p2: page 419
Sampling distribution of s1²/s2² (independent random samples): page 424
A hypothesis test about the equality of σ1² and σ2²: pages 426 and 428
Supplementary Exercises
10.54 In its February 2, 1998, issue, Fortune magazine published the results of a Yankelovich Partners survey of 600 adults that investigated their ideas about marriage, divorce, and the contributions of the corporate wife. The survey results are shown in Figure 10.16. For each statement in the figure, the proportions of men and women who agreed with the statement are given. Assuming that the survey results were obtained from independent random samples of 300 men and 300 women: a For each statement, carry out a hypothesis test that tests the equality of the population proportions of men and women who agree with the statement. Use α equal to .10, .05, .01, and .001. How much evidence is there that the population proportions of men and women who agree with each statement differ? b For each statement, calculate a 95 percent confidence interval for the difference between the population proportion of men who agree with the statement and the population proportion of women who agree with the statement. Use the interval to help assess whether you feel that the difference between population proportions has practical significance.
Figure 10.16: The Results of a Yankelovich Partners Survey of 600 Adults on Marriage, Divorce, and the Contributions of the Corporate Wife (All Respondents with Income $50,000 or More) Source: Reprinted from the February 2, 1998, issue of Fortune. Copyright 1998 Time, Inc. Reprinted by permission.
Exercises 10.55 and 10.56 deal with the following situation: In an article in the Journal of Retailing, Kumar, Kerwin, and Pereira study factors affecting merger and acquisition activity in retailing by comparing “target firms” and “bidder firms” with respect to several financial and marketing-related variables. If we consider two of the financial variables included in the study, suppose a random sample of 36 “target firms” gives a mean earnings per share of $1.52 with a standard deviation of $0.92, and that this sample gives a mean debt-to-equity ratio of 1.66 with a standard deviation of 0.82.
Furthermore, an independent random sample of 36 “bidder firms” gives a mean earnings per share of $1.20 with a standard deviation of $0.84, and this sample gives a mean debt-to-equity ratio of 1.58 with a standard deviation of 0.81. 10.55 a Set up the null and alternative hypotheses needed to test whether the mean earnings per share for all “target firms” differs from the mean earnings per share for all “bidder firms.” Test these hypotheses at the .10, .05, .01, and .001 levels of significance. How much evidence is there that these means differ? Explain. b Calculate a 95 percent confidence interval for the difference between the mean earnings per share for “target firms” and “bidder firms.” Interpret the interval. 10.56 a Set up the null and alternative hypotheses needed to test whether the mean debt-to-equity ratio for all “target firms” differs from the mean debt-to-equity ratio for all “bidder firms.” Test these hypotheses at the .10, .05, .01, and .001 levels of significance. How much evidence is there that these means differ? Explain. b Calculate a 95 percent confidence interval for the difference between the mean debt-to-equity ratios for “target firms” and “bidder firms.” Interpret the interval. c Based on the results of this exercise and Exercise 10.55, does a firm’s earnings per share or the firm’s debt-to-equity ratio seem to have the most influence on whether a firm will be a “target” or a “bidder”? Explain. 10.57 What impact did the September 11 terrorist attack have on U.S. airline demand? An analysis was conducted by Ito and Lee, “Assessing the impact of the September 11 terrorist attacks on U.S. airline demand,” in the Journal of Economics and Business (January-February 2005). They found a negative short-term effect of over 30% and an ongoing negative impact of over 7%. Suppose that we wish to test the impact by taking a random sample of 12 airline routes before and after 9/11. Passenger miles (millions of passenger miles) for the same routes were tracked for the 12 months prior to and the 12 months immediately following 9/11. Assume that the population of all possible paired differences is normally distributed. a Set up the null and alternative hypotheses needed to determine whether there was a reduction in mean airline passenger demand. b Below we present the MINITAB output for the paired differences test. Use the output and critical values to test the hypotheses at the .10, .05, and .01 levels of significance. Has the true mean airline demand been reduced? c Use the p-value to test the hypotheses at the .10, .05, and .01 levels of significance. How much evidence is there against the null hypothesis? 10.58 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil H. Firtle discuss evaluating the effectiveness of a test coupon. Samples of 500 test coupons and 500 control coupons were randomly delivered to shoppers. The results indicated that 35 of the 500 control coupons were redeemed, while 50 of the 500 test coupons were redeemed. a In order to consider the test coupon for use, the marketing research organization required that the proportion of all shoppers who would redeem the test coupon be statistically shown to be greater than the proportion of all shoppers who would redeem the control coupon. Assuming that the two samples of shoppers are independent, carry out a hypothesis test at the .01 level of significance that will show whether this requirement is met by the test coupon. Explain your conclusion. 
b Use the sample data to find a point estimate and a 95 percent interval estimate of the difference between the proportions of all shoppers who would redeem the test coupon and the control coupon. What does this interval say about whether the test coupon should be considered for use? Explain. c Carry out the test of part a at the .10 level of significance. What do you conclude? Is your result statistically significant? Compute a 90 percent interval estimate instead of the 95 percent interval estimate of part b. Based on the interval estimate, do you feel that this result is practically important? Explain. 10.59 A marketing manager wishes to compare the mean prices charged for two brands of CD players. The manager conducts a random survey of retail outlets and obtains independent random samples of prices with the following results: Assuming normality and equal variances: a Use an appropriate hypothesis test to determine whether the mean prices for the two brands differ. How much evidence is there that the mean prices differ? b Use an appropriate 95 percent confidence interval to estimate the difference between the mean prices of the two brands of CD players. Do you think that the difference has practical importance? c Use an appropriate hypothesis test to provide evidence supporting the claim that the mean price of the Onkyo CD player is more than $30 higher than the mean price for the JVC CD player. Set α equal to .05. 10.60 Consider the situation of Exercise 10.59. Use the sample information to test H0: versus Ha: with α = .05. Based on this test, does it make sense to use the equal variances procedure? Explain. 10.61: Internet Exercise a A prominent issue of the 2000 U.S. presidential campaign was campaign finance reform. A Washington Post/ABC News poll (reported April 4, 2000) found that 63 percent of 1,083 American adults surveyed believed that stricter campaign finance laws would be effective (a lot or somewhat) in reducing the influence of money in politics. Was this view uniformly held or did it vary by gender, race, or political party affiliation? A summary of survey responses, broken down by gender, is given in the table below. [Source: Washington Post website: http://www.washingtonpost.com/wp-srv/politics/polls/vault/vault.htm. Click on the data link under Gore Seen More Able to Reform Education, April 4, 2000, then click again on the date link, 04/04/2000, to the right of the campaign finance question. For a gender breakdown, select sex in the Results By: box and click Go. Note that the survey report does not include numbers of males and females questioned. These values were estimated using 1990 U.S. Census figures showing that males made up 48 percent of the U.S. adult population.] Is there sufficient evidence in this survey to conclude that the proportion of individuals who believed that campaign finance laws can reduce the influence of money in politics differs between females and males? Set up the appropriate null and alternative hypotheses. Conduct your test at the .05 and .01 levels of significance and calculate the p-value for your test. Make sure your conclusion is clearly stated. b Search the World Wide Web for an interesting recent political poll dealing with an issue or political candidates, where responses are broken down by gender or some other two-category classification. (A list of high-potential websites is given below.) Use a difference in proportions test to determine whether political preference differs by gender or other two-level grouping. 
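Several of the supplementary exercises above (10.55, 10.56, and 10.59) call for equal-variances t computations from summary statistics. The sketch below is a minimal Python illustration using the earnings-per-share figures quoted for the “target” and “bidder” firms; it is offered as a cross-check on hand calculations, not as the text's own solution.

from scipy.stats import ttest_ind_from_stats

# Earnings per share: "target" firms versus "bidder" firms (summary data given above)
t_stat, p_value = ttest_ind_from_stats(
    mean1=1.52, std1=0.92, nobs1=36,   # target firms
    mean2=1.20, std2=0.84, nobs2=36,   # bidder firms
    equal_var=True,                    # pooled-variance (equal variances) t test
)

# Two-sided p-value; compare it with alpha = .10, .05, .01, and .001 as Exercise 10.55 asks
print(t_stat, p_value)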
Appendix 10.1: Two-Sample Hypothesis Testing Using MINITAB The instruction blocks in this section each begin by describing the entry of data into the Minitab data window. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.1 for further information about entering data, saving data, and printing results when using MINITAB. Test for the difference between means, unequal variances, in Figure 10.4 on page 406 (data file: Catalyst.MTW): • In the data window, enter the data from Table 10.1 (page 402) into two columns with variable names XA-100 and ZB-200. • Select Stat: Basic Statistics : 2-Sample t. • In the “2-Sample t (Test and Confidence Interval)” dialog box, select the “Samples in different columns” option. • Select the XA-100 variable into the First window. • Select the ZB-200 variable into the Second window. • Click on the Options… button, enter the desired level of confidence (here, 95.0) in the “Confidence level” window, enter 0.0 in the “Test difference” window, and select “not equal” from the Alternative pull-down menu. Click OK in the “2-Sample t—Options” dialog box. • To produce yield by catalyst type boxplots, click the Graphs… button, check the “Boxplots of data” checkbox, and click OK in the “2 Sample t—Graphs” dialog box. • Click OK in the “2-Sample t (Test and Confidence Interval)” dialog box. • The results of the two-sample t test (including the t statistic and p-value) and the confidence interval for the difference between means appear in the Session window, while the boxplots will be displayed in a graphics window. • A test for the difference between two means when the variances are equal can be performed by placing a checkmark in the “Assume Equal Variances” checkbox in the “2-Sample t (Test and Confidence Interval)” dialog box. Test for paired differences in Figure 10.8(a) on page 412 (data file: Repair.MTW): • In the Data window, enter the data from Table 10.2 (page 410) into two columns with variable names Garage1 and Garage2. • Select Stat: Basic Statistics : Paired t. • In the “Paired t (Test and Confidence Interval)” dialog box, select the “Samples in columns” option. • Select Garage1 into the “First sample” window and Garage2 into the “Second sample” window. • Click the Options… button. • In the “Paired t—Options” dialog box, enter the desired level of confidence (here, 95.0) in the “Confidence level” window, enter 0.0 in the Test mean” window, select “less than” from the Alternative pull-down menu, and click OK. • To produce a boxplot of differences with a graphical summary of the test, click the Graphs… button, check the “Boxplot of differences” checkbox, and click OK in the “Paired t—Graphs” dialog box. • Click OK in the “Paired t (Test and Confidence Interval)” dialog box. The results of the paired t-test are given in the Session window, and graphical output is displayed in a graphics window. Hypothesis test and confidence interval for two Independent proportions in the advertising media situation of Example 10.9 and 10.10 on pages 418 to 420: • Select Stat: Basic Statistics : 2 Proportions. • In the “2 Proportions (Test and Confidence Interval)” dialog box, select the “Summarized data” option. • Enter the sample size for Des Moines (equal to 1000) into the “First—Trials” window, and enter the number of successes for Des Moines (equal to 631) into the “First—Events” window. 
• Enter the sample size for Toledo (equal to 1000) into the “Second—Trials” window, and enter the number of successes for Toledo (equal to 798) into the “Second—Events” window. • Click on the Options… button. • In the “2 Proportions—Options” dialog box, enter the desired level of confidence (here 95.0) in the “Confidence level” window. • Enter 0.0 into the “Test difference” window because we are testing that the difference between the two proportions equals zero. • Select the desired alternative hypothesis (here “not equal”) from the Alternative drop-down menu. • Check the “Use pooled estimate of p for test” checkbox because “Test difference” equals zero. Do not check this box in cases where “Test difference” does not equal zero. • Click OK in the “2 Proportions—Options” dialog box. • Click OK in the “2 Proportions (Test and Confidence Interval)” dialog box to obtain results for the test in the Session window. Test for equality of variances in Figure 10.14(b) on page 427 (data file: Catalyst.MTW): • The MINITAB equality of variance test requires that the yield data be entered in a single column with sample identifiers in a second column: • In the Data window, enter the yield data from Table 10.1 (page 402) into a single column with variable name Yield. In a second column with variable name Catalyst, enter the corresponding identifying tag, XA-100 or ZB-200, for each yield figure. • Select Stat: ANOVA: Test for Equal Variances. • In the “Test for Equal Variances” dialog box, select the Yield variable into the Response window. • Select the Catalyst variable into the Factors window. • Enter the desired level of confidence (here, 95.0) in the Confidence Level window. • Click OK in the “Test for Equal Variances” dialog box. • The reciprocal of the F-statistic (as described in the text) and the p-value will be displayed in the session window (along with additional output that we do not describe in this book). A graphical summary of the test is shown in a graphics window. Appendix 10.2: Two-Sample Hypothesis Testing Using Excel The instruction blocks in this section each begin by describing the entry of data into an Excel spreadsheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results when using Excel. Test for the difference between means, equal variances, in Figure 10.2(b) on page 404 (data file: Catalyst.xlsx): • Enter the data from Table 10.1 (page 402) into two columns: yields for catalyst XA-100 in column A and yields for catalyst ZB-200 in column B, with labels XA-100 and ZB-200. • Select Data: Data Analysis : t-Test: Two-Sample Assuming Equal Variances and click OK in the Data Analysis dialog box. • In the t-Test dialog box, enter A1.A6 in the “Variable 1 Range” window. • Enter B1.B6 in the “Variable 2 Range” window. • Enter 0 (zero) in the “Hypothesized Mean Difference” box. • Place a checkmark in the Labels checkbox. • Enter 0.05 into the Alpha box. • Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet. • Click OK in the t-Test dialog box. • The output will be displayed in a new worksheet. 
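The package-specific steps in these appendices can also be cross-checked in a general-purpose language. For instance, the two-proportions z test for the advertising media example of Appendix 10.1 (631 of 1,000 Des Moines households and 798 of 1,000 Toledo households) can be reproduced with the short Python sketch below; this is an illustrative check, not output from MINITAB, Excel, or MegaStat.

import math
from scipy.stats import norm

x1, n1 = 631, 1000     # Des Moines successes and sample size
x2, n2 = 798, 1000     # Toledo successes and sample size
p1_hat, p2_hat = x1 / n1, x2 / n2

p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate, since the hypothesized difference is zero
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
p_value = 2 * norm.sf(abs(z))           # two-sided p-value

print(z, p_value)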
Test for equality of variances similar to Figure 10.14(a) on page 427 (data file: Catalyst.xlsx): • Enter the data from Table 10.1 (page 402) into two columns: yields for catalyst XA-100 in column A and yields for catalyst ZB-200 in column B, with labels XA-100 and ZB-200. • Select Data: Data Analysis : F-Test Two-Sample for Variances and click OK in the Data Analysis dialog box. • In the F-Test dialog box, enter A1.A6 in the “Variable 1 Range” window. • Enter B1.B6 in the “Variable 2 Range” window. • Place a checkmark in the Labels checkbox. • Enter 0.05 into the Alpha box. • Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet. • Click OK in the F-Test dialog box. • The output will be displayed in a new worksheet. Test for paired differences in Figure 10.9 on page 412 (data file: Repair.xls): • Enter the data from Table 10.2 (page 410) into two columns: costs for Garage 1 in column A and costs for Garage 2 in column B, with labels Garage 1 and Garage 2. • Select Data: Data Analysis : t-Test: Paired Two Sample for Means and click OK in the Data Analysis dialog box. • In the t-Test dialog box, enter A1.A8 into the “Variable 1 Range” window. • Enter B1.B8 into the “Variable 2 Range” window. • Enter 0 (zero) in the “Hypothesized Mean Difference” box. • Place a checkmark in the Labels checkbox. • Enter 0.05 into the Alpha box. • Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet. • Click OK in the t-Test dialog box. • The output will be displayed in a new worksheet. Appendix 10.3: Two-Sample Hypothesis Testing Using MegaStat The instructions in this section begin by describing the entry of data into an Excel worksheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data and saving and printing results in Excel. Please refer to Appendix 1.3 for more information about using MegaStat. Test for the difference between means, equal variances, in Figure 10.2(a) on page 404 (data file: Catalyst.xlsx): • Enter the data from Table 10.1 (page 402) into two columns: yields for catalyst XA-100 in column A and yields for catalyst ZB-200 in column B, with labels XA-100 and ZB-200. • Select MegaStat: Hypothesis Tests : Compare Two Independent Groups • In the “Hypothesis Test: Compare Two Independent Groups” dialog box, click on “data input.” • Click in the Group 1 window and use the autoexpand feature to enter the range A1.A6. • Click in the Group 2 window and use the autoexpand feature to enter the range B1.B6. • Enter the Hypothesized Difference (here equal to 0) into the so labeled window. • Select an Alternative (here “not equal”) from the drop-down menu in the Alternative box. • Click on “t-test (pooled variance)” to request the equal variances test described on page 403. • Check the “Display confidence interval” checkbox, and select or type a desired level of confidence. • Check the “Test for equality of variances” checkbox to request the F test described on page 428. • Click OK in the “Hypothesis Test: Compare Two Independent Groups” dialog box. • The t test assuming unequal variances described on page 404 can be done by clicking “t-test (unequal variances)”. 
Test for paired differences in Figure 10.8(b) on page 412 (data file: Repair.xlsx): • Enter the data from Table 10.2 (page 410) into two columns: costs for Garage 1 in column A and costs for Garage 2 in column B, with labels Garage1 and Garage2. • Select Add-Ins: MegaStat: Hypothesis Tests : Paired Observations. • In the “Hypothesis Test: Paired Observations” dialog box, click on “data input.” • Click in the Group 1 window, and use the autoexpand feature to enter the range A1.A8. • Click in the Group 2 window, and use the autoexpand feature to enter the range B1.B8. • Enter the Hypothesized difference (here equal to 0) into the so labeled window. • Select an Alternative (here “not equal”) from the drop-down menu in the Alternative box. • Click on “t-test.” • Click OK in the “Hypothesis Test: Paired Observations” dialog box. • If the sample sizes are large, a test based on the normal distribution can be done by clicking on “z-test.” Hypothesis Test and Confidence Interval for Two Independent Proportions in Exercise 10.42 on page 421: • Select Add-Ins: MegaStat: Hypothesis Tests : Compare Two Independent Proportions. • In the “Hypothesis Test: Compare Two Proportions” dialog box, enter the number of successes x (here equal to 25) and the sample size n (here equal to 140) for homeowners in the “x” and “n” Group 1 windows. • Enter the number of successes x (here equal to 9) and the sample size n (here equal to 60) for renters in the “x” and “n” Group 2 windows. • Enter the Hypothesized difference (here equal to 0) into the so labeled window. • Select an Alternative (here “not equal”) from the drop-down menu in the Alternative box. • Check the “Display confidence interval” checkbox, and select or type a desired level of confidence (here equal to 95%). • Click OK in the “Hypothesis Test: Compare Two Proportions” dialog box.
1 Each sample in this chapter is a random sample. As has been our practice throughout this book, for brevity we sometimes refer to “random samples” as “samples.”
2 This means that there is no relationship between the measurements in one sample and the measurements in the other sample.
3 All of the box plots presented in this chapter and in Chapter 11 have been obtained using MINITAB.
4 We describe how to test the equality of two variances in Section 10.5 (although, as we will explain, this test has drawbacks).
5 Source: “Student Effort and Performance over the Semester,” Journal of Economic Education, Winter 2005, pages 3–28.
6 More correctly, because p̂1(1 − p̂1)/(n1 − 1) and p̂2(1 − p̂2)/(n2 − 1) are unbiased point estimates of p1(1 − p1)/n1 and p2(1 − p2)/n2, a point estimate of the variance of p̂1 − p̂2 is their sum, and a 100(1 − α) percent confidence interval for p1 − p2 is [p̂1 − p̂2 ± zα/2 √(p̂1(1 − p̂1)/(n1 − 1) + p̂2(1 − p̂2)/(n2 − 1))]. Because both n1 and n2 are large, there is little difference between the interval obtained by using this formula and those obtained by using the formula in the box above.
7 Source: World Wide Web, http://www.gallup.com/poll/releases/, PR991230.ASP. The Gallup Poll, December 30, 1999. © 1999 The Gallup Organization. All rights reserved.
8 Note that we divide σ1² by σ2² to form a null hypothesis of the form σ1²/σ2² = 1 rather than subtracting σ2² from σ1² to form a null hypothesis of the form σ1² − σ2² = 0. This is because the population of all possible values of s1² − s2² has no known sampling distribution.
(Bowerman 394) Bowerman, Bruce L. Business Statistics in Practice, 5th Edition. McGraw-Hill Learning Solutions, 022008.

CHAPTER 11: Experimental Design and Analysis of Variance
Chapter Outline

11.1
Basic Concepts of Experimental Design

11.2
One-Way Analysis of Variance

11.3
The Randomized Block Design

11.4
Two-Way Analysis of Variance
In Chapter 10 we learned that business improvement often involves making comparisons. In that chapter we presented several confidence intervals and several hypothesis testing procedures for comparing two population means. However, business improvement often requires that we compare more than two population means. For instance, we might compare the mean sales obtained by using three different advertising campaigns in order to improve a company’s marketing process. Or, we might compare the mean production output obtained by using four different manufacturing process designs to improve productivity.
In this chapter we extend the methods presented in Chapter 10 by considering statistical procedures for comparing two or more population means. Each of the methods we discuss is called an analysis of variance (ANOVA) procedure. We also present some basic concepts of experimental design, which involves deciding how to collect data in a way that allows us to most effectively compare population means.
We explain the methods of this chapter in the context of four cases:

The Gasoline Mileage Case: An oil company wishes to develop a reasonably priced gasoline that will deliver improved mileages. The company uses one-way analysis of variance to compare the effects of three types of gasoline on mileage in order to find the gasoline type that delivers the highest mean mileage.
The Commercial Response Case: Firms that run commercials on television want to make the best use of their advertising dollars. In this case, researchers use one-way analysis of variance to compare the effects of varying program content on a viewer’s ability to recall brand names after watching TV commercials.
The Defective Cardboard Box Case: A paper company performs an experiment to investigate the effects of four production methods on the number of defective cardboard boxes produced in an hour. The company uses a randomized block ANOVA to determine which production method yields the smallest mean number of defective boxes.
The Shelf Display Case: A commercial bakery supplies many supermarkets. In order to improve the effectiveness of its supermarket shelf displays, the company wishes to compare the effects of shelf display height (bottom, middle, or top) and width (regular or wide) on monthly demand. The bakery employs two-way analysis of variance to find the display height and width combination that produces the highest monthly demand.
11.1: Basic Concepts of Experimental Design
In many statistical studies a variable of interest, called the
response variable
(or dependent variable), is identified. Then data are collected that tell us about how one or more
factors
(or independent variables) influence the variable of interest. If we cannot control the factor(s) being studied, we say that the data obtained are observational. For example, suppose that in order to study how the size of a home relates to the sales price of the home, a real estate agent randomly selects 50 recently sold homes and records the square footages and sales prices of these homes. Because the real estate agent cannot control the sizes of the randomly selected homes, we say that the data are observational.
If we can control the factors being studied, we say that the data are experimental. Furthermore, in this case the values, or levels, of the factor (or combination of factors) are called
treatments. The purpose of most experiments is to compare and estimate the effects of the different treatments on the response variable. For example, suppose that an oil company wishes to study how three different gasoline types (A, B, and C) affect the mileage obtained by a popular midsized automobile model. Here the response variable is gasoline mileage, and the company will study a single factor—gasoline type. Since the oil company can control which gasoline type is used in the midsized automobile, the data that the oil company will collect are experimental. Furthermore, the treatments—the levels of the factor gasoline type—are gasoline types A, B, and C.
In order to collect data in an experiment, the different treatments are assigned to objects (people, cars, animals, or the like) that are called
experimental units. For example, in the gasoline mileage situation, gasoline types A, B, and C will be compared by conducting mileage tests using a midsized automobile. The automobiles used in the tests are the experimental units.
In general, when a treatment is applied to more than one experimental unit, it is said to be
replicated. Furthermore, when the analyst controls the treatments employed and how they are applied to the experimental units, a designed experiment is being carried out. A commonly used, simple experimental design is called the
completely randomized experimental design.
In a
completely randomized experimental design, independent random samples of experimental units are assigned to the treatments.
Suppose we assign three experimental units to each of five treatments. We can achieve a completely randomized experimental design by assigning experimental units to treatments as follows. First, randomly select three experimental units and assign them to the first treatment. Next, randomly select three different experimental units from those remaining and assign them to the second treatment. That is, select these units from those not assigned to the first treatment. Third, randomly select three different experimental units from those not assigned to either the first or second treatment. Assign these experimental units to the third treatment. Continue this procedure until the required number of experimental units have been assigned to each treatment.
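As a concrete illustration of this assignment procedure, the following short Python sketch (with unit labels invented purely for the example) randomly divides 15 experimental units among five treatments, three units per treatment.

import random

random.seed(42)   # fixed seed so the illustration is reproducible

units = [f"unit{i}" for i in range(1, 16)]       # 15 hypothetical experimental units
treatments = ["T1", "T2", "T3", "T4", "T5"]      # 5 treatments, 3 units each

random.shuffle(units)                            # put the experimental units in random order
assignment = {t: units[3 * k: 3 * k + 3] for k, t in enumerate(treatments)}

for treatment, assigned_units in assignment.items():
    print(treatment, assigned_units)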
Once experimental units have been assigned to treatments, a value of the response variable is observed for each experimental unit. Thus we obtain a sample of values of the response variable for each treatment. When we employ a completely randomized experimental design, we assume that each sample has been randomly selected from the population of all values of the response variable that could potentially be observed when using its particular treatment. We also assume that the different samples of response variable values are independent of each other. This is usually reasonable because the completely randomized design ensures that each different sample results from different measurements being taken on different experimental units. Thus we sometimes say that we are conducting an independent samples experiment.

EXAMPLE 11.1: The Gasoline Mileage Case
North American Oil Company is attempting to develop a reasonably priced gasoline that will deliver improved gasoline mileages. As part of its development process, the company would like to compare the effects of three types of gasoline (A, B, and C) on gasoline mileage. For testing purposes, North American Oil will compare the effects of gasoline types A, B, and C on the gasoline mileage obtained by a popular midsized model called the Fire-Hawk. Suppose the company has access to 1,000 Fire-Hawks that are representative of the population of all Fire-Hawks, and suppose the company will utilize a completely randomized experimental design that employs samples of size five. In order to accomplish this, five Fire-Hawks will be randomly selected from the 1,000 available Fire-Hawks. These autos will be assigned to gasoline type A. Next, five different Fire-Hawks will be randomly selected from the remaining 995 available Fire-Hawks. These autos will be assigned to gasoline type B. Finally, five different Fire-Hawks will be randomly selected from the remaining 990 available Fire-Hawks. These autos will be assigned to gasoline type C.
Each randomly selected Fire-Hawk is test driven using the appropriate gasoline type (treatment) under normal conditions for a specified distance, and the gasoline mileage for each test drive is measured. We let xij denote the jth mileage obtained when using gasoline type i. The mileage data obtained are given in Table 11.1. Here we assume that the set of gasoline mileage observations obtained by using a particular gasoline type is a sample randomly selected from the infinite population of all Fire-Hawk mileages that could be obtained using that gasoline type. Examining the box plots shown next to the mileage data, we see some evidence that gasoline type B yields the highest gasoline mileages.1
Table 11.1: The Gasoline Mileage Data GasMile2

EXAMPLE 11.2: The Shelf Display Case
The Tastee Bakery Company supplies a bakery product to many supermarkets in a metropolitan area. The company wishes to study the effect of the shelf display height employed by the supermarkets on monthly sales (measured in cases of 10 units each) for this product. Shelf display height, the factor to be studied, has three levels—bottom (B), middle (M), and top (T)—which are the treatments. To compare these treatments, the bakery uses a completely randomized experimental design. For each shelf height, six supermarkets (the experimental units) of equal sales potential are randomly selected, and each supermarket displays the product using its assigned shelf height for a month. At the end of the month, sales of the bakery product (the response variable) at the 18 participating stores are recorded, giving the data in Table 11.2. Here we assume that the set of sales amounts for each display height is a sample randomly selected from the population of all sales amounts that could be obtained (at supermarkets of the given sales potential) at that display height. Examining the box plots that are shown next to the sales data, we seem to have evidence that a middle display height gives the highest bakery product sales.
Table 11.2: The Bakery Product Sales Data BakeSale

EXAMPLE 11.3: The Commercial Response Case
Advertising research indicates that when a television program is involving (such as the 2002 Super Bowl between the St. Louis Rams and New England Patriots, which was very exciting), individuals exposed to commercials tend to have difficulty recalling the names of the products advertised. Therefore, in order for companies to make the best use of their advertising dollars, it is important to show their most original and memorable commercials during involving programs.
In an article in the Journal of Advertising Research, Soldow and Principe (1981) studied the effect of program content on the response to commercials. Program content, the factor studied, has three levels—more involving programs, less involving programs, and no program (that is, commercials only)—which are the treatments. To compare these treatments, Soldow and Principe employed a completely randomized experimental design. For each program content level, 29 subjects were randomly selected and exposed to commercials in that program content level. Then a brand recall score (measured on a continuous scale) was obtained for each subject. The 29 brand recall scores for each program content level are assumed to be a sample randomly selected from the population of all brand recall scores for that program content level. Although we do not give the results in this example, the reader will analyze summary statistics describing these results in the exercises of Section 11.2.
Exercises for Section 11.1
CONCEPTS

11.1 Define the meaning of the terms response variable, factor, treatments, and experimental units.

11.2 What is a completely randomized experimental design?
METHODS AND APPLICATIONS
11.3 A study compared three different display panels for use by air traffic controllers. Each display panel was tested in a simulated emergency condition; 12 highly trained air traffic controllers took part in the study. Four controllers were randomly assigned to each display panel. The time (in seconds) needed to stabilize the emergency condition was recorded. The results of the study are given in Table 11.3. For this situation, identify the response variable, factor of interest, treatments, and experimental units. Display
Table 11.3: Display Panel Study Data Display

11.4 A consumer preference study compares the effects of three different bottle designs (A, B, and C) on sales of a popular fabric softener. A completely randomized design is employed. Specifically, 15 supermarkets of equal sales potential are selected, and 5 of these supermarkets are randomly assigned to each bottle design. The number of bottles sold in 24 hours at each supermarket is recorded. The data obtained are displayed in Table 11.4. For this situation, identify the response variable, factor of interest, treatments, and experimental units. BottleDes
Table 11.4: Bottle Design Study Data BottleDes

11.2: One-Way Analysis of Variance

Suppose we wish to study the effects of p treatments (treatments 1, 2,…, p) on a response variable. For any particular treatment, say treatment i, we define μi and σi to be the mean and standard deviation of the population of all possible values of the response variable that could potentially be observed when using treatment i. Here we refer to μi as treatment mean i. The goal of one-way analysis of variance (often called one-way ANOVA) is to estimate and compare the effects of the different treatments on the response variable. We do this by estimating and comparing the treatment means μ1, μ2,…, μp. Here we assume that a sample has been randomly selected for each of the p treatments by employing a completely randomized experimental design. We let ni denote the size of the sample that has been randomly selected for treatment i, and we let xij denote the jth value of the response variable that is observed when using treatment i. It then follows that the point estimate of μi is x̄i, the average of the sample of ni values of the response variable observed when using treatment i. It further follows that the point estimate of σi is si, the standard deviation of the sample of ni values of the response variable observed when using treatment i.
EXAMPLE 11.4: The Gasoline Mileage Case
Consider the gasoline mileage situation. We let μA, μB, and μC denote the means and σA, σB, and σC denote the standard deviations of the populations of all possible gasoline mileages using gasoline types A, B, and C. To estimate these means and standard deviations, North American Oil has employed a completely randomized experimental design and has obtained the samples of mileages in Table 11.1. The means of these samples—x̄A = 34.92, x̄B = 36.56, and x̄C = 33.98—are the point estimates of μA, μB, and μC. The standard deviations of these samples—sA = .7662, sB = .8503, and sC = .8349—are the point estimates of σA, σB, and σC. Using these point estimates, we will (later in this section) test to see whether there are any statistically significant differences between the treatment means μA, μB, and μC. If such differences exist, we will estimate the magnitudes of these differences. This will allow North American Oil to judge whether these differences have practical importance.
The one-way ANOVA formulas allow us to test for significant differences between treatment means and allow us to estimate differences between treatment means. The validity of these formulas requires that the following assumptions hold:
Assumptions for One-Way Analysis of Variance
1 Constant variance—the p populations of values of the response variable associated with the treatments have equal variances.
2 Normality—the p populations of values of the response variable associated with the treatments all have normal distributions.
3 Independence—the samples of experimental units associated with the treatments are randomly selected, independent samples.
The one-way ANOVA results are not very sensitive to violations of the equal variances assumption. Studies have shown that this is particularly true when the sample sizes employed are equal (or nearly equal). Therefore, a good way to make sure that unequal variances will not be a problem is to take samples that are the same size. In addition, it is useful to compare the sample standard deviations s1, s2,…, sp to see if they are reasonably equal. As a general rule, the one-way ANOVA results will be approximately correct if the largest sample standard deviation is no more than twice the smallest sample standard deviation. The variations of the samples can also be compared by constructing a box plot for each sample (as we have done for the gasoline mileage data in Table 11.1). Several statistical texts also employ the sample variances to test the equality of the population variances [see Bowerman and O’Connell (1990) for two of these tests]. However, these tests have some drawbacks—in particular, their results are very sensitive to violations of the normality assumption. Because of this, there is controversy as to whether these tests should be performed.
The normality assumption says that each of the p populations is normally distributed. This assumption is not crucial. It has been shown that the one-way ANOVA results are approximately valid for mound-shaped distributions. It is useful to construct a box plot and/or a stem-and-leaf display for each sample. If the distributions are reasonably symmetric, and if there are no outliers, the ANOVA results can be trusted for sample sizes as small as 4 or 5. As an example, consider the gasoline mileage study of Examples 11.1 and 11.4. The box plots of Table 11.1 suggest that the variability of the mileages in each of the three samples is roughly the same. Furthermore, the sample standard deviations sA = .7662, sB = .8503, and sC = .8349 are reasonably equal (the largest is not even close to twice the smallest). Therefore, it is reasonable to believe that the constant variance assumption is satisfied. Moreover, because the sample sizes are the same, unequal variances would probably not be a serious problem anyway. Many small, independent factors influence gasoline mileage, so the distributions of mileages for gasoline types A, B, and C are probably mound-shaped. In addition, the box plots of Table 11.1 indicate that each distribution is roughly symmetric with no outliers. Thus, the normality assumption probably approximately holds. Finally, because North American Oil has employed a completely randomized design, the independence assumption probably holds. This is because the gasoline mileages in the different samples were obtained for different Fire-Hawks.
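If the sample standard deviations are available, the 2-to-1 rule of thumb above is easy to check in a short script. The following minimal Python sketch (the helper function check_equal_variances is ours, not part of any package used in this chapter) applies the rule to the three gasoline mileage sample standard deviations quoted above.

```python
# A minimal sketch of the "largest s no more than twice the smallest s" rule of thumb.
# The sample standard deviations are those reported for the gasoline mileage study.

def check_equal_variances(sample_sds):
    """Return True if the largest sample SD is at most twice the smallest."""
    smallest, largest = min(sample_sds), max(sample_sds)
    ratio = largest / smallest
    print(f"smallest s = {smallest:.4f}, largest s = {largest:.4f}, ratio = {ratio:.2f}")
    return ratio <= 2.0

gasoline_sds = {"A": 0.7662, "B": 0.8503, "C": 0.8349}
print("Constant variance plausible:", check_equal_variances(gasoline_sds.values()))
```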
Testing for significant differences between treatment means
As a preliminary step in one-way ANOVA, we wish to determine whether there are any statistically significant differences between the treatment means μ1, μ2,…, μp. To do this, we test the null hypothesis
H0: μ1 = μ2 = ··· = μp
This hypothesis says that all the treatments have the same effect on the mean response. We test H0 versus the alternative hypothesis
Ha: At least two of μ1, μ2,…, μp differ
This alternative says that at least two treatments have different effects on the mean response.
To carry out such a test, we compare what we call the between-treatment variability to the within-treatment variability. For instance, suppose we wish to study the effects of three gasoline types (A, B, and C) on mean gasoline mileage, and consider Figure 11.1(a). This figure depicts three independent random samples of gasoline mileages obtained using gasoline types A, B, and C. Observations obtained using gasoline type A are plotted as blue dots, observations obtained using gasoline type B are plotted as red dots, and observations obtained using gasoline type C are plotted as green dots. Furthermore, the sample treatment means are labeled as “type A mean,” “type B mean,” and “type C mean.” We see that the variability of the sample treatment means—that is, the between-treatment variability—is not large compared to the variability within each sample (the within-treatment variability). In this case, the differences between the sample treatment means could quite easily be the result of sampling variation. Thus we would not have sufficient evidence to reject
Figure 11.1: Comparing Between-Treatment Variability and Within-Treatment Variability

H0: μA = μB = μC
Next look at Figure 11.1(b), which depicts a different set of three independent random samples of gasoline mileages. Here the variability of the sample treatment means (the between-treatment variability) is large compared to the variability within each sample. This would probably provide enough evidence to tell us to reject
H0: μA = μB = μC
in favor of
Ha: At least two of μA, μB, and μC differ
We would conclude that at least two of gasoline types A, B, and C have different effects on mean mileage.
In order to numerically compare the between-treatment and within-treatment variability, we can define several sums of squares and mean squares. To begin, we define n to be the total number of experimental units employed in the one-way ANOVA, and we define x̄ to be the overall mean of all observed values of the response variable. Then we define the following:
The treatment sum of squares is

SST = n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + ··· + np(x̄p − x̄)²

In order to compute SST, we calculate the difference between each sample treatment mean x̄i and the overall mean x̄, we square each of these differences, we multiply each squared difference by the number of observations for that treatment, and we sum over all treatments. The SST measures the variability of the sample treatment means. For instance, if all the sample treatment means (x̄i values) were equal, then the treatment sum of squares would be equal to 0. The more the x̄i values vary, the larger will be SST. In other words, the treatment sum of squares measures the amount of between-treatment variability.
As an example, consider the gasoline mileage data in Table 11.1. In this experiment we employ a total of
n = nA + nB + nC = 5 + 5 + 5 = 15
experimental units. Furthermore, the overall mean of the 15 observed gasoline mileages is

x̄ = (34.92 + 36.56 + 33.98)/3 = 35.153

(because the three samples have equal size, x̄ is the average of the three sample means). Then

SST = 5(34.92 − 35.153)² + 5(36.56 − 35.153)² + 5(33.98 − 35.153)² = 17.0493
In order to measure the within-treatment variability, we define the following quantity:
The error sum of squares is

SSE = Σj (x1j − x̄1)² + Σj (x2j − x̄2)² + ··· + Σj (xpj − x̄p)²
Here x1j is the jth observed value of the response in the first sample, x2j is the jth observed value of the response in the second sample, and so forth. The formula above says that we compute SSE by calculating the squared difference between each observed value of the response and its corresponding treatment mean and by summing these squared differences over all the observations in the experiment.
The SSE measures the variability of the observed values of the response variable around their respective treatment means. For example, if there were no variability within each sample, the error sum of squares would be equal to 0. The more the values within the samples vary, the larger will be SSE.
As an example, in the gasoline mileage study, the sample treatment means are x̄A = 34.92, x̄B = 36.56, and x̄C = 33.98. It follows that

SSE = Σj (xAj − 34.92)² + Σj (xBj − 36.56)² + Σj (xCj − 33.98)² = 8.028
Finally, we define a sum of squares that measures the total amount of variability in the observed values of the response:
The total sum of squares is
SSTO = SST + SSE
The variability in the observed values of the response must come from one of two sources—the between-treatment variability or the within-treatment variability. It follows that the total sum of squares equals the sum of the treatment sum of squares and the error sum of squares. Therefore, SST and SSE are said to partition the total sum of squares.
In the gasoline mileage study, we see that
SSTO = SST + SSE = 17.0493 + 8.028 = 25.0773
Using the treatment and error sums of squares, we next define two mean squares:
The treatment mean square is

MST = SST/(p − 1)

The error mean square is

MSE = SSE/(n − p)
In order to decide whether there are any statistically significant differences between the treatment means, it makes sense to compare the amount of between-treatment variability to the amount of within-treatment variability. This comparison suggests the following F test:
An F Test for Differences between Treatment Means
Suppose that we wish to compare p treatment means μ1, μ2,…, μp and consider testing
H0: μ1 = μ2 = ··· = μp  versus  Ha: At least two of μ1, μ2,…, μp differ
Define the F statistic
F = MST/MSE
and its p-value to be the area under the F curve with p − 1 and n − p degrees of freedom to the right of F. We can reject H0 in favor of Ha at level of significance α if either of the following equivalent conditions holds:
1 F > Fα

2 p-value < α

Here the Fα point is based on p − 1 numerator and n − p denominator degrees of freedom. A large value of F results when SST, which measures the between-treatment variability, is large compared to SSE, which measures the within-treatment variability. If F is large enough, this implies that H0 should be rejected. The rejection point Fα tells us when F is large enough to allow us to reject H0 at level of significance α. When F is large, the associated p-value is small. If this p-value is less than α, we can reject H0 at level of significance α.

EXAMPLE 11.5: The Gasoline Mileage Case

Consider the North American Oil Company data in Table 11.1. The company wishes to determine whether any of gasoline types A, B, and C have different effects on mean Fire-Hawk gasoline mileage. That is, we wish to see whether there are any statistically significant differences between μA, μB, and μC. To do this, we test the null hypothesis

H0: μA = μB = μC

which says that gasoline types A, B, and C have the same effects on mean gasoline mileage. We test H0 versus the alternative

Ha: At least two of μA, μB, and μC differ

which says that at least two of gasoline types A, B, and C have different effects on mean gasoline mileage. Since we have previously computed SST to be 17.0493 and SSE to be 8.028, and because we are comparing p = 3 treatment means, we have

MST = SST/(p − 1) = 17.0493/2 = 8.5247  and  MSE = SSE/(n − p) = 8.028/12 = .669

It follows that

F = MST/MSE = 8.5247/.669 = 12.74

In order to test H0 at the .05 level of significance, we use F.05 with p − 1 = 3 − 1 = 2 numerator and n − p = 15 − 3 = 12 denominator degrees of freedom. Table A.6 (page 867) tells us that this F point equals 3.89, so we have

F = 12.74 > F.05 = 3.89
Therefore, we reject H0 at the .05 level of significance. This says we have strong evidence that at least two of the treatment means μA, μB, and μC differ. In other words, we conclude that at least two of gasoline types A, B, and C have different effects on mean gasoline mileage.
Figure 11.2 gives the MINITAB and Excel output of an analysis of variance of the gasoline mileage data. Note that each output gives the value F = 12.74 and the related p-value, which equals .001 (rounded). Since this p-value is less than .05, we reject H0 at the .05 level of significance.
Figure 11.2: MINITAB and Excel Output of an Analysis of Variance of the Gasoline Mileage Data in Table 11.1

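Because the three gasoline samples have equal size, SST, SSE, and the F statistic can also be reproduced from the summary statistics reported earlier (the sample means and standard deviations). The Python sketch below simply evaluates the formulas of this section and uses scipy only to obtain the p-value; it is a minimal illustration, not the MINITAB or Excel procedure shown in Figure 11.2.

```python
from scipy import stats

# Summary statistics for the gasoline mileage study (Table 11.1): n_i, sample mean, sample SD.
summaries = {"A": (5, 34.92, 0.7662), "B": (5, 36.56, 0.8503), "C": (5, 33.98, 0.8349)}

p = len(summaries)
n = sum(ni for ni, _, _ in summaries.values())
grand_mean = sum(ni * mean for ni, mean, _ in summaries.values()) / n

# Between-treatment and within-treatment sums of squares.
sst = sum(ni * (mean - grand_mean) ** 2 for ni, mean, _ in summaries.values())
sse = sum((ni - 1) * sd ** 2 for ni, _, sd in summaries.values())

mst = sst / (p - 1)          # treatment mean square
mse = sse / (n - p)          # error mean square
f_stat = mst / mse
p_value = stats.f.sf(f_stat, p - 1, n - p)   # area under the F curve to the right of F

print(f"SST = {sst:.4f}, SSE = {sse:.4f}")
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")   # about F = 12.74, p ≈ .001
```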
The results of an analysis of variance are often summarized in what is called an analysis of variance table. This table gives the sums of squares (SST, SSE, SSTO), the mean squares (MST and MSE), and the F statistic and its related p-value for the ANOVA. The table also gives the degrees of freedom associated with each source of variation—treatments, error, and total. Table 11.5 gives the ANOVA table for the gasoline mileage problem. Notice that in the column labeled “Sums of Squares,” the values of SST and SSE sum to SSTO. Also notice that the upper portion of the MINITAB output and the lower portion of the Excel output give the ANOVA table of Table 11.5.
Table 11.5: Analysis of Variance Table for Testing H0: μA = μB = μC in the Gasoline Mileage Problem (p = 3 Gasoline Types, n = 15 Observations)

Before continuing, note that if we use the ANOVA F statistic to test the equality of two population means, it can be shown that
1 F equals t², where t is the equal variances t statistic discussed in Section 10.2 (pages 395–401) used to test the equality of the two population means, and
2 The critical value Fα, which is based on p − 1 = 2 − 1 = 1 and n − p = n1 + n2 − 2 degrees of freedom, equals (tα/2)², where tα/2 is the critical value for the equal variances t test and is based on n1 + n2 − 2 degrees of freedom.
Hence, the rejection conditions

F > Fα  and  |t| > tα/2
are equivalent. It can also be shown that in this case the p-value related to F equals the p-value related to t. Therefore, the ANOVA F test of the equality of p treatment means can be regarded as a generalization of the equal variances t test of the equality of two treatment means.
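This F = t² relationship can be verified numerically. The short Python sketch below compares Fα (1 and n1 + n2 − 2 degrees of freedom) with (tα/2)²; the sample sizes used are hypothetical and chosen only for the check.

```python
from scipy import stats

alpha = 0.05
n1, n2 = 8, 10                      # illustrative sample sizes, not from the text
df = n1 + n2 - 2

f_crit = stats.f.ppf(1 - alpha, 1, df)       # F_alpha with 1 and n1 + n2 - 2 df
t_crit = stats.t.ppf(1 - alpha / 2, df)      # t_{alpha/2} with n1 + n2 - 2 df

print(f"F_crit = {f_crit:.4f}, t_crit^2 = {t_crit ** 2:.4f}")  # the two values agree
```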
Pairwise comparisons
If the one-way ANOVA F test says that at least two treatment means differ, then we investigate which treatment means differ and we estimate how large the differences are. We do this by making what we call pairwise comparisons (that is, we compare treatment means two at a time). One way to make these comparisons is to compute point estimates of and confidence intervals for pairwise differences. For example, in the gasoline mileage case we might estimate the pairwise differences μA − μB, μA − μC, and μB − μC. Here, for instance, the pairwise difference μA − μB can be interpreted as the change in mean mileage achieved by changing from using gasoline type B to using gasoline type A.
There are two approaches to calculating confidence intervals for pairwise differences. The first involves computing the usual, or individual, confidence interval for each pairwise difference. Here, if we are computing 100(1 − α) percent confidence intervals, we are 100(1 − α) percent confident that each individual pairwise difference is contained in its respective interval. That is, the confidence level associated with each (individual) comparison is 100(1 − α) percent, and we refer to α as the comparisonwise error rate. However, we are less than 100(1 − α) percent confident that all of the pairwise differences are simultaneously contained in their respective intervals. A more conservative approach is to compute simultaneous confidence intervals. Such intervals make us 100(1 − α) percent confident that all of the pairwise differences are simultaneously contained in their respective intervals. That is, when we compute simultaneous intervals, the overall confidence level associated with all the comparisons being made in the experiment is 100(1 − α) percent, and we refer to α as the experimentwise error rate.
Several kinds of simultaneous confidence intervals can be computed. In this book we present what is called the Tukey formula for simultaneous intervals. We do this because, if we are interested in studying all pairwise differences between treatment means, the Tukey formula yields the most precise (shortest) simultaneous confidence intervals. In general, a Tukey simultaneous 100(1 − α) percent confidence interval is longer than the corresponding individual 100(1 − α) percent confidence interval. Thus, intuitively, we are paying a penalty for simultaneous confidence by obtaining longer intervals. One pragmatic approach to comparing treatment means is to first determine if we can use the more conservative Tukey intervals to make meaningful pairwise comparisons. If we cannot, then we might see what the individual intervals tell us. In the following box we present both individual and Tukey simultaneous confidence intervals for pairwise differences. We also present the formula for a confidence interval for a single treatment mean, which we might use after we have used pairwise comparisons to determine the “best” treatment.
Estimation in One-Way ANOVA
1 Consider the pairwise difference μi − μh, which can be interpreted to be the change in the mean value of the response variable associated with changing from using treatment h to using treatment i. Then, a point estimate of the difference μi − μh is x̄i − x̄h, where x̄i and x̄h are the sample treatment means associated with treatments i and h.
2 An individual 100(1 − α) percent confidence interval for μi − μh is

[(x̄i − x̄h) ± tα/2 √(MSE(1/ni + 1/nh))]

Here the tα/2 point is based on n − p degrees of freedom, and MSE is the previously defined error mean square found in the ANOVA table.
3 A Tukey simultaneous 100(1 − α) percent confidence interval for μi − μh is

[(x̄i − x̄h) ± qα √(MSE/m)]

Here the value qα is obtained from Table A.9 (pages 870–872), which is a table of percentage points of the studentized range. In this table qα is listed corresponding to values of p and n − p. Furthermore, we assume that the sample sizes ni and nh are equal to the same value, which we denote as m. If ni and nh are not equal, we replace √(MSE/m) by √((MSE/2)(1/ni + 1/nh)).
4 A point estimate of the treatment mean μi is x̄i, and an individual 100(1 − α) percent confidence interval for μi is

[x̄i ± tα/2 √(MSE/ni)]

Here the tα/2 point is based on n − p degrees of freedom.
EXAMPLE 11.6: The Gasoline Mileage Case
In the gasoline mileage study, we are comparing p = 3 treatment means (μA, μB, and μC). Furthermore, each sample is of size m = 5, there are a total of n = 15 observed gas mileages, and the MSE found in Table 11.5 is .669. Because q.05 = 3.77 is the entry found in Table A.9 (page 871) corresponding to p = 3 and n − p = 12, a Tukey simultaneous 95 percent confidence interval for μB − μA is
[(x̄B − x̄A) ± q.05 √(MSE/m)] = [(36.56 − 34.92) ± 3.77√(.669/5)] = [1.64 ± 1.379] = [.261, 3.019]
Similarly, Tukey simultaneous 95 percent confidence intervals for μA − μC and μB − μC are, respectively,

[(34.92 − 33.98) ± 1.379] = [−.439, 2.319]  and  [(36.56 − 33.98) ± 1.379] = [1.201, 3.959]
These intervals make us simultaneously 95 percent confident that (1) changing from gasoline type A to gasoline type B increases mean mileage by between .261 and 3.019 mpg, (2) changing from gasoline type C to gasoline type A might decrease mean mileage by as much as .439 mpg or might increase mean mileage by as much as 2.319 mpg, and (3) changing from gasoline type C to gasoline type B increases mean mileage by between 1.201 and 3.959 mpg. The first and third of these intervals make us 95 percent confident that μB is at least .261 mpg greater than μA and at least 1.201 mpg greater than μC. Therefore, we have strong evidence that gasoline type B yields the highest mean mileage of the gasoline types tested. Furthermore, noting that t.025 based on n − p = 12 degrees of freedom is 2.179, it follows that an individual 95 percent confidence interval for μB is
[x̄B ± t.025 √(MSE/nB)] = [36.56 ± 2.179√(.669/5)] = [36.56 ± .797] = [35.763, 37.357]
This interval says we can be 95 percent confident that the mean mileage obtained by using gasoline type B is between 35.763 and 37.357 mpg. Notice that this confidence interval is graphed on the MINITAB output of Figure 11.2. This output also shows the 95 percent confidence intervals for μA and μC and gives Tukey simultaneous 95 percent intervals. For example, consider finding the Tukey interval for μB − μA on the MINITAB output. To do this, we look in the table corresponding to “Type A subtracted from” and find the row in this table labeled “Type B.” This row gives the interval for “Type A subtracted from Type B”—that is, the interval for μB − μA. This interval is [.261, 3.019], as calculated above. Finally, note that the half-length of the individual 95 percent confidence interval for a pairwise comparison is (because nA = nB = nC = 5)
t.025 √(MSE(1/5 + 1/5)) = 2.179√(.669(1/5 + 1/5)) = 2.179(.5173) = 1.127
This half-length implies that the individual intervals are shorter than the previously constructed Tukey intervals, which have a half-length of 1.379. Recall, however, that the Tukey intervals are short enough to allow us to conclude with 95 percent confidence that μB is greater than μA and μC.
We next consider testing H0: μi − μh = 0 versus Ha: μi − μh ≠ 0. The test statistic t for performing this test is calculated by dividing x̄i − x̄h by √(MSE(1/ni + 1/nh)). For example, consider testing H0: μB − μA = 0 versus Ha: μB − μA ≠ 0. Since x̄B − x̄A = 36.56 − 34.92 = 1.64 and √(MSE(1/nB + 1/nA)) = √(.669(1/5 + 1/5)) = .5173, the test statistic t equals 1.64/.5173 = 3.17. This test statistic value is given in the leftmost table of the following MegaStat output, as is the test statistic value for testing H0: μB − μC = 0 (t = 4.99) and the test statistic value for testing H0: μA − μC = 0 (t = 1.82):

If we wish to use the Tukey simultaneous comparison procedure having an experimentwise error rate of α, we reject H0: μi − μh = 0 in favor of Ha: μi − μh ≠ 0 if the absolute value of t is greater than the rejection point qα/√2. Table A.9 tells us that q.05 is 3.77 and q.01 is 5.04. Therefore, the rejection points for experimentwise error rates of .05 and .01 are, respectively, 3.77/√2 = 2.67 and 5.04/√2 = 3.56 (see the MegaStat output). Suppose we set α equal to .05. Then, since the test statistic value for testing H0: μB − μA = 0 (t = 3.17) and the test statistic value for testing H0: μB − μC = 0 (t = 4.99) are greater than the rejection point 2.67, we reject both null hypotheses. This, along with the fact that x̄B = 36.56 is greater than x̄A = 34.92 and x̄C = 33.98, leads us to conclude that gasoline type B yields the highest mean mileage of the gasoline types tested (note that the MegaStat output conveniently arranges the sample means in increasing order). Finally, note that the rightmost table of the MegaStat output gives the p-values for individual (rather than simultaneous) pairwise hypothesis tests. For example, the individual p-value for testing H0: μB − μC = 0 is .0003, and the individual p-value for testing H0: μB − μA = 0 is .0081.
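All of the pairwise quantities used above—the Tukey half-length qα√(MSE/m), the simultaneous intervals, the t statistics, and the rejection point qα/√2—can be recomputed directly from the sample means, MSE, and the tabled value q.05 = 3.77. The following Python sketch is a hand computation of these quantities, not a packaged Tukey procedure such as the MegaStat output shown.

```python
import math

# Sample treatment means, common sample size, and error mean square from the gasoline study.
means = {"A": 34.92, "B": 36.56, "C": 33.98}
m = 5
mse = 0.669
q_05 = 3.77                                 # studentized range point for p = 3, n - p = 12 (Table A.9)

tukey_half = q_05 * math.sqrt(mse / m)      # Tukey half-length, about 1.379
t_reject = q_05 / math.sqrt(2)              # rejection point q_alpha / sqrt(2), about 2.67
se_diff = math.sqrt(mse * (1 / m + 1 / m))  # standard error of a pairwise difference, about .5173

for hi, lo in [("B", "A"), ("A", "C"), ("B", "C")]:
    diff = means[hi] - means[lo]
    lower, upper = diff - tukey_half, diff + tukey_half
    t_stat = diff / se_diff
    decision = "reject" if abs(t_stat) > t_reject else "do not reject"
    print(f"mu_{hi} - mu_{lo}: estimate {diff:.2f}, "
          f"Tukey 95% CI [{lower:.3f}, {upper:.3f}], t = {t_stat:.2f} ({decision} H0 at alpha = .05)")
```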
In general, when we use a completely randomized experimental design, it is important to compare the treatments by using experimental units that are essentially the same with respect to the characteristic under study. For example, in the gasoline mileage case we have used cars of the same type (Fire-Hawks) to compare the different gasoline types, and in the shelf display case we have used grocery stores of the same sales potential for the bakery product to compare the shelf display heights (the reader will analyze the data for this case in the exercises). Sometimes, however, it is not possible to use experimental units that are essentially the same with respect to the characteristic under study. For example, suppose a chain of stores that sells audio and video equipment wishes to compare the effects of street, mall, and downtown locations on the sales volume of its stores. The experimental units in this situation are the areas where the stores are located, but these areas are not of the same sales potential because each area is populated by a different number of households. In such a situation we must explicitly account for the differences in the experimental units. One way to do this is to use regression analysis, which is discussed in Chapters 13–15. When we use regression analysis to explicitly account for a variable (such as the number of households in the store’s area) that causes differences in the experimental units, we call the variable a covariate. Furthermore, we say that we are performing an analysis of covariance. Finally, another way to deal with differing experimental units is to employ a randomized block design. This experimental design is discussed in Section 11.3.
To conclude this section, we note that if we fear that the normality and/or equal variances assumptions for one-way analysis of variance do not hold, we can use a nonparametric approach to compare several populations. One such approach is the Kruskal–Wallis H test, which is discussed in Section 18.4.
Exercises for Section 11.2
CONCEPTS

11.5 Explain the assumptions that must be satisfied in order to validly use the one-way ANOVA formulas.
11.6 Explain the difference between the between-treatment variability and the within-treatment variability when performing a one-way ANOVA.
11.7 Explain why we conduct pairwise comparisons of treatment means.
11.8 Explain the difference between individual and simultaneous confidence intervals for a set of several pairwise differences.
METHODS AND APPLICATIONS
11.9 THE SHELF DISPLAY CASE BakeSale
Consider Example 11.2, and let μB, μM, and μT represent the mean monthly sales when using the bottom, middle, and top shelf display heights, respectively. Figure 11.3 gives the MINITAB output of a one-way ANOVA of the bakery sales study data in Table 11.2 (page 443).
Figure 11.3: MINITAB Output of a One-Way ANOVA of the Bakery Sales Study Data in Table 11.2

a Test the null hypothesis that μB, μM, and μT are equal by setting α = .05. On the basis of this test, can we conclude that the bottom, middle, and top shelf display heights have different effects on mean monthly sales?
b Consider the pairwise differences μM − μB, μT − μB, and μT − μM. Find a point estimate of and a Tukey simultaneous 95 percent confidence interval for each pairwise difference. Interpret the meaning of each interval in practical terms. Which display height maximizes mean sales?
c Find an individual 95 percent confidence interval for each pairwise difference in part b. Interpret each interval.
d Find 95 percent confidence intervals for μB, μM, and μT. Interpret each interval.
11.10 Consider the display panel situation in Exercise 11.3, and let μA, μB, and μC represent the mean times to stabilize the emergency condition when using display panels A, B, and C, respectively. Figure 11.4 gives the MINITAB output of a one-way ANOVA of the display panel data in Table 11.3 (page 444). Display
Figure 11.4: MINITAB Output of a One-Way ANOVA of the Display Panel Study Data in Table 11.3

a Test the null hypothesis that μA, μB, and μC are equal by setting α = .05. On the basis of this test, can we conclude that display panels A, B, and C have different effects on the mean time to stabilize the emergency condition?
b Consider the pairwise differences μB − μA, μC − μA, and μC − μB. Find a point estimate of and a Tukey simultaneous 95 percent confidence interval for each pairwise difference. Interpret the results by describing the effects of changing from using each display panel to using each of the other panels. Which display panel minimizes the time required to stabilize the emergency condition?
c Find an individual 95 percent confidence interval for each pairwise difference in part b. Interpret the results.
11.11 Consider the bottle design study situation in Exercise 11.4, and let μA, μB, and μC represent mean daily sales using bottle designs A, B, and C, respectively. Figure 11.5 gives the Excel output of a one-way ANOVA of the bottle design study data in Table 11.4 (page 444). BottleDes
Figure 11.5: Excel Output of a One-Way ANOVA of the Bottle Design Study Data in Table 11.4

a Test the null hypothesis that μA, μB, and μC are equal by setting α = .05. That is, test for statistically significant differences between these treatment means at the .05 level of significance. Based on this test, can we conclude that bottle designs A, B, and C have different effects on mean daily sales?
b Consider the pairwise differences μB − μA, μC − μA, and μC − μB. Find a point estimate of and a Tukey simultaneous 95 percent confidence interval for each pairwise difference. Interpret the results in practical terms. Which bottle design maximizes mean daily sales?
c Find an individual 95 percent confidence interval for each pairwise difference in part b. Interpret the results in practical terms.
d Find a 95 percent confidence interval for each of the treatment means μA, μB, and μC. Interpret these intervals.
11.12 In order to compare the durability of four different brands of golf balls (ALPHA, BEST, CENTURY, and DIVOT), the National Golf Association randomly selects five balls of each brand and places each ball into a machine that exerts the force produced by a 250-yard drive. The number of simulated drives needed to crack or chip each ball is recorded. The results are given in Table 11.6. The MegaStat output of a one-way ANOVA of this data is shown in Figure 11.6. Test for statistically significant differences between the treatment means μALPHA, μBEST, μCENTURY, and μDIVOT. Set α = .05. GolfBall
Table 11.6: Golf Ball Durability Test Results and a MegaStat Plot of the Results GolfBall

Figure 11.6: MegaStat Output of a One-Way ANOVA of the Golf Ball Durability Data

11.13 Perform pairwise comparisons of the treatment means in Exercise 11.12. Which brand(s) are most durable? Find a 95 percent confidence interval for each of the treatment means.
11.14 THE COMMERCIAL RESPONSE CASE
Recall from Example 11.3 that (1) 29 randomly selected subjects were exposed to commercials shown in more involving programs, (2) 29 randomly selected subjects were exposed to commercials shown in less involving programs, and (3) 29 randomly selected subjects watched commercials only (note: this is called the control group). The mean brand recall scores for these three groups were, respectively, x̄1 = 1.21, x̄2 = 2.24, and x̄3 = 2.28. Furthermore, a one-way ANOVA of the data shows that SST = 21.40 and SSE = 85.56.
a Define appropriate treatment means μ1, μ2, and μ3. Then test for statistically significant differences between these treatment means. Set α = .05.
b Perform pairwise comparisons of the treatment means by computing a Tukey simultaneous 95 percent confidence interval for each of the pairwise differences μ1 − μ2, μ1 − μ3, and μ2 − μ3. Which type of program content results in the worst mean brand recall score?
11.3: The Randomized Block Design
Not all experiments employ a completely randomized design. For instance, suppose that when we employ a completely randomized design, we fail to reject the null hypothesis of equality of treatment means because the within-treatment variability (which is measured by the SSE) is large. This could happen because differences between the experimental units are concealing true differences between the treatments. We can often remedy this by using what is called a randomized block design.

EXAMPLE 11.7: The Defective Cardboard Box Case
The Universal Paper Company manufactures cardboard boxes. The company wishes to investigate the effects of four production methods (methods 1, 2, 3, and 4) on the number of defective boxes produced in an hour. To compare the methods, the company could utilize a completely randomized design. For each of the four production methods, the company would select several (say, as an example, three) machine operators, train each operator to use the production method to which he or she has been assigned, have each operator produce boxes for one hour, and record the number of defective boxes produced. The three operators using any one production method would be different from those using any other production method. That is, the completely randomized design would utilize a total of 12 machine operators. However, the abilities of the machine operators could differ substantially. These differences might tend to conceal any real differences between the production methods. To overcome this disadvantage, the company will employ a randomized block experimental design. This involves randomly selecting three machine operators and training each operator thoroughly to use all four production methods. Then each operator will produce boxes for one hour using each of the four production methods. The order in which each operator uses the four methods should be random. We record the number of defective boxes produced by each operator using each method. The advantage of the randomized block design is that the defective rates obtained by using the four methods result from employing the same three operators. Thus any true differences in the effectiveness of the methods would not be concealed by differences in the operators’ abilities.
When Universal Paper employs the randomized block design, it obtains the 12 defective box counts in Table 11.7. We let xij denote the number of defective boxes produced by machine operator j using production method i. For example, x32 = 5 says that 5 defective boxes were produced by machine operator 2 using production method 3 (see Table 11.7). In addition to the 12 defective box counts, Table 11.7 gives the sample mean of these 12 observations, which is x̄ = 7.5833, and also gives sample treatment means and sample block means. The sample treatment means are the average defective box counts obtained when using production methods 1, 2, 3, and 4. Denoting these sample treatment means as x̄1•, x̄2•, x̄3•, and x̄4•, we see from Table 11.7 that x̄1• = 10.3333, x̄2• = 10.3333, x̄3• = 5.0, and x̄4• = 4.6667. Because x̄3• and x̄4• are less than x̄1• and x̄2•, we estimate that the mean number of defective boxes produced per hour by production method 3 or 4 is less than the mean number of defective boxes produced per hour by production method 1 or 2. The sample block means are the average defective box counts obtained by machine operators 1, 2, and 3. Denoting these sample block means as x̄•1, x̄•2, and x̄•3, we see from Table 11.7 that x̄•1 = 6.0, x̄•2 = 7.75, and x̄•3 = 9.0. Because x̄•1, x̄•2, and x̄•3 differ, we have evidence that the abilities of the machine operators differ and thus that using the machine operators as blocks is reasonable.
Table 11.7: Numbers of Defective Cardboard Boxes Obtained by Production Methods 1, 2, 3, and 4 and Machine Operators 1, 2, and 3 CardBox

In general, a randomized block design compares p treatments (for example, production methods) by using b blocks (for example, machine operators). Each block is used exactly once to measure the effect of each and every treatment. The advantage of the randomized block design over the completely randomized design is that we are comparing the treatments by using the same experimental units. Thus any true differences in the treatments will not be concealed by differences in the experimental units.
In some experiments a block consists of similar or matched sets of experimental units. For example, suppose we wish to compare the performance of business majors, science majors, and fine arts majors on a graduate school admissions test. Here the blocks might be matched sets of students. Each matched set (block) would consist of a business major, a science major, and a fine arts major selected so that each is in his or her senior year, attends the same university, and has the same grade point average. By selecting blocks in this fashion, any true differences between majors would not be concealed by differences between college classes, universities, or grade point averages.
In order to analyze the data obtained in a randomized block design, we define
xij = the value of the response variable observed when block j uses treatment i
x̄i• = the mean of the b values of the response variable observed when using treatment i
x̄•j = the mean of the p values of the response variable observed when using block j
x̄ = the mean of all bp values of the response variable observed in the experiment
The ANOVA procedure for a randomized block design partitions the total sum of squares (SSTO) into three components: the treatment sum of squares (SST), the block sum of squares (SSB), and the error sum of squares (SSE). The formula for this partitioning is
SSTO = SST + SSB + SSE
The steps for calculating these sums of squares, as well as what is measured by the sums of squares, can be summarized as follows:
Step 1: Calculate SST, which measures the amount of between-treatment variability:

SST = b[(x̄1• − x̄)² + (x̄2• − x̄)² + ··· + (x̄p• − x̄)²]

Step 2: Calculate SSB, which measures the amount of variability due to the blocks:

SSB = p[(x̄•1 − x̄)² + (x̄•2 − x̄)² + ··· + (x̄•b − x̄)²]

Step 3: Calculate SSTO, which measures the total amount of variability:

SSTO = Σi Σj (xij − x̄)²

Step 4: Calculate SSE, which measures the amount of variability due to the error:

SSE = SSTO − SST − SSB
These sums of squares are shown in Table 11.8, which is the ANOVA table for a randomized block design. This table also gives the degrees of freedom associated with each source of variation—treatments, blocks, error, and total—as well as the mean squares and F statistics used to test the hypotheses of interest in a randomized block experiment.
Table 11.8: ANOVA Table for the Randomized Block Design with p Treatments and b Blocks

Before discussing these hypotheses, we will illustrate how the entries in the ANOVA table are calculated. The sums of squares in the defective cardboard box case are calculated as follows (note that p = 4 and b = 3):
Step 1:

SST = 3[(10.3333 − 7.5833)² + (10.3333 − 7.5833)² + (5.0 − 7.5833)² + (4.6667 − 7.5833)²] = 90.9167

Step 2:

SSB = 4[(6.0 − 7.5833)² + (7.75 − 7.5833)² + (9.0 − 7.5833)²] = 18.1667

Step 3:

SSTO = Σi Σj (xij − 7.5833)² = 112.9167

Step 4:

SSE = SSTO − SST − SSB = 112.9167 − 90.9167 − 18.1667 = 3.8333
Figure 11.7 gives the MINITAB output of a randomized block ANOVA of the defective box data. This figure shows the above calculated sums of squares, as well as the degrees of freedom (recall that p = 4 and b = 3), the mean squares, and the F statistics (and associated p-values) used to test the hypotheses of interest.
Figure 11.7: MINITAB Output of a Randomized Block ANOVA of the Defective Box Data

Of main interest is the test of the null hypothesis H0 that no differences exist between the treatment effects on the mean value of the response variable versus the alternative hypothesis Ha that at least two treatment effects differ. We can reject H0 in favor of Ha at level of significance α if

F(treatments) = MST/MSE = [SST/(p − 1)]/[SSE/((p − 1)(b − 1))]
is greater than the Fα point based on p − 1 numerator and (p − 1)(b − 1) denominator degrees of freedom. In the defective cardboard box case, F.05 based on p − 1 = 3 numerator and (p − 1)(b − 1) = 6 denominator degrees of freedom is 4.76 (see Table A.6, page 867). Because
F(treatments) = (90.9167/3)/(3.8333/6) = 30.3056/.6389 = 47.43
is greater than F.05 = 4.76, we reject H0 at the .05 level of significance. Therefore, we have strong evidence that at least two production methods have different effects on the mean number of defective boxes produced per hour. Alternatively, we can reject H0 in favor of Ha at level of significance α if the p-value is less than α. Here the p-value is the area under the curve of the F distribution [having p − 1 and (p − 1)(b − 1) degrees of freedom] to the right of F(treatments). The MINITAB output in Figure 11.7 tells us that this p-value is 0.000 (that is, less than .001) for the defective box data. Therefore, we have extremely strong evidence that at least two production methods have different effects on the mean number of defective boxes produced per hour.
It is also of interest to test the null hypothesis H0 that no differences exist between the block effects on the mean value of the response variable versus the alternative hypothesis Ha that at least two block effects differ. We can reject H0 in favor of Ha at level of significance α if

F(blocks) = MSB/MSE = [SSB/(b − 1)]/[SSE/((p − 1)(b − 1))]
is greater than the Fα point based on b − 1 numerator and (p − 1)(b − 1) denominator degrees of freedom. In the defective cardboard box case, F.05 based on b − 1 = 2 numerator and (p − 1)(b − 1) = 6 denominator degrees of freedom is 5.14 (see Table A.6, page 867). Because
F(blocks) = (18.1667/2)/(3.8333/6) = 9.0833/.6389 = 14.22
is greater than F.05 = 5.14, we reject H0 at the .05 level of significance. Therefore, we have strong evidence that at least two machine operators have different effects on the mean number of defective boxes produced per hour. Alternatively, we can reject H0 in favor of Ha at level of significance α if the p-value is less than α. Here the p-value is the area under the curve of the F distribution [having b − 1 and (p − 1)(b − 1) degrees of freedom] to the right of F(blocks). The MINITAB output tells us that this p-value is .005 for the defective box data. Therefore, we have very strong evidence that at least two machine operators have different effects on the mean number of defective boxes produced per hour. This implies that using the machine operators as blocks is reasonable.
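The two randomized block F statistics can be reproduced from the sample treatment means, sample block means, and MSE reported for the defective box study. The Python sketch below evaluates the SST and SSB formulas of this section and forms F(treatments) and F(blocks); scipy is used only for the p-values, and the numbers printed should match those quoted above up to rounding.

```python
from scipy import stats

# Defective cardboard box study: p = 4 production methods (treatments), b = 3 operators (blocks).
treatment_means = [10.3333, 10.3333, 5.0, 4.6667]
block_means = [6.0, 7.75, 9.0]
grand_mean = 7.5833
mse = 0.639                        # error mean square from the randomized block ANOVA table
p, b = len(treatment_means), len(block_means)
df_error = (p - 1) * (b - 1)

sst = b * sum((tm - grand_mean) ** 2 for tm in treatment_means)   # about 90.92
ssb = p * sum((bm - grand_mean) ** 2 for bm in block_means)       # about 18.17

f_treatments = (sst / (p - 1)) / mse
f_blocks = (ssb / (b - 1)) / mse

print(f"F(treatments) = {f_treatments:.2f}, "
      f"p-value = {stats.f.sf(f_treatments, p - 1, df_error):.4f}")
print(f"F(blocks) = {f_blocks:.2f}, "
      f"p-value = {stats.f.sf(f_blocks, b - 1, df_error):.4f}")
```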
If, in a randomized block design, we conclude that at least two treatment effects differ, we can perform pairwise comparisons to determine how they differ.
Point Estimates and Confidence Intervals in a Randomized Block ANOVA
Consider the difference between the effects of treatments i and h on the mean value of the response variable. Then:
1 A point estimate of this difference is x̄i• − x̄h•.

2 An individual 100(1 − α) percent confidence interval for this difference is

[(x̄i• − x̄h•) ± tα/2 s√(2/b)]
Here tα/2 is based on (p − 1)(b − 1) degrees of freedom, and s is the square root of the MSE found in the randomized block ANOVA table.
3 A Tukey simultaneous 100(1 − α) percent confidence interval for this difference is

[(x̄i• − x̄h•) ± qα (s/√b)]
Here the value qα is obtained from Table A.9 (pages 870–872), which is a table of percentage points of the studentized range. In this table qα is listed corresponding to values of p and (p − 1)(b − 1).
EXAMPLE 11.8: The Defective Cardboard Box Case
We have previously concluded that we have extremely strong evidence that at least two production methods have different effects on the mean number of defective boxes produced per hour. We have also seen that the sample treatment means are x̄1• = 10.3333, x̄2• = 10.3333, x̄3• = 5.0, and x̄4• = 4.6667. Since x̄4• is the smallest sample treatment mean, we will use Tukey simultaneous 95 percent confidence intervals to compare the effect of production method 4 with the effects of production methods 1, 2, and 3. To compute these intervals, we first note that q.05 = 4.90 is the entry in Table A.9 (page 871) corresponding to p = 4 and (p − 1)(b − 1) = 6. Also, note that the MSE found in the randomized block ANOVA table is .639 (see Figure 11.7), which implies that s = √.639 = .7994. It follows that a Tukey simultaneous 95 percent confidence interval for the difference between the effects of production methods 4 and 1 on the mean number of defective boxes produced per hour is

[(x̄4• − x̄1•) ± q.05 (s/√b)] = [(4.6667 − 10.3333) ± 4.90(.7994/√3)] = [−5.6666 ± 2.2615] = [−7.9281, −3.4051]
Furthermore, it can be verified that a Tukey simultaneous 95 percent confidence interval for the difference between the effects of production methods 4 and 2 on the mean number of defective boxes produced per hour is also [−7.9281, −3.4051]. Therefore, we can be 95 percent confident that changing from production method 1 or 2 to production method 4 decreases the mean number of defective boxes produced per hour by a machine operator by between 3.4051 and 7.9281 boxes. A Tukey simultaneous 95 percent confidence interval for the difference between the effects of production methods 4 and 3 on the mean number of defective boxes produced per hour is
[(x̄4• − x̄3•) ± 2.2615] = [(4.6667 − 5.0) ± 2.2615] = [−2.5948, 1.9282]
This interval tells us (with 95 percent confidence) that changing from production method 3 to production method 4 might decrease the mean number of defective boxes produced per hour by as many as 2.5948 boxes or might increase this mean by as many as 1.9282 boxes. In other words, because this interval contains 0, we cannot conclude that the effects of production methods 4 and 3 differ.
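The Tukey simultaneous intervals of Example 11.8 follow the same pattern and can be computed by hand from q.05 = 4.90, s = √.639, and b = 3, as in this minimal Python sketch.

```python
import math

# Sample treatment means for production methods 1-4 in the defective box study.
treatment_means = {1: 10.3333, 2: 10.3333, 3: 5.0, 4: 4.6667}
b = 3                         # number of blocks (machine operators)
s = math.sqrt(0.639)          # square root of the MSE from the randomized block ANOVA table
q_05 = 4.90                   # studentized range point for p = 4 and (p - 1)(b - 1) = 6

half_width = q_05 * s / math.sqrt(b)    # common half-width, about 2.26

for other in (1, 2, 3):
    diff = treatment_means[4] - treatment_means[other]
    print(f"method 4 vs method {other}: "
          f"[{diff - half_width:.4f}, {diff + half_width:.4f}]")
```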
Exercises for Section 11.3
CONCEPTS

11.15 In your own words, explain why we sometimes employ the randomized block design.
11.16 How can we test to determine if the blocks we have chosen are reasonable?
METHODS AND APPLICATIONS
11.17 A marketing organization wishes to study the effects of four sales methods on weekly sales of a product. The organization employs a randomized block design in which three salesmen use each sales method. The results obtained are given in Table 11.9. Figure 11.8 gives the Excel output of a randomized block ANOVA of the sales method data. SaleMeth
Table 11.9: Results of a Sales Method Experiment Employing a Randomized Block Design SaleMeth

Figure 11.8: Excel Output of a Randomized Block ANOVA of the Sales Method Data Given in Table 11.9

a Test the null hypothesis H0 that no differences exist between the effects of the sales methods (treatments) on mean weekly sales. Set α = .05. Can we conclude that the different sales methods have different effects on mean weekly sales?
b Test the null hypothesis H0 that no differences exist between the effects of the salesmen (blocks) on mean weekly sales. Set α = .05. Can we conclude that the different salesmen have different effects on mean weekly sales?
c Use Tukey simultaneous 95 percent confidence intervals to make pairwise comparisons of the sales method effects on mean weekly sales. Which sales method(s) maximize mean weekly sales?
11.18 A consumer preference study involving three different bottle designs (A, B, and C) for the jumbo size of a new liquid laundry detergent was carried out using a randomized block experimental design, with supermarkets as blocks. Specifically, four supermarkets were supplied with all three bottle designs, which were priced the same. Table 11.10 gives the number of bottles of each design sold in a 24-hour period at each supermarket. If we use these data, SST, SSB, and SSE can be calculated to be 586.1667, 421.6667, and 1.8333, respectively. BottleDes2
Table 11.10: Results of a Bottle Design Experiment BottleDes2

a Test the null hypothesis H0 that no differences exist between the effects of the bottle designs on mean daily sales. Set α = .05. Can we conclude that the different bottle designs have different effects on mean sales?
b Test the null hypothesis H0 that no differences exist between the effects of the supermarkets on mean daily sales. Set α = .05. Can we conclude that the different supermarkets have different effects on mean sales?
c Use Tukey simultaneous 95 percent confidence intervals to make pairwise comparisons of the bottle design effects on mean daily sales. Which bottle design(s) maximize mean sales?
11.19 To compare three brands of computer keyboards, four data entry specialists were randomly selected. Each specialist used all three keyboards to enter the same kind of text material for 10 minutes, and the number of words entered per minute was recorded. The data obtained are given in Table 11.11. If we use these data, SST, SSB, and SSE can be calculated to be 392.6667, 143.5833, and 2.6667, respectively. Keyboard
Table 11.11: Results of a Keyboard Experiment Keyboard

a Test the null hypothesis H0 that no differences exist between the effects of the keyboard brands on the mean number of words entered per minute. Set α = .05.
b Test the null hypothesis H0 that no differences exist between the effects of the data entry specialists on the mean number of words entered per minute. Set α = .05.
c Use Tukey simultaneous 95 percent confidence intervals to make pairwise comparisons of the keyboard brand effects on the mean number of words entered per minute. Which keyboard brand maximizes the mean number of words entered per minute?
11.20 In an advertisement in a local newspaper, Best Food supermarket attempted to convince consumers that it offered them the lowest total food bill. To do this, Best Food presented the following comparison of the prices of 60 grocery items purchased at three supermarkets—Best Food, Public, and Cash’ N Carry—on a single day. BestFood

If we use these data to compare the mean prices of grocery items at the three supermarkets, then we have a randomized block design where the treatments are the three supermarkets and the blocks are the 60 grocery items. Figure 11.9 gives the MegaStat output of a randomized block ANOVA of the supermarket data.
Figure 11.9: MegaStat Output of a Randomized Block ANOVA of the Supermarket Data for Exercise 11.20

a Test the null hypothesis H0 that no differences exist between the mean prices of grocery items at the three supermarkets. Do the three supermarkets differ with respect to mean grocery prices?
b Make pairwise comparisons of the mean prices of grocery items at the three supermarkets. Which supermarket has the lowest mean prices?
11.21 The Coca-Cola Company introduced new Coke in 1985. Within three months of this introduction, negative consumer reaction forced Coca-Cola to reintroduce the original formula of Coke as Coca-Cola classic. Suppose that two years later, in 1987, a marketing research firm in Chicago compared the sales of Coca-Cola classic, new Coke, and Pepsi in public building vending machines. To do this, the marketing research firm randomly selected 10 public buildings in Chicago having both a Coke machine (selling Coke classic and new Coke) and a Pepsi machine. The data—in number of cans sold over a given period of time—and a MegaStat randomized block ANOVA of the data are as follows: Coke

a Test the null hypothesis H0 that no differences exist between the mean sales of Coca-Cola classic, new Coke, and Pepsi in Chicago public building vending machines. Set α = .05.
b Make pairwise comparisons of the mean sales of Coca-Cola classic, new Coke, and Pepsi in Chicago public building vending machines.
c By the mid-1990s the Coca-Cola Company had discontinued making new Coke and had returned to making only its original product. Is there evidence in the 1987 study that this might happen? Explain your answer.
11.4: Two-Way Analysis of Variance
Many response variables are affected by more than one factor. Because of this we must often conduct experiments in which we study the effects of several factors on the response. In this section we consider studying the effects of two factors on a response variable. To begin, recall that in Example 11.2 we discussed an experiment in which the Tastee Bakery Company investigated the effect of shelf display height on monthly demand for one of its bakery products. This one-factor experiment is actually a simplification of a two-factor experiment carried out by the Tastee Bakery Company. We discuss this two-factor experiment in the following example.
EXAMPLE 11.9: The Shelf Display Case
The Tastee Bakery Company supplies a bakery product to many metropolitan supermarkets. The company wishes to study the effects of two factors—shelf display height and shelf display width—on monthly demand (measured in cases of 10 units each) for this product. The factor “display height” is defined to have three levels: B (bottom), M (middle), and T (top). The factor “display width” is defined to have two levels: R (regular) and W (wide). The treatments in this experiment are display height and display width combinations. These treatments are

BR, MR, TR, BW, MW, and TW
Here, for example, the notation BR denotes the treatment “bottom display height and regular display width.” For each display height and width combination the company randomly selects a sample of m = 3 metropolitan area supermarkets (all supermarkets used in the study will be of equal sales potential). Each supermarket sells the product for one month using its assigned display height and width combination, and the month’s demand for the product is recorded. The six samples obtained in this experiment are given in Table 11.12. We let xij,k denote the monthly demand obtained at the kth supermarket that used display height i and display width j. For example, xMW,2 = 78.4 is the monthly demand obtained at the second supermarket that used a middle display height and a wide display.
Table 11.12: Six Samples of Monthly Demands for a Bakery Product BakeSale2

In addition to giving the six samples, Table 11.12 gives the sample treatment mean for each display height and display width combination. For example, x̄BR = 55.9 is the mean of the sample of three demands observed at supermarkets using a bottom display height and a regular display width. The table also gives the sample mean demand for each level of display height (B, M, and T) and for each level of display width (R and W). Specifically,

x̄B• = 55.8, x̄M• = 77.2, x̄T• = 51.5, x̄•R = 60.8, and x̄•W = 62.2

Finally, Table 11.12 gives x̄ = 61.5, which is the overall mean of the total of 18 demands observed in the experiment. Because x̄M• = 77.2 is considerably larger than x̄B• = 55.8 and x̄T• = 51.5, we estimate that mean monthly demand is highest when using a middle display height. Since x̄•R = 60.8 and x̄•W = 62.2 do not differ by very much, we estimate there is little difference between the effects of a regular display width and a wide display on mean monthly demand.
Figure 11.10 presents a graphical analysis of the bakery demand data. In this figure we plot, for each display width (R and W), the change in the sample treatment mean demand associated with changing the display height from bottom (B) to middle (M) to top (T). Note that, for either a regular display width (R) or a wide display (W), the middle display height (M) gives the highest mean monthly demand. Also, note that, for either a bottom, middle, or top display height, there is little difference between the effects of a regular display width and a wide display on mean monthly demand. This sort of graphical analysis is useful in determining whether a condition called interaction exists. We explain the meaning of interaction in the following discussion.
Figure 11.10: Graphical Analysis of the Bakery Demand Data

In general, suppose we wish to study the effects of two factors on a response variable. We assume that the first factor, which we refer to as factor 1, has a levels (levels 1, 2,…, a). Further, we assume that the second factor, which we refer to as factor 2, has b levels (levels 1, 2,…, b). Here a treatment is considered to be a combination of a level of factor 1 and a level of factor 2. It follows that there are a total of ab treatments, and we assume that we will employ a completely randomized experimental design in which we will assign m experimental units to each treatment. This procedure results in our observing m values of the response variable for each of the ab treatments, and in this case we say that we are performing a two-factor factorial experiment.
The method we will explain for analyzing the results of a two-factor factorial experiment is called two-way analysis of variance or two-way ANOVA. This method assumes that we have obtained a random sample corresponding to each and every treatment, and that the sample sizes are equal (as described above). Further, we can assume that the samples are independent because we have employed a completely randomized experimental design. In addition, we assume that the populations of values of the response variable associated with the treatments have normal distributions with equal variances.
In order to understand the various ways in which factor 1 and factor 2 might affect the mean response, consider Figure 11.11. It is possible that only factor 1 significantly affects the mean response [see Figure 11.11(a)]. On the other hand, it is possible that only factor 2 significantly affects the mean response [see Figure 11.11(b)]. It is also possible that both factors 1 and 2 significantly affect the mean response. If this is so, these factors might affect the mean response independently [see Figure 11.11(c)], or these factors might interact as they affect the mean response [see Figure 11.11(d)]. In general, we say that there is interaction between factors 1 and 2 if the relationship between the mean response and one of the factors depends upon the level of the other factor. This is clearly true in Figure 11.11(d). Note here that at levels 1 and 3 of factor 1, level 1 of factor 2 gives the highest mean response, whereas at level 2 of factor 1, level 2 of factor 2 gives the highest mean response. On the other hand, the parallel line plots in Figure 11.11(a), (b), and (c) indicate a lack of interaction between factors 1 and 2. To graphically check for interaction, we can plot the sample treatment means, as we have done in Figure 11.10. If we obtain essentially parallel line plots, then it might be reasonable to conclude that there is little or no interaction between factors 1 and 2 (this is true in Figure 11.10). On the other hand, if the line plots are not parallel, then it might be reasonable to conclude that factors 1 and 2 interact.
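For readers working outside the packages covered in the chapter appendices, line plots like those in Figure 11.10 are easy to reproduce. The following is a minimal Python sketch (assuming matplotlib is available); the cell means for the BW, TR, and TW treatments are reconstructed from the marginal means quoted above and should be treated as illustrative values only, not the textbook's raw data.

import matplotlib.pyplot as plt

heights = ["Bottom", "Middle", "Top"]
# Treatment mean demands; BW, TR, and TW are reconstructed from the
# marginal means quoted in the text and are illustrative only.
regular = [55.9, 75.5, 51.0]
wide = [55.7, 78.9, 52.0]

plt.plot(heights, regular, marker="o", label="Regular width (R)")
plt.plot(heights, wide, marker="s", label="Wide display (W)")
plt.xlabel("Display height")
plt.ylabel("Sample treatment mean demand (cases)")
plt.legend()
plt.show()   # roughly parallel lines suggest little or no interaction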
Figure 11.11: Different Possible Treatment Effects in Two-Way ANOVA

In addition to graphical analysis, analysis of variance is a useful tool for analyzing the data from a two-factor factorial experiment. To explain the ANOVA approach for analyzing such an experiment, we define
x̄ij = the mean of the sample of m values of the response variable observed when using level i of factor 1 and level j of factor 2 (the treatment mean)
x̄i• = the mean of all bm values of the response variable observed at level i of factor 1
x̄•j = the mean of all am values of the response variable observed at level j of factor 2
x̄ = the mean of all abm values of the response variable observed in the experiment
The ANOVA procedure for a two-factor factorial experiment partitions the total sum of squares (SSTO) into four components: the factor 1 sum of squares-SS(1), the factor 2 sum of squares-SS(2), the interaction sum of squares-SS(int), and the error sum of squares-SSE. The formula for this partitioning is as follows:
SSTO = SS (1) + SS (2) + SS ( int ) + SSE
The steps for calculating these sums of squares, as well as what is measured by the sums of squares, can be summarized as follows:
Step 1: Calculate SSTO, which measures the total amount of variability:
SSTO = Σi Σj Σk (xij,k − x̄)2
Step 2: Calculate SS(1), which measures the amount of variability due to the different levels of factor 1:
SS(1) = bm Σi (x̄i• − x̄)2
Step 3: Calculate SS(2), which measures the amount of variability due to the different levels of factor 2:
SS(2) = am Σj (x̄•j − x̄)2
Step 4: Calculate SS(interaction), which measures the amount of variability due to the interaction between factors 1 and 2:
SS(int) = m Σi Σj (x̄ij − x̄i• − x̄•j + x̄)2
Step 5: Calculate SSE, which measures the amount of variability due to the error:
SSE = SSTO − SS (1) − SS (2) − SS ( int )
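As a rough cross-check of Steps 1 through 5, the partitioning can be computed directly. The sketch below uses NumPy and a small hypothetical data array (not the bakery data), with factor 1 on the first axis, factor 2 on the second, and the m replicates on the third.

import numpy as np

# Hypothetical responses x[i, j, k]: a = 3 levels of factor 1,
# b = 2 levels of factor 2, m = 3 replicates (not the textbook's data).
x = np.array([[[58.2, 53.7, 55.8], [61.0, 56.9, 59.4]],
              [[73.0, 78.1, 75.5], [77.2, 80.4, 79.1]],
              [[52.4, 49.7, 51.3], [50.8, 54.0, 52.6]]])
a, b, m = x.shape

xbar = x.mean()                  # overall mean
xbar_i = x.mean(axis=(1, 2))     # factor 1 level means
xbar_j = x.mean(axis=(0, 2))     # factor 2 level means
xbar_ij = x.mean(axis=2)         # treatment (cell) means

ssto = ((x - xbar) ** 2).sum()
ss1 = b * m * ((xbar_i - xbar) ** 2).sum()
ss2 = a * m * ((xbar_j - xbar) ** 2).sum()
ss_int = m * ((xbar_ij - xbar_i[:, None] - xbar_j[None, :] + xbar) ** 2).sum()
sse = ssto - ss1 - ss2 - ss_int
print(ssto, ss1, ss2, ss_int, sse)   # the four components sum to SSTO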
These sums of squares are shown in Table 11.13, which is called a two-way analysis of variance (ANOVA) table. This table also gives the degrees of freedom associated with each source of variation—factor 1, factor 2, interaction, error, and total—as well as the mean squares and F statistics used to test the hypotheses of interest in a two-factor factorial experiment.
Before discussing these hypotheses, we will illustrate how the entries in the ANOVA table are calculated. The sums of squares in the shelf display case are calculated as follows (note that a = 3, b = 2, and m = 3):
Table 11.13: Two-Way ANOVA Table

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Figure 11.12 gives the MINITAB output of a two-way ANOVA for the shelf display data. This figure shows the above calculated sums of squares, as well as the degrees of freedom (recall that a = 3, b = 2, and m = 3), mean squares, and F statistics used to test the hypotheses of interest.
Figure 11.12: MINITAB Output of a Two-Way ANOVA of the Shelf Display Data

We first test the null hypothesis H0 that no interaction exists between factors 1 and 2 versus the alternative hypothesis Ha that interaction does exist. We can reject H0 in favor of Ha at level of significance α if
F(int) = MS(int) / MSE
is greater than the Fα point based on (a − 1)(b − 1) numerator and ab(m − 1) denominator degrees of freedom. In the shelf display case, F.05 based on (a − 1)(b − 1) = 2 numerator and ab(m − 1) = 12 denominator degrees of freedom is 3.89 (see Table A.6, page 867). Because

is less than F.05 = 3.89, we cannot reject H0 at the .05 level of significance. We conclude that little or no interaction exists between shelf display height and shelf display width. That is, we conclude that the relationship between mean demand for the bakery product and shelf display height depends little (or not at all) on the shelf display width. Further, we conclude that the relationship between mean demand and shelf display width depends little (or not at all) on the shelf display height. Notice that these conclusions are suggested by the previously given plots of Figure 11.10 (page 465).
In general, when we conclude that little or no interaction exists between factors 1 and 2, we can (separately) test the significance of each of factors 1 and 2. We call this testing the significance of the main effects (what we do if we conclude that interaction does exist between factors 1 and 2 will be discussed at the end of this section).
To test the significance of factor 1, we test the null hypothesis H0 that no differences exist between the effects of the different levels of factor 1 on the mean response versus the alternative hypothesis Ha that at least two levels of factor 1 have different effects. We can reject H0 in favor of Ha at level of significance α if
F(1) = MS(1) / MSE
is greater than the Fα point based on a − 1 numerator and ab(m − 1) denominator degrees of freedom. In the shelf display case, F.05 based on a − 1 = 2 numerator and ab(m − 1) = 12 denominator degrees of freedom is 3.89. Because

is greater than F.05 = 3.89, we can reject H0 at the .05 level of significance. Therefore, we have strong evidence that at least two of the bottom, middle, and top display heights have different effects on mean monthly demand.
To test the significance of factor 2, we test the null hypothesis H0 that no differences exist between the effects of the different levels of factor 2 on the mean response versus the alternative hypothesis Ha that at least two levels of factor 2 have different effects. We can reject H0 in favor of Ha at level of significance α if
F(2) = MS(2) / MSE
is greater than the Fα point based on b − 1 numerator and ab(m − 1) denominator degrees of freedom. In the shelf display case, F.05 based on b − 1 = 1 numerator and ab(m − 1) = 12 denominator degrees of freedom is 4.75. Because

is less than F.05 = 4.75, we cannot reject H0 at the .05 level of significance. Therefore, we do not have strong evidence that the regular display width and the wide display have different effects on mean monthly demand.
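The three F tests just described can also be obtained in one pass from general-purpose software. Below is a minimal sketch using the statsmodels package; the column names and the data are hypothetical (they are not the textbook's BakeSale2 file), and anova_lm reports an F statistic and p-value for the height effect, the width effect, and the interaction.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one row per supermarket-month
# (column names are ours, not the textbook's file layout).
df = pd.DataFrame({
    "demand": [58.2, 53.7, 55.8, 61.0, 56.9, 59.4,
               73.0, 78.1, 75.5, 77.2, 80.4, 79.1,
               52.4, 49.7, 51.3, 50.8, 54.0, 52.6],
    "height": ["B"] * 6 + ["M"] * 6 + ["T"] * 6,
    "width": (["R"] * 3 + ["W"] * 3) * 3,
})

model = smf.ols("demand ~ C(height) * C(width)", data=df).fit()
print(anova_lm(model, typ=2))   # F and p-values for height, width, interaction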
If, in a two-factor factorial experiment, we conclude that at least two levels of factor 1 have different effects or at least two levels of factor 2 have different effects, we can make pairwise comparisons to determine how the effects differ.
Point Estimates and Confidence Intervals in Two-Way ANOVA
1 Consider the difference between the effects of levels i and i′ of factor 1 on the mean value of the response variable.

a A point estimate of this difference is x̄i• − x̄i′•

b An individual 100(1 − α) percent confidence interval for this difference is
[(x̄i• − x̄i′•) ± tα/2 √(2 · MSE / (bm))]
where the tα/2 point is based on ab(m − 1) degrees of freedom, and MSE is the error mean square found in the two-way ANOVA table.
c A Tukey simultaneous 100(1 − α) percent confidence interval for this difference (in the set of all possible paired differences between the effects of the different levels of factor 1) is
[(x̄i• − x̄i′•) ± qα √(MSE / (bm))]
where qα is obtained from Table A.9 (pages 870–872), which is a table of percentage points of the studentized range. Here qα is listed corresponding to values of a and ab(m − 1).
2 Consider the difference between the effects of levels j and j′ of factor 2 on the mean value of the response variable.

a A point estimate of this difference is x̄•j − x̄•j′

b An individual 100(1 − α) percent confidence interval for this difference is
[(x̄•j − x̄•j′) ± tα/2 √(2 · MSE / (am))]
where the tα/2 point is based on ab(m −1) degrees of freedom.
c A Tukey simultaneous 100(1 − α) percent confidence interval for this difference (in the set of all possible paired differences between the effects of the different levels of factor 2) is
[(x̄•j − x̄•j′) ± qα √(MSE / (am))]
where qα is obtained from Table A.9 and is listed corresponding to values of b and ab(m − 1).
3 Let μij denote the mean value of the response variable obtained when using level i of factor 1 and level j of factor 2. A point estimate of μij is x̄ij, and an individual 100(1 − α) percent confidence interval for μij is
[x̄ij ± tα/2 √(MSE / m)]
where the tα/2 point is based on ab(m − 1) degrees of freedom.
EXAMPLE 11.10: The Shelf Display Case
We have previously concluded that at least two of the bottom, middle, and top display heights have different effects on mean monthly demand. Since x̄M = 77.2 is greater than x̄B = 55.8 and x̄T = 51.5, we will use Tukey simultaneous 95 percent confidence intervals to compare the effect of a middle display height with the effects of the bottom and top display heights. To compute these intervals, we first note that q.05 = 3.77 is the entry in Table A.9 (page 871) corresponding to a = 3 and ab(m − 1) = 12. Also note that the MSE found in the two-way ANOVA table is 6.12 (see Figure 11.12). It follows that a Tukey simultaneous 95 percent confidence interval for the difference between the effects of a middle and bottom display height on mean monthly demand is

This interval says we are 95 percent confident that changing from a bottom display height to a middle display height will increase the mean demand for the bakery product by between 17.5925 and 25.2075 cases per month. Similarly, a Tukey simultaneous 95 percent confidence interval for the difference between the effects of a middle and top display height on mean monthly demand is

This interval says we are 95 percent confident that changing from a top display height to a middle display height will increase mean demand for the bakery product by between 21.8925 and 29.5075 cases per month. Together, these intervals make us 95 percent confident that a middle shelf display height is, on average, at least 17.5925 cases sold per month better than a bottom shelf display height and at least 21.8925 cases sold per month better than a top shelf display height.
Next, recall that previously conducted F tests suggest that there is little or no interaction between display height and display width and that there is little difference between using a regular display width and a wide display. However, intuitive and graphical analysis should always be used to supplement the results of hypothesis testing. In this case, note from Table 11.12 (page 464) that x̄MR = 75.5 and x̄MW = 78.9. This implies that we estimate that, when we use a middle display height, changing from a regular display width to a wide display increases mean monthly demand by 3.4 cases (or 34 units). This slight increase can be seen in Figure 11.10 (page 465) and suggests that it might be best (depending on what supermarkets charge for different display heights and widths) for the bakery to use a wide display with a middle display height. Since t.025 based on ab(m − 1) = 12 degrees of freedom is 2.179, an individual 95 percent confidence interval for μMW, the mean demand obtained when using a middle display height and a wide display, is

This interval says that, when we use a middle display height and a wide display, we can be 95 percent confident that mean demand for the bakery product will be between 75.7878 and 82.0122 cases per month.
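The endpoints quoted in Example 11.10 can be reproduced with a few lines of Python, assuming the interval forms x̄i• − x̄i′• ± qα √(MSE/(bm)) for the Tukey comparisons and x̄ij ± tα/2 √(MSE/m) for the individual interval, which are consistent with the numbers reported above.

import math
from scipy import stats

a, b, m = 3, 2, 3
mse = 6.12                                     # error mean square from Figure 11.12
q_05 = 3.77                                    # studentized range point for a = 3, df = 12
t_025 = stats.t.ppf(0.975, a * b * (m - 1))    # about 2.179 for 12 df

half_q = q_05 * math.sqrt(mse / (b * m))
print("middle vs bottom:", 77.2 - 55.8 - half_q, 77.2 - 55.8 + half_q)  # about 17.59 to 25.21
print("middle vs top:   ", 77.2 - 51.5 - half_q, 77.2 - 51.5 + half_q)  # about 21.89 to 29.51

half_t = t_025 * math.sqrt(mse / m)
print("mu_MW interval:  ", 78.9 - half_t, 78.9 + half_t)                # about 75.79 to 82.01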
If we conclude that (substantial) interaction exists between factors 1 and 2, the effects of changing the level of one factor will depend on the level of the other factor. In this case, we cannot separate the analysis of the effects of the levels of the two factors. One simple alternative procedure is to use one-way ANOVA (see Section 11.2) to compare all of the treatment means (the μij’s) with the possible purpose of finding the best combination of levels of factors 1 and 2. For example, if there had been (substantial) interaction in the shelf display case, we could have used one-way ANOVA to compare the six treatment means—μBR, μBW, μMR, μMW, μTR, and μTW—to find the best combination of display height and width. Alternatively, we could study the effects of the different levels of one factor at a specified level of the other factor. This is what we did at the end of the shelf display case, when we noticed that at a middle display height, a wide display seemed slightly more effective than a regular display width.
Finally, we might wish to study the effects of more than two factors on a response variable of interest. The ideas involved in such a study are an extension of those involved in a two-way ANOVA. Although studying more than two factors is beyond the scope of this text, a good reference is Neter, Kutner, Nachtsheim, and Wasserman (1996).
Exercises for Section 11.4
CONCEPTS

11.22 What is a treatment in the context of a two-factor factorial experiment?
11.23 Explain what we mean when we say that
a Interaction exists between factor 1 and factor 2.
b No interaction exists between the factors.
METHODS AND APPLICATIONS
11.24 An experiment is conducted to study the effects of two sales approaches—high-pressure (H) and low-pressure (L)—and to study the effects of two sales pitches (1 and 2) on the weekly sales of a product. The data in Table 11.14 are obtained by using a completely randomized design, and Figure 11.13 gives the Excel output of a two-way ANOVA of the sales experiment data. SaleMeth2
Table 11.14: Results of the Sales Approach Experiment SaleMeth2

Figure 11.13: Excel Output of a Two-Way ANOVA of the Sales Approach Data

a Perform graphical analysis to check for interaction between sales pressure and sales pitch.
b Test for interaction by setting α = .05.
c Test for differences in the effects of the levels of sales pressure by setting α = .05. That is, test the significance of sales pressure effects with α = .05.
d Calculate and interpret a 95 percent individual confidence interval for μ H• − μ L•

e Test for differences in the effects of the levels of sales pitch by setting α = .05. That is, test the significance of sales pitch effects with α = .05.
f Calculate and interpret a 95 percent individual confidence interval for μ•1 − μ•2.
g Calculate a 95 percent (individual) confidence interval for mean sales when using high sales pressure and sales pitch 1. Interpret this interval.
11.25 A study compared three display panels used by air traffic controllers. Each display panel was tested for four different simulated emergency conditions. Twenty-four highly trained air traffic controllers were used in the study. Two controllers were randomly assigned to each display panel–emergency condition combination. The time (in seconds) required to stabilize the emergency condition was recorded. The data in Table 11.15 were observed. Figure 11.14 presents the MegaStat output of a two-way ANOVA of the display panel data. Display2
Table 11.15: Results of a Two-Factor Display Panel Experiment Display2

Figure 11.14: MegaStat Output of a Two-Way ANOVA of the Display Panel Data

a Interpret the MegaStat interaction plot in Figure 11.14. Then test for interaction with α = .05.
b Test the significance of display panel effects with α = .05.
c Test the significance of emergency condition effects with α = .05.
d Make pairwise comparisons of display panels A, B, and C.
e Make pairwise comparisons of emergency conditions 1, 2, 3, and 4.
f Which display panel minimizes the time required to stabilize an emergency condition? Does your answer depend on the emergency condition? Why?
g Calculate a 95 percent (individual) confidence interval for the mean time required to stabilize emergency condition 4 using display panel B.
11.26 A telemarketing firm has studied the effects of two factors on the response to its television advertisements. The first factor is the time of day at which the ad is run, while the second is the position of the ad within the hour. The data in Table 11.16, which were obtained by using a completely randomized experimental design, give the number of calls placed to an 800 number following a sample broadcast of the advertisement. If we use MegaStat to analyze these data, we obtain the output in Figure 11.15. TelMktResp
Table 11.16: Results of a Two-Factor Telemarketing Response Experiment TelMktResp

Figure 11.15: MegaStat Output of a Two-Way ANOVA of the Telemarketing Data

a Perform graphical analysis to check for interaction between time of day and position of advertisement. Explain your conclusion. Then test for interaction with α = .05.
b Test the significance of time of day effects with α = .05.
c Test the significance of position of advertisement effects with α = .05.
d Make pairwise comparisons of the morning, afternoon, and evening times.
e Make pairwise comparisons of the four ad positions.
f Which time of day and advertisement position maximizes consumer response? Compute a 95 percent (individual) confidence interval for the mean number of calls placed for this time of day/ad position combination.
11.27 A small builder of speculative homes builds three basic house designs and employs two foremen. The builder has used each foreman to build two houses of each design and has obtained the profits given in Table 11.17 (the profits are given in thousands of dollars). Figure 11.16 presents the MINITAB output of a two-way ANOVA of the house profitability data. HouseProf
Figure 11.16: MINITAB Output of a Two-Way ANOVA of the House Profitability Data

Table 11.17: Results of the House Profitability Study HouseProf

a Interpret the MINITAB interaction plot in Figure 11.16. Then test for interaction with α = .05. Can we (separately) test for the significance of house design and foreman effects? Explain why or why not.
b Which house design/foreman combination gets the highest profit? When we analyze the six house design/foreman combinations using one-way ANOVA, we obtain MSE = .390. Compute a 95 percent (individual) confidence interval for mean profit when the best house design/foreman combination is employed.
11.28 In the article “Humor in American, British, and German ads” (Industrial Marketing Management, vol. 22, 1993), L. S. McCullough and R. K. Taylor study humor in trade magazine advertisements. A sample of 665 ads was categorized according to two factors: nationality (American, British, or German) and industry (29 levels, ranging from accounting to travel). A panel of judges ranked the degree of humor in each ad on a five-point scale. When the resulting data were analyzed using two-way ANOVA, the p-values for testing the significance of nationality, industry, and the interaction between nationality and industry were, respectively, .087, .000, and .046. Discuss why these p-values agree with the following verbal conclusions of the authors: “British ads were more likely to be humorous than German or American ads in the graphics industry. German ads were least humorous in the grocery and mining industries, but funnier than American ads in the medical industry and funnier than British ads in the packaging industry.”
Chapter Summary
We began this chapter by introducing some basic concepts of experimental design. We saw that we carry out an experiment by setting the values of one or more factors before the values of the response variable are observed. The different values (or levels) of a factor are called treatments, and the purpose of most experiments is to compare and estimate the effects of the various treatments on the response variable. We saw that the different treatments are assigned to experimental units, and we discussed the completely randomized experimental design. This design assigns independent, random samples of experimental units to the treatments.
We began studying how to analyze experimental data by discussing one-way analysis of variance (one-way ANOVA). Here we study how one factor (having p levels) affects the response variable. In particular, we learned how to use this methodology to test for differences between the treatment means and to estimate the size of pairwise differences between the treatment means.
Sometimes, even if we randomly select the experimental units, differences between the experimental units conceal differences between the treatments. In such a case, we learned that we can employ a randomized block design. Each block (experimental unit or set of experimental units) is used exactly once to measure the effect of each and every treatment. Because we are comparing the treatments by using the same experimental units, any true differences between the treatments will not be concealed by differences between the experimental units.
The last technique we studied in this chapter was two-way analysis of variance (two-way ANOVA). Here we study the effects of two factors by carrying out a two-factor factorial experiment. If there is little or no interaction between the two factors, then we are able to separately study the significance of each of the two factors. On the other hand, if substantial interaction exists between the two factors, we study the nature of the differences between the treatment means.
Glossary of Terms
analysis of variance table:
A table that summarizes the sums of squares, mean squares, F statistic(s), and p-value(s) for an analysis of variance. (pages 450, 458, and 467)
completely randomized experimental design:
An experimental design in which independent, random samples of experimental units are assigned to the treatments. (page 442)
experimental units:
The entities (objects, people, and so on) to which the treatments are assigned. (page 441)
factor:
A variable that might influence the response variable; an independent variable. (page 441)
interaction:
When the relationship between the mean response and one factor depends on the level of the other factor. (page 466)
one-way ANOVA:
A method used to estimate and compare the effects of the different levels of a single factor on a response variable. (page 444)
randomized block design:
An experimental design that compares p treatments by using b blocks (experimental units or sets of experimental units). Each block is used exactly once to measure the effect of each and every treatment. (page 455)
replication:
When a treatment is applied to more than one experimental unit. (page 442)
response variable:
The variable of interest in an experiment; the dependent variable. (page 441)
treatment:
A value (or level) of a factor (or combination of factors). (page 441)
treatment mean:
The mean value of the response variable obtained by using a particular treatment. (page 444)
two-factor factorial experiment:
An experiment in which we randomly assign m experimental units to each combination of levels of two factors. (page 465)
two-way ANOVA:
A method used to study the effects of two factors on a response variable. (page 465)
Important Formulas and Tests
One-way ANOVA sums of squares: pages 446–447
One-way ANOVA F test: page 448
One-way ANOVA table: page 450
Estimation in one-way ANOVA: page 451
Randomized block sums of squares: page 457
Randomized block ANOVA table: page 458
Estimation in a randomized block experiment: page 459
Two-way ANOVA sums of squares: page 467
Two-way ANOVA table: page 467
Estimation in two-way ANOVA: page 470
Supplementary Exercises

11.29 A drug company wishes to compare the effects of three different drugs (X, Y, and Z) that are being developed to reduce cholesterol levels. Each drug is administered to six patients at the recommended dosage for six months. At the end of this period the reduction in cholesterol level is recorded for each patient. The results are given in Table 11.18. Completely analyze these data using one-way ANOVA. Use the MegaStat output in Figure 11.17. CholRed
Figure 11.17: MegaStat Output of an ANOVA of the Cholesterol Reduction Data

Table 11.18: Reduction of Cholesterol Levels CholRed

11.30 In an article in Accounting and Finance (the journal of the Accounting Association of Australia and New Zealand), Church and Schneider (1993) report on a study concerning auditor objectivity. A sample of 45 auditors was randomly divided into three groups: (1) the 15 auditors in group 1 designed an audit program for accounts receivable and evaluated an audit program for accounts payable designed by somebody else; (2) the 15 auditors in group 2 did the reverse; (3) the 15 auditors in group 3 (the control group) evaluated the audit programs for both accounts. All 45 auditors were then instructed to spend an additional 15 hours investigating suspected irregularities in either or both of the audit programs. The mean additional numbers of hours allocated to the accounts receivable audit program by the auditors in groups 1, 2, and 3 were x̄1 = 6.7, x̄2 = 9.7, and x̄3 = 7.6, respectively. Furthermore, a one-way ANOVA of the data shows that SST = 71.51 and SSE = 321.3.
a Define appropriate treatment means μ1, μ2, and μ3. Then test for statistically significant differences between these treatment means. Set α = .05. Can we conclude that the different auditor groups have different effects on the mean additional time allocated to investigating the accounts receivable audit program?
b Perform pairwise comparisons of the treatment means by computing a Tukey simultaneous 95 percent confidence interval for each of the pairwise differences μ1 − μ2, μ1 − μ3, and μ2 − μ3. Interpret the results. What do your results imply about the objectivity of auditors? What are the practical implications of this result?
11.31 The loan officers at a large bank can use three different methods for evaluating loan applications. Loan decisions can be based on (1) the applicant’s balance sheet (B), (2) examination of key financial ratios (F), or (3) use of a new decision support system (D). In order to compare these three methods, four of the bank’s loan officers are randomly selected. Each officer employs each of the evaluation methods for one month (the methods are employed in randomly selected orders). After a year has passed, the percentage of bad loans for each loan officer and evaluation method is determined. The data obtained by using this randomized block design are given in Table 11.19. Completely analyze the data using randomized block ANOVA. LoanEval
Table 11.19: Results of a Loan Evaluation Experiment LoanEval

11.32 In an article in the Accounting Review (1991), Brown and Solomon study the effects of two factors—confirmation of accounts receivable and verification of sales transactions—on account misstatement risk by auditors. Both factors had two levels—completed or not completed—and a line plot of the treatment mean misstatement risks is shown in Figure 11.18. This line plot makes it appear that interaction exists between the two factors. In your own words, explain what the nature of the interaction means in practical terms.
Figure 11.18: Line Plot for Exercise 11.32

Source: C. E. Brown and I. Solomon, “Configural Information Processing in Auditing: The Role of Domain-Specific Knowledge,” The Accounting Review 66, no. 1 (January 1991), p. 105 (Figure 1). Copyright © 1991 American Accounting Association. Used with permission.
11.33 In an article in the Academy of Management Journal (1987), W. D. Hicks and R. J. Klimoski studied the effects of two factors—degree of attendance choice and prior information—on managers’ evaluation of a two-day workshop concerning performance reviews. Degree of attendance choice had two levels: high (little pressure from supervisors to attend) and low (mandatory attendance). Prior information also had two levels: realistic preview and traditional announcement. Twenty-one managers were randomly assigned to the four treatment combinations. At the end of the program, each manager was asked to rate the workshop on a seven-point scale (1 = no satisfaction, 7 = extreme satisfaction). The following sample treatment means were obtained:

In addition, SS(1), SS(2), SS(int), and SSE were calculated to be, respectively, 22.26, 1.55, .61, and 114.4. Here factor 1 is degree of choice and factor 2 is prior information. Completely analyze this situation using two-way ANOVA.
11.34 An information systems manager wishes to compare the execution speed (in seconds) for a standard statistical software package using three different compilers. The manager tests each compiler using three different computer models, and the data in Table 11.20 are obtained. Completely analyze the data (using a computer package if you wish). In particular, test for compiler effects and computer model effects, and also perform pairwise comparisons. ExecSpd
Table 11.20: Results of an Execution Speed Experiment for Three Compilers (Seconds) ExecSpd

11.35 A research team at a school of agriculture carried out an experiment to study the effects of two fertilizer types (A and B) and four wheat types (M, N, O, and P) on crop yields (in bushels per one-third acre plot). The data in Table 11.21 were obtained by using a completely randomized experimental design. Analyze these data by using the following MegaStat output: Wheat
Table 11.21: Results of a Two-Factor Wheat Yield Experiment Wheat

11.36: Internet Exercise
In an article from the Journal of Statistics Education, Robin Lock describes a rich set of interesting data on selected attributes for a sample of 1993-model new cars. These data support a wide range of analyses. Indeed, the analysis possibilities are the subject of Lock’s article. Here our interest is in comparing mean highway gas mileage figures among the six identified vehicle types—compact, small, midsize, large, sporty, and van.
Go to the Journal of Statistics Education Web archive and retrieve the 1993-cars data set and related documentation: http://www.amstat.org/publications/jse/archive.htm. Click on 93cars.dat for data, 93cars.txt for documentation, and article associated with this data set for a full text of the article. Excel and MINITAB data files are also included on the CD-ROM ( 93Cars). Construct box plots of Highway MPG by Vehicle Type (if MINITAB or other suitable statistical software is available). Describe any apparent differences in gas mileage by vehicle type. Conduct an analysis of variance to test for differences in mean gas mileage by vehicle type. Prepare a brief report of your analysis and conclusions.
Appendix 11.1: Experimental Design and Analysis of Variance Using MINITAB
The instruction blocks in this section each begin by describing the entry of data into the MINITAB data window. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.1 for further information about entering data, saving data, and printing results when using MINITAB.

One-way ANOVA
in Figure 11.2(a) on page 449 (data file: GasMile2.MTW):
• In the Data window, enter the data from Table 11.1 (page 442) into three columns with variable names Type A, Type B, and Type C.
• Select Stat: ANOVA: One-way (Unstacked).
• In the “One-Way Analysis of Variance” dialog box, select ‘Type A’ ‘Type B’ ‘Type C’ into the “Responses (in separate columns)” window. (The single quotes are necessary because of the blank spaces in the variable names. The quotes will be added automatically if the names are selected from the variable list or if they are selected by double clicking.)
• Click OK in the “One-Way Analysis of Variance” dialog box.
To produce mileage by gasoline type boxplots similar to those shown in Table 11.1 (page 442):
• Click the Graphs… button in the “One-Way Analysis of Variance” dialog box.
• Check the “Boxplots of data” checkbox and click OK in the “One-Way Analysis of Variance—Graphs” dialog box.
• Click OK in the “One-Way Analysis of Variance” dialog box.
To produce Tukey pairwise comparisons:
• Click on the Comparisons… button in the “One-Way Analysis of Variance” dialog box.
• Check the “Tukey’s family error rate” checkbox.
• In the “Tukey’s family error rate” box, enter the desired experimentwise error rate (here we have entered 5, which denotes 5%—alternatively, we could enter the decimal fraction .05).
• Click OK in the “One-Way Multiple Comparisons” dialog box.
• Click OK in the “One-Way Analysis of Variance” dialog box.
• The one-way ANOVA output and the Tukey multiple comparisons will be given in the Session window, and the box plots will appear in a graphics window.

Randomized Block ANOVA in Figure 11.7 on page 458 (data File: CardBox.MTW):
• In the data window, enter the observed number of defective boxes from Table 11.7 into column C1 with variable name Rejects; enter the corresponding production method (1, 2, 3, or 4) into column C2 with variable name Method; and enter the corresponding machine operator (1, 2, or 3) into column C3 with variable name Operator.
• Select Stat: ANOVA: Two-way.
• In the “Two-way Analysis of Variance” dialog box, select Rejects into the Response window.
• Select Method into the Row Factor window and check the “Display Means” checkbox.
• Select Operator into the Column Factor window and check the “Display Means” checkbox.
• Check the “Fit additive model” checkbox.
• Click OK in the “Two-way Analysis of Variance” dialog box to display the randomized block ANOVA in the Session window.

Table of row, column, and cell means in Figure 11.12 on page 468 (data file: BakeSale2.MTW):
• In the data window, enter the observed demands from Table 11.12 (page 464) into column C1 with variable name Demand, enter the corresponding shelf display heights (Bottom, Middle, or Top) into column C2 with variable name Height, and enter the corresponding shelf display widths (Regular or Wide) into column C3 with variable name Width.
• Select Stat : Tables : Descriptive Statistics.
• In the “Table of Descriptive Statistics” dialog box, select Height into the “Categorical variables: For rows” window and select Width into the “Categorical variables: For columns” window.
• Click on the “Display summaries for Associated Variables…” button.
• In the “Descriptive Statistics—Summaries for Associated Variables” dialog box, select Demand into the “Associated variables” window, check the “Display Means” checkbox, and click OK.
• If cell frequencies are desired in addition to the row, column, and cell means, click OK in the “Table of Descriptive Statistics” dialog box.
• If cell frequencies are not desired, click on the “Display summaries for Categorical Variables…” button, uncheck the “Display Counts” checkbox, and click OK in the “Descriptive Statistics—Summaries for Categorical Variables” dialog box. Then, click OK in the “Table of Descriptive Statistics” dialog box.
• The row, column, and cell means are displayed in the Session window.

Two-way ANOVA
in Figure 11.12 on page 468 (data file: BakeSale2.MTW):
• In the data window, enter the observed demands from Table 11.12 (page 464) into column C1 with variable name Demand; enter the corresponding shelf display heights (Bottom, Middle, or Top) into column C2 with variable name Height; and enter the corresponding shelf display widths (Regular or Wide) into column C3 with variable name Width.
• Select Stat: ANOVA: Two-Way.
• In the “Two-Way Analysis of Variance” dialog box, select Demand into the Response window.
• Select Height into the “Row Factor” window.
• Select Width into the “Column Factor” window.
• To produce tables of means by Height and Width, check the “Display means” checkboxes next to the “Row factor” and “Column factor” windows. This will also produce individual confidence intervals for each level of the row factor and each level of the column factor—these intervals are not shown in Figure 11.12.
• Enter the desired level of confidence for the individual confidence intervals in the “Confidence level” box.
• Click OK in the “Two-Way Analysis of Variance” dialog box.

To produce Demand by Height and Demand by Width boxplots similar to those displayed in Table 11.12 on page 464:
• Select Graph: Boxplot.
• In the Boxplots dialog box, select “One Y With Groups” and click OK.
• In the “Boxplot—One Y, With Groups” dialog box, select Demand into the Graph variables window.
• Select Height into the “Categorical variables for grouping” window.
• Click OK in the “Boxplot—One Y, With Groups” dialog box to obtain boxplots of demand by levels of height in a graphics window.
• Repeat the steps above using Width as the “Categorical variable for grouping” to obtain boxplots of demand by levels of width in a separate graphics window.

To produce an interaction plot similar to that displayed in Figure 11.10(b) on page 465:
• Select Stat : ANOVA : Interactions plot.
• In the Interactions Plot dialog box, select Demand into the Responses window.
• Select Width and Height into the Factors window.
• Click OK in the Interactions Plot dialog box to obtain the plot in a graphics window.

Appendix 11.2: Experimental Design and Analysis of Variance Using Excel
The instruction blocks in this section each begin by describing the entry of data into an Excel spreadsheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results when using Excel.

One-way ANOVA
in Figure 11.2(b) on page 449 (data file: GasMile2.xlsx):
• Enter the gasoline mileage data from Table 11.1 (page 442) as follows: type the label “Type A” in cell A1 with its five mileage values in cells A2 to A6; type the label “Type B” in cell B1 with its five mileage values in cells B2 to B6; type the label “Type C” in cell C1 with its five mileage values in cells C2 to C6.
• Select Data : Data Analysis : Anova : Single Factor and click OK in the Data Analysis dialog box.
• In the “Anova: Single Factor” dialog box, enter A1.C6 into the “Input Range” window.
• Select the “Grouped by: Columns” option.
• Place a checkmark in the “Labels in first row” checkbox.
• Enter 0.05 into the Alpha box
• Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet.
• Click OK in the “Anova: Single Factor” dialog box.

Randomized block ANOVA in Figure 11.8 on page 461 (data file: SaleMeth.xlsx):
• Enter the sales methods data from Table 11.9 (page 460) as shown in the screen.
• Select Data : Data Analysis : Anova: Two-Factor Without Replication and click OK in the Data Analysis dialog box.
• In the “Anova: Two Factor Without Replication” dialog box, enter A1.D5 into the “Input Range” window.
• Place a checkmark in the “Labels” checkbox.
• Enter 0.05 in the Alpha box.
• Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet.
• Click OK in the “Anova: Two Factor Without Replication” dialog box.

Two-way ANOVA
in Figure 11.13 on page 472 (data file: SaleMeth2.xlsx):
• Enter the sales approach experiment data from Table 11.14 (page 472) as shown in the screen.
• Select Data: Data Analysis : Anova: Two-Factor With Replication and click OK in the Data Analysis dialog box.
• In the “Anova: Two Factor With Replication” dialog box, enter A1.C7 into the “Input Range” window.
• Enter the value 3 into the “Rows per Sample” box (this indicates the number of replications).
• Enter 0.05 in the Alpha box.
• Under output options, select “New Worksheet Ply” to have the output placed in a new worksheet and enter the name Output for the new worksheet.
• Click OK in the “Anova: Two Factor With Replication” dialog box.

Appendix 11.3: Experimental Design and Analysis of Variance Using MegaStat
The instructions in this section begin by describing the entry of data into an Excel worksheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results in Excel. Please refer to Appendix 1.3 for more information about using MegaStat.

One-way ANOVA
similar to Figure 11.2(b) on page 449 (data file: GasMile2.xlsx):
• Enter the gas mileage data in Table 11.1 (page 442) into columns A, B, and C—Type A mileages in column A (with label Type A), Type B mileages in column B (with label Type B), and Type C mileages in column C (with label Type C). Note that the input columns for the different groups must be side by side. However, the number of observations in each group can be different.
• Select Add-Ins: MegaStat: Analysis of Variance: One-Factor ANOVA.
• In the One-Factor ANOVA dialog box, use the autoexpand feature to enter the range A1.C6 into the Input Range window.
• If desired, request “Post-hoc Analysis” to obtain Tukey simultaneous comparisons and pairwise t tests. Select from the options: “Never,” “Always,” or “When p < .05.” The option “When p < .05” gives post-hoc analysis when the p-value for the F statistic is less than .05.
• Check the Plot Data checkbox to obtain a plot comparing the groups.
• Click OK in the One-Factor ANOVA dialog box.

Randomized block ANOVA similar to Figure 11.7 on page 458 (data file: CardBox.xlsx):
• Enter the cardboard box data in Table 11.7 (page 456) in the arrangement shown in the screen. Here each column corresponds to a treatment (in this case, a production method) and each row corresponds to a block (in this case, a machine operator). Identify the production methods using the labels Method 1, Method 2, Method 3, and Method 4 in cells B1, C1, D1, and E1. Identify the blocks using the labels Operator 1, Operator 2, and Operator 3 in cells A2, A3, and A4.
• Select Add-Ins: MegaStat: Analysis of Variance: Randomized Blocks ANOVA.
• In the Randomized Blocks ANOVA dialog box, click in the Input Range window and enter the range A1.E4.
• If desired, request “Post-hoc Analysis” to obtain Tukey simultaneous comparisons and pairwise t-tests. Select from the options: “Never,” “Always,” or “When p < .05.” The option “When p < .05” gives post-hoc analysis when the p-value related to the F statistic for the treatments is less than .05.
• Check the Plot Data checkbox to obtain a plot comparing the treatments.

Two-way ANOVA similar to Figure 11.12 on page 468 (data file: BakeSale2.xlsx):
• Enter the bakery demand data in Table 11.12 (page 464) in the arrangement shown in the screen. Here the row labels Bottom, Middle, and Top are the levels of factor 1 (in this case, shelf display height) and the column labels Regular and Wide are the levels of factor 2 (in this case, shelf display width). The arrangement of the data is as laid out in Table 11.12.
• Select Add-Ins: MegaStat: Analysis of Variance: Two-Factor ANOVA.
• In the Two-Factor ANOVA dialog box, enter the range A1.C10 into the Input Range window.
• Type 3 into the “Replications per Cell” window.
• Check the “Interaction Plot by Factor 1” and “Interaction Plot by Factor 2” checkboxes to obtain interaction plots.
• If desired, request “Post-hoc Analysis” to obtain Tukey simultaneous comparisons and pairwise t-tests. Select from the options: “Never,” “Always,” and “When p < .05.” The option “When p < .05” gives post-hoc analysis when the p-value related to the F statistic for a factor is less than .05. Here we have selected “Always.”
• Click OK in the Two-Factor ANOVA dialog box.

1 All of the box plots presented in this chapter have been obtained using MINITAB.

(Bowerman 440) Bowerman, Bruce L. Business Statistics in Practice, 5th Edition. McGraw-Hill Learning Solutions, 022008.
CHAPTER 12: Chi-Square Tests
Chapter Outline

12.1
Chi-Square Goodness of Fit Tests

12.2
A Chi-Square Test for Independence
In this chapter we present two useful hypothesis tests based on the chi-square distribution (we have discussed the chi-square distribution in Section 9.6). First, we consider the chi-square test of goodness of fit. This test evaluates whether data falling into several categories do so with a hypothesized set of probabilities. Second, we discuss the chi-square test for independence. Here data are classified on two dimensions and are summarized in a contingency table. The test for independence then evaluates whether the cross-classified variables are independent of each other. If we conclude that the variables are not independent, then we have established that the variables in question are related, and we must then investigate the nature of the relationship.
12.1: Chi-Square Goodness of Fit Tests
Multinomial probabilities


Sometimes we collect count data in order to study how the counts are distributed among several categories or cells. As an example, we might study consumer preferences for four different brands of a product. To do this, we select a random sample of consumers, and we ask each survey participant to indicate a brand preference. We then count the number of consumers who prefer each of the four brands. Here we have four categories (brands), and we study the distribution of the counts in each category in order to see which brands are preferred.
We often use categorical data to carry out a statistical inference. For instance, suppose that a major wholesaler in Cleveland, Ohio, carries four different brands of microwave ovens. Historically, consumer behavior in Cleveland has resulted in the market shares shown in Table 12.1. The wholesaler plans to begin doing business in a new territory—Milwaukee, Wisconsin. To study whether its policies for stocking the four brands of ovens in Cleveland can also be used in Milwaukee, the wholesaler compares consumer preferences for the four ovens in Milwaukee with the historical market shares observed in Cleveland. A random sample of 400 consumers in Milwaukee gives the preferences shown in Table 12.2.
Table 12.1: Market Shares for Four Microwave Oven Brands in Cleveland, Ohio MicroWav

Table 12.2: Brand Preferences for Four Microwave Ovens in Milwaukee, Wisconsin MicroWav

To compare consumer preferences in Cleveland and Milwaukee, we must consider a multinomial experiment. This is similar to the binomial experiment. However, a binomial experiment concerns count data that can be classified into two categories, while a multinomial experiment concerns count data that are classified into more than two categories. Specifically, the assumptions for the multinomial experiment are as follows:
The Multinomial Experiment
1 We perform an experiment in which we carry out n identical trials and in which there are k possible outcomes on each trial.
2 The probabilities of the k outcomes are denoted p1, p2, …, pk where p1 + p2 + · · · + pk = 1. These probabilities stay the same from trial to trial.
3 The trials in the experiment are independent.
4 The results of the experiment are observed frequencies (counts) of the number of trials that result in each of the k possible outcomes. The frequencies are denoted f1, f2, …, fk. That is, f1 is the number of trials resulting in the first possible outcome, f2 is the number of trials resulting in the second possible outcome, and so forth.
Notice that the scenario that defines a multinomial experiment is similar to that which defines a binomial experiment. In fact, a binomial experiment is simply a multinomial experiment where k equals 2 (there are two possible outcomes on each trial).
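To get a feel for what one run of a multinomial experiment produces, the short simulation below draws a single set of category counts for n = 400 trials with four outcome probabilities; it is only an illustration, and the probabilities used are the Cleveland market shares of Table 12.1.

import numpy as np

rng = np.random.default_rng(seed=1)
# One multinomial experiment: 400 independent trials, k = 4 outcomes,
# with the Cleveland market shares of Table 12.1 as the cell probabilities.
counts = rng.multinomial(400, [0.20, 0.35, 0.30, 0.15])
print(counts, counts.sum())   # four observed frequencies f1, ..., f4 summing to 400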
In general, the probabilities p1, p2, …, pk are unknown, and we estimate their values. Or, we compare estimates of these probabilities with a set of specified values. We now look at such an example.
EXAMPLE 12.1: The Microwave Oven Preference Case
Suppose the microwave oven wholesaler wishes to compare consumer preferences in Milwaukee with the historical market shares in Cleveland. If the consumer preferences in Milwaukee are substantially different, the wholesaler will consider changing its policies for stocking the ovens. Here we will define

Remembering that the historical market shares for brands 1, 2, 3, and 4 in Cleveland are 20 percent, 35 percent, 30 percent, and 15 percent, we test the null hypothesis
H0: p1 = .20,     p2 = .35,     p3 = .30,     and     p4 = .15
which says that consumer preferences in Milwaukee are consistent with the historical market shares in Cleveland. We test H0 versus
Ha: the previously stated null hypothesis is not true
To test H0 we must compare the “observed frequencies” given in Table 12.2 with the “expected frequencies” for the brands calculated on the assumption that H0 is true. For instance, if H0 is true, we would expect 400(.20) = 80 of the 400 Milwaukee consumers surveyed to prefer brand 1. Denoting this expected frequency for brand 1 as E1, the expected frequencies for brands 2, 3, and 4 when H0 is true are E2 = 400(.35) = 140, E3 = 400(.30) = 120, and E4 = 400(.15) = 60. Recalling that Table 12.2 gives the observed frequency for each brand, we have f1 = 102, f2 = 121, f3 = 120, and f4 = 57. We now compare the observed and expected frequencies by computing a chi-square statistic as follows:

Clearly, the more the observed frequencies differ from the expected frequencies, the larger χ2 will be and the more doubt will be cast on the null hypothesis. If the chi-square statistic is large enough (beyond a rejection point), then we reject H0.

To find an appropriate rejection point, it can be shown that, when the null hypothesis is true, the sampling distribution of χ2 is approximately a χ2 distribution with k − 1 = 4 − 1 = 3 degrees of freedom. If we wish to test H0 at the .05 level of significance, we reject H0 if and only if
χ2 > χ2.05
Since Table A.17 (page 878) tells us that the χ2.05 point corresponding to k − 1 = 3 degrees of freedom equals 7.81473, we find that
χ2 = 8.7786 > χ2.05 = 7.81473
and we reject H0 at the .05 level of significance. Alternatively, the p-value for this hypothesis test is the area under the curve of the chi-square distribution having 3 degrees of freedom to the right of χ2 = 8.7786. This p-value can be calculated to be .0323845. Since this p-value is less than .05, we can reject H0 at the .05 level of significance. Although there is no single MINITAB dialog box that produces a chi-square goodness of fit test, Figure 12.1 shows the output of a MINITAB session that computes the chi-square statistic and its related p-value for the oven wholesaler problem.
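The same statistic and p-value can also be reproduced outside MINITAB. For example, a brief SciPy sketch (assuming scipy is installed) is

from scipy import stats

observed = [102, 121, 120, 57]               # Milwaukee frequencies (Table 12.2)
shares = [0.20, 0.35, 0.30, 0.15]            # Cleveland market shares (Table 12.1)
expected = [400 * p for p in shares]         # 80, 140, 120, 60

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)   # about 8.7786 and 0.0324, matching the values in the text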
Figure 12.1: Output of a MINITAB Session That Computes the Chi-Square Statistic and Its Related p-Value for the Oven Wholesaler Example

We conclude that consumer preferences in Milwaukee for the four brands of ovens are not consistent with the historical market shares in Cleveland. Based on this conclusion, the wholesaler should consider changing its stocking policies for microwave ovens when it enters the Milwaukee market. To study how to change its policies, the wholesaler might compute a 95 percent confidence interval for, say, the proportion of consumers in Milwaukee who prefer brand 2. Since p̂2 = 121/400 = .3025, this interval is (see Section 8.4, page 325)

Since this entire interval is below .35, it suggests that (1) the market share for brand 2 ovens in Milwaukee will be smaller than the 35 percent market share that this brand commands in Cleveland, and (2) fewer brand 2 ovens (on a percentage basis) should be stocked in Milwaukee. Notice here that by restricting our attention to one particular brand (brand 2), we are essentially combining the other brands into a single group. It follows that we now have two possible outcomes—“brand 2” and “all other brands.” Therefore, we have a binomial experiment, and we can employ the methods of Section 8.4, which are based on the binomial distribution.
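That brand 2 interval can be checked with a few lines of Python; the z-based form p̂ ± z.025 √(p̂(1 − p̂)/n) from Section 8.4 is assumed here.

import math
from scipy import stats

n, f2 = 400, 121
p_hat = f2 / n                                    # 0.3025
z = stats.norm.ppf(0.975)                         # about 1.96
half = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half, p_hat + half)                 # roughly .2575 to .3475, entirely below .35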
In the following box we give a general chi-square goodness of fit test for multinomial probabilities:
A Goodness of Fit Test for Multinomial Probabilities
Consider a multinomial experiment in which each of n randomly selected items is classified into one of k groups. We let
fi = the number of items classified into group i (that is, the ith observed frequency)
Ei = npi
= the expected number of items that would be classified into group i if pi is the probability of a randomly selected item being classified into group i (that is, the ith expected frequency)
If we wish to test
H0: the values of the multinomial probabilities are p1, p2, …, pk—that is, the probability of a randomly selected item being classified into group 1 is p1, the probability of a randomly selected item being classified into group 2 is p2, and so forth
versus
Ha: at least one of the multinomial probabilities is not equal to the value stated in H0
we define the chi-square goodness of fit statistic to be
χ2 = Σ (fi − Ei)2/Ei
where the sum is taken over the k groups.
Also, define the p-value related to χ2 to be the area under the curve of the chi-square distribution having k − 1 degrees of freedom to the right of χ2.
Then, we can reject H0 in favor of Ha at level of significance α if either of the following equivalent conditions holds:
1 χ2 > χ2α
2 p-value < α

Here the χ2α point is based on k − 1 degrees of freedom. This test is based on the fact that it can be shown that, when H0 is true, the sampling distribution of χ2 is approximately a chi-square distribution with k − 1 degrees of freedom, if the sample size n is large. It is generally agreed that n should be considered large if all of the “expected cell frequencies” (Ei values) are at least 5. Furthermore, recent research implies that this condition on the Ei values can be somewhat relaxed. For example, Moore and McCabe (1993) indicate that it is reasonable to use the chi-square approximation if the number of groups (k) exceeds 4, the average of the Ei values is at least 5, and the smallest Ei value is at least 1. Notice that in Example 12.1 all of the Ei values are much larger than 5. Therefore, the chi-square test is valid.

A special version of the chi-square goodness of fit test for multinomial probabilities is called a test for homogeneity. This involves testing the null hypothesis that all of the multinomial probabilities are equal. For instance, in the microwave oven situation we would test
H0: p1 = p2 = p3 = p4 = .25
which would say that no single brand of microwave oven is preferred to any of the other brands (equal preferences). If this null hypothesis is rejected in favor of
Ha: At least one of p1, p2, p3, and p4 exceeds .25
we would conclude that there is a preference for one or more of the brands. Here each of the expected cell frequencies equals .25(400) = 100. Remembering that the observed cell frequencies are f1 = 102, f2 = 121, f3 = 120, and f4 = 57, the chi-square statistic is
χ2 = (102 − 100)2/100 + (121 − 100)2/100 + (120 − 100)2/100 + (57 − 100)2/100 = 26.94
Since χ2 = 26.94 is greater than χ2.05 = 7.81473 (see Table A.17 on page 878 with k − 1 = 4 − 1 = 3 degrees of freedom), we reject H0 at level of significance .05. We conclude that preferences for the four brands are not equal and that at least one brand is preferred to the others.

Normal distributions

We have seen that many statistical methods are based on the assumption that a random sample has been selected from a normally distributed population. We can check the validity of the normality assumption by using frequency distributions, stem-and-leaf displays, histograms, and normal plots. Another approach is to use a chi-square goodness of fit test to check the normality assumption. We show how this can be done in the following example.

EXAMPLE 12.2: The Car Mileage Case

Consider the sample of 50 gas mileages given in Table 1.4 (page 10). A histogram of these mileages (see Figure 2.10, page 60) is symmetrical and bell-shaped. This suggests that the sample of mileages has been randomly selected from a normally distributed population. In this example we use a chi-square goodness of fit test to check the normality of the mileages. To perform this test, we first divide the number line into intervals (or categories). One way to do this is to use the class boundaries of the histogram in Figure 2.10. Table 12.3 gives these intervals and also gives observed frequencies (counts of the number of mileages in each interval), which have been obtained from the histogram of Figure 2.10. The chi-square test is done by comparing these observed frequencies with the expected frequencies in the rightmost column of Table 12.3. To explain how the expected frequencies are calculated, we first use the sample mean x̄ = 31.56 and the sample standard deviation s = .798 of the 50 mileages as point estimates of the population mean μ and population standard deviation σ.
Then, for example, consider p1, the probability that a randomly selected mileage will be in the first interval (less than 30.0) in Table 12.3, if the population of all mileages is normally distributed. We estimate p1 to be .0256.

Table 12.3: Observed and Expected Cell Frequencies for a Chi-Square Goodness of Fit Test for Testing the Normality of the 50 Gasoline Mileages in Table 1.4 GasMiles

It follows that E1 = 50p1 = 50(.0256) = 1.28 is the expected frequency for the first interval under the normality assumption. Next, if we consider p2, the probability that a randomly selected mileage will be in the second interval in Table 12.3 if the population of all mileages is normally distributed, we estimate p2 to be .0662. It follows that E2 = 50p2 = 50(.0662) = 3.31 is the expected frequency for the second interval under the normality assumption. The other expected frequencies are computed similarly. In general, pi is the probability that a randomly selected mileage will be in interval i if the population of all possible mileages is normally distributed with mean 31.56 and standard deviation .798, and Ei is the expected number of the 50 mileages that would be in interval i if the population of all possible mileages has this normal distribution.

It seems reasonable to reject the null hypothesis
H0: the population of all mileages is normally distributed
in favor of the alternative hypothesis
Ha: the population of all mileages is not normally distributed
if the observed frequencies in Table 12.3 differ substantially from the corresponding expected frequencies in Table 12.3. We compare the observed frequencies with the expected frequencies under the normality assumption by computing the chi-square statistic χ2 = Σ (fi − Ei)2/Ei. Since we have estimated m = 2 parameters (μ and σ) in computing the expected frequencies (Ei values), it can be shown that the sampling distribution of χ2 is approximately a chi-square distribution with k − 1 − m = 8 − 1 − 2 = 5 degrees of freedom. Therefore, we can reject H0 at level of significance α if χ2 > χ2α, where the χ2α point is based on k − 1 − m = 8 − 1 − 2 = 5 degrees of freedom. If we wish to test H0 at the .05 level of significance, Table A.17 tells us that χ2.05 = 11.0705. Therefore, since χ2 = .43242 is less than χ2.05 = 11.0705, we cannot reject H0 at the .05 level of significance, and we cannot reject the hypothesis that the population of all mileages is normally distributed. Therefore, for practical purposes it is probably reasonable to assume that the population of all mileages is approximately normally distributed and that inferences based on this assumption are valid. Finally, the p-value for this test, which is the area under the chi-square curve having 5 degrees of freedom to the right of χ2 = .43242, can be shown to equal .994. Since this p-value is large (much greater than .05), we have little evidence to support rejecting the null hypothesis (normality).

Note that although some of the expected cell frequencies in Table 12.3 are not at least 5, the number of classes (groups) is 8 (which exceeds 4), the average of the expected cell frequencies is at least 5, and the smallest expected cell frequency is at least 1. Therefore, it is probably reasonable to consider the result of this chi-square test valid. If we choose to base the chi-square test on the more restrictive assumption that all of the expected cell frequencies are at least 5, then we can combine adjacent cell frequencies as follows:

When we use these combined cell frequencies, the chi-square approximation is based on k − 1 − m = 5 − 1 − 2 = 2 degrees of freedom.
We find that χ2 = .30102 and that p-value = .860. Since this p-value is much greater than .05, we cannot reject the hypothesis of normality at the .05 level of significance.

In Example 12.2 we based the intervals employed in the chi-square goodness of fit test on the class boundaries of a histogram for the observed mileages. Another way to establish intervals for such a test is to compute the sample mean x̄ and the sample standard deviation s and to use intervals based on the Empirical Rule as follows:

Interval 1: less than x̄ − 2s
Interval 2: x̄ − 2s to x̄ − s
Interval 3: x̄ − s to x̄
Interval 4: x̄ to x̄ + s
Interval 5: x̄ + s to x̄ + 2s
Interval 6: greater than x̄ + 2s

However, care must be taken to ensure that each of the expected frequencies is large enough (using the previously discussed criteria). No matter how the intervals are established, we use x̄ as an estimate of the population mean μ and we use s as an estimate of the population standard deviation σ when we calculate the expected frequencies (Ei values). Since we are estimating m = 2 population parameters, the rejection point χ²_α is based on k − 1 − m = k − 1 − 2 = k − 3 degrees of freedom, where k is the number of intervals employed. In the following box we summarize how to carry out this chi-square test:

A Goodness of Fit Test for a Normal Distribution

1 We will test the following null and alternative hypotheses:
H0: the population has a normal distribution
Ha: the population does not have a normal distribution
2 Select a random sample of size n and compute the sample mean x̄ and sample standard deviation s.
3 Define k intervals for the test. Two reasonable ways to do this are to use the classes of a histogram of the data or to use intervals based on the Empirical Rule.
4 Record the observed frequency (fi) for each interval.
5 Calculate the expected frequency (Ei) for each interval under the normality assumption. Do this by computing the probability that a normal variable having mean x̄ and standard deviation s is within the interval and by multiplying this probability by n. Make sure that each expected frequency is large enough. If necessary, combine intervals to make the expected frequencies large enough.
6 Calculate the chi-square statistic

χ2 = Σ (fi − Ei)²/Ei

and define the p-value for the test to be the area under the curve of the chi-square distribution having k − 3 degrees of freedom to the right of χ2.
7 Reject H0 in favor of Ha at level of significance α if either of the following equivalent conditions holds:
a χ2 > χ²_α
b p-value < α
Here the point χ²_α is based on k − 3 degrees of freedom.

While chi-square goodness of fit tests are often used to verify that it is reasonable to assume that a random sample has been selected from a normally distributed population, such tests can also check other distribution forms. For instance, we might verify that it is reasonable to assume that a random sample has been selected from a Poisson distribution. In general, the number of degrees of freedom for the chi-square goodness of fit test will equal k − 1 − m, where k is the number of intervals or categories employed in the test and m is the number of population parameters that must be estimated to calculate the needed expected frequencies.
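The boxed procedure can also be sketched in code. The following Python example assumes NumPy and SciPy (the text itself uses MINITAB, Excel, and MegaStat), follows steps 2–7 with Empirical Rule intervals, and uses a randomly generated placeholder sample standing in for real data such as the 50 mileages of Example 12.2.

```python
# A sketch of the boxed goodness of fit test for normality, assuming NumPy and
# SciPy. The sample below is a randomly generated placeholder; substitute any
# real sample to apply the test to actual data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=31.5, scale=0.8, size=50)   # illustrative data only

n = data.size
xbar, s = data.mean(), data.std(ddof=1)           # step 2: sample mean and std dev

# Step 3: k = 6 intervals based on the Empirical Rule.
cuts = [xbar - 2*s, xbar - s, xbar, xbar + s, xbar + 2*s]

# Step 4: observed frequency fi in each interval.
observed = np.bincount(np.digitize(data, cuts), minlength=6)

# Step 5: expected frequency Ei = n * P(interval) under a normal distribution
# with mean xbar and standard deviation s; check the Ei are large enough.
cdf = stats.norm.cdf(cuts, loc=xbar, scale=s)
probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
expected = n * probs

# Steps 6 and 7: chi-square statistic and p-value on k - 3 degrees of freedom
# (m = 2 parameters, the mean and standard deviation, were estimated).
chi_sq = ((observed - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(chi_sq, df=len(observed) - 3)
print(f"chi-square = {chi_sq:.4f}, p-value = {p_value:.4f}")
```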
Exercises for Section 12.1

CONCEPTS

12.1 Describe the characteristics that define a multinomial experiment.
12.2 Give the conditions that the expected cell frequencies must meet in order to validly carry out a chi-square goodness of fit test.
12.3 Explain the purpose of a goodness of fit test.
12.4 When performing a chi-square goodness of fit test, explain why a large value of the chi-square statistic provides evidence that H0 should be rejected.
12.5 Explain two ways to obtain intervals for a goodness of fit test of normality.

METHODS AND APPLICATIONS

12.6 The shares of the U.S. automobile market held in 1990 by General Motors, Japanese manufacturers, Ford, Chrysler, and other manufacturers were, respectively, 36%, 26%, 21%, 9%, and 8%. Suppose that a new survey of 1,000 new-car buyers shows the following purchase frequencies: AutoMkt
a Show that it is appropriate to carry out a chi-square test using these data.
b Test to determine whether the current market shares differ from those of 1990. Use α = .05.

12.7 Last rating period, the percentages of viewers watching several channels between 11 p.m. and 11:30 p.m. in a major TV market were as follows: TVRate
Suppose that in the current rating period, a survey of 2,000 viewers gives the following frequencies:
a Show that it is appropriate to carry out a chi-square test using these data.
b Test to determine whether the viewing shares in the current rating period differ from those in the last rating period at the .10 level of significance. What do you conclude?

12.8 In the Journal of Marketing Research (November 1996), Gupta studied the extent to which the purchase behavior of scanner panels is representative of overall brand preferences. A scanner panel is a sample of households whose purchase data are recorded when a magnetic identification card is presented at a store checkout. The table below gives peanut butter purchase data collected by the A. C. Nielson Company using a panel of 2,500 households in Sioux Falls, South Dakota. The data were collected over 102 weeks. The table also gives the market shares obtained by recording all peanut butter purchases at the same stores during the same period. ScanPan
a Show that it is appropriate to carry out a chi-square test.
b Test to determine whether the purchase behavior of the panel of 2,500 households is consistent with the purchase behavior of the population of all peanut butter purchasers. Assume here that purchase decisions by panel members are reasonably independent, and set α = .05.

12.9 The purchase frequencies for six different brands of videotape are observed at a video store over one month: VidTape
a Carry out a test of homogeneity for these data with α = .025.
b Interpret the result of your test.

12.10 A wholesaler has recently developed a computerized sales invoicing system. Prior to implementing this system, a manual system was used. The distribution of the number of errors per invoice for the manual system is as follows: Invoice2
After implementation of the computerized system, a random sample of 500 invoices gives the following error distribution:
a Show that it is appropriate to carry out a chi-square test using these data.
b Use the following Excel output to determine whether the error percentages for the computerized system differ from those for the manual system at the .05 level of significance. What do you conclude?

12.11 Consider the sample of 65 payment times given in Table 2.4 (page 56). Use these data to carry out a chi-square goodness of fit test of whether the population of all payment times is normally distributed by doing the following: PayTime
a It can be shown that x̄ = 18.1077 and that s = 3.9612 for the payment time data.
Use these values to compute the intervals (1) Less than x − 2s (2) x − 2s < x − s (3) x − s < x (4) x < x + s (5) x + s < x + 2s (6) Greater than x + 2s b Assuming that the population of all payment times is normally distributed, find the probability that a randomly selected payment time will be contained in each of the intervals found in part a. Use these probabilities to compute the expected frequency under the normality assumption for each interval. c Verify that the average of the expected frequencies is at least 5 and that the smallest expected frequency is at least 1. What does this tell us? d Formulate the null and alternative hypotheses for the chi-square test of normality. e For each interval given in part a, find the observed frequency. Then calculate the chi-square statistic needed for the chi-square test of normality. f Use the chi-square statistic to test normality at the .05 level of significance. What do you conclude? 12.12 Consider the sample of 60 bottle design ratings given in Table 1.3 (page 8). Use these data to carry out a chi-square goodness of fit test to determine whether the population of all bottle design ratings is normally distributed. Use α = .10, and note that x = 30.35 and s = 3.1073 for the 60 bottle design ratings. Design 12.13 THE BANK CUSTOMER WAITING TIME CASE Consider the sample of 100 waiting times given in Table 1.8 (page 14). Use these data to carry out a chi-square goodness of fit test to determine whether the population of all waiting times is normally distributed. Use α = .10, and note that x = 5.46 and s = 2.475 for the 100 waiting times. WaitTime 12.14 The table on the next page gives a frequency distribution describing the number of errors found in 30 1,000-line samples of computer code. Suppose that we wish to determine whether the number of errors can be described by a Poisson distribution with mean μ = 4.5. Using the Poisson probability tables, fill in the table. Then perform an appropriate chi-square goodness of fit test at the .05 level of significance. What do you conclude about whether the number of errors can be described by a Poisson distribution with μ = 4.5? Explain. CodeErr 12.2: A Chi-Square Test for Independence We have spent considerable time in previous chapters studying relationships between variables. One way to study the relationship between two variables is to classify multinomial count data on two scales (or dimensions) by setting up a contingency table. EXAMPLE 12.3: The Client Satisfaction Case A financial institution sells several kinds of investment products—a stock fund, a bond fund, and a tax-deferred annuity. The company is examining whether customer satisfaction depends on the type of investment product purchased. To do this, 100 clients are randomly selected from the population of clients who have purchased shares in exactly one of the funds. The company records the fund type purchased by these clients and asks each sampled client to rate his or her level of satisfaction with the fund as high, medium, or low. Table 12.4 on page 498 gives the survey results. Table 12.4: Results of a Customer Satisfaction Survey Given to 100 Randomly Selected Clients Who Invest in One of Three Fund Types—a Bond Fund, a Stock Fund, or a Tax-Deferred Annuity Invest We can look at the data in Table 12.4 in an organized way by constructing a contingency table (also called a two-way cross-classification table). Such a table classifies the data on two dimensions—type of fund and degree of client satisfaction. 
Figure 12.2 gives MegaStat and MINITAB output of a contingency table of fund type versus level of satisfaction. This table consists of a row for each fund type and a column for each level of satisfaction. Together, the rows and columns form a “cell” for each fund type–satisfaction level combination. That is, there is a cell for each “contingency” with respect to fund type and satisfaction level. Both the MegaStat and MINITAB output give a cell frequency for each cell, which is the top number given in the cell. This is a count (observed frequency) of the number of surveyed clients with the cell’s fund type– satisfaction level combination. For instance, 15 of the surveyed clients invest in the bond fund and report high satisfaction, while 24 of the surveyed clients invest in the tax-deferred annuity and report medium satisfaction. In addition to the cell frequencies, each output also gives Figure 12.2: MegaStat and MINITAB Output of a Contingency Table of Fund Type versus Level of Client Satisfaction (See the Survey Results in Table 12.4 Invest Row totals (at the far right of each table): These are counts of the numbers of clients who invest in each fund type. These row totals tell us that 1 30 clients invest in the bond fund. 2 30 clients invest in the stock fund. 3 40 clients invest in the tax-deferred annuity. Column totals (at the bottom of each table): These are counts of the numbers of clients who report high, medium, and low satisfaction. These column totals tell us that 1 40 clients report high satisfaction. 2 40 clients report medium satisfaction. 3 20 clients report low satisfaction. Overall total (the bottom-right entry in each table): This tells us that a total of 100 clients were surveyed. Besides the row and column totals, both outputs give row and column percentages (directly below the row and column totals). For example, 30.00 percent of the surveyed clients invest in the bond fund, and 20.00 percent of the surveyed clients report low satisfaction. Furthermore, in addition to a cell frequency, the MegaStat output gives a row percentage, a column percentage, and a cell percentage for each cell (these are below the cell frequency in each cell). For instance, looking at the “bond fund–high satisfaction cell,” we see that the 15 clients in this cell make up 50.0 percent of the 30 clients who invest in the bond fund, and they make up 37.5 percent of the 40 clients who report high satisfaction. In addition, these 15 clients make up 15.0 percent of the 100 clients surveyed. The MINITAB output gives a row percentage and a column percentage, but not a cell percentage, for each cell. We will explain the last number that appears in each cell of the MINITAB output later in this section. Looking at the contingency tables, it appears that the level of client satisfaction may be related to the fund type. We see that higher satisfaction ratings seem to be reported by stock and bond fund investors, while holders of tax-deferred annuities report lower satisfaction ratings. To carry out a formal statistical test we can test the null hypothesis H0: fund type and level of client satisfaction are independent versus Ha: fund type and level of client satisfaction are dependent In order to perform this test, we compare the counts (or observed cell frequencies) in the contingency table with the counts that would appear in the contingency table if we assume that fund type and level of satisfaction are independent. 
Because these latter counts are computed by assuming independence, we call them the expected cell frequencies under the independence assumption. We illustrate how to calculate these expected cell frequencies by considering the cell corresponding to the bond fund and high client satisfaction.

We first use the data in the contingency table to compute an estimate of the probability that a randomly selected client invests in the bond fund. Denoting this probability as pB, we estimate pB by dividing the row total for the bond fund by the total number of clients surveyed. That is, denoting the row total for the bond fund as rB and letting n denote the total number of clients surveyed, the estimate of pB is rB/n = 30/100 = .3. Next we compute an estimate of the probability that a randomly selected client will report high satisfaction. Denoting this probability as pH, we estimate pH by dividing the column total for high satisfaction by the total number of clients surveyed. That is, denoting the column total for high satisfaction as cH, the estimate of pH is cH/n = 40/100 = .4.

Next, assuming that investing in the bond fund and reporting high satisfaction are independent, we compute an estimate of the probability that a randomly selected client invests in the bond fund and reports high satisfaction. Denoting this probability as pBH, we can compute its estimate by recalling from Section 4.4 that if two events A and B are statistically independent, then P(A ∩ B) equals P(A)P(B). It follows that, if we assume that investing in the bond fund and reporting high satisfaction are independent, we can compute an estimate of pBH by multiplying the estimate of pB by the estimate of pH. That is, the estimate of pBH is (rB/n)(cH/n) = (.3)(.4) = .12. Finally, we compute an estimate of the expected cell frequency under the independence assumption. Denoting the expected cell frequency as EBH, the estimate of EBH is

ÊBH = n(rB/n)(cH/n) = 100(.12) = 12

This estimated expected cell frequency is given in the MINITAB output of Figure 12.2(b) as the last number under the observed cell frequency for the bond fund–high satisfaction cell. Noting that the expression for ÊBH can be written as

ÊBH = n(rB/n)(cH/n) = rB cH / n

we can generalize to obtain a formula for the estimated expected cell frequency for any cell in the contingency table. Letting Êij denote the estimated expected cell frequency corresponding to row i and column j in the contingency table, we see that

Êij = ri cj / n

where ri is the row total for row i and cj is the column total for column j. For example, for the fund type–satisfaction level contingency table, the estimated expected cell frequency for the stock fund–high satisfaction cell is 30(40)/100 = 12, and the estimated expected cell frequency for the tax-deferred annuity–medium satisfaction cell is 40(40)/100 = 16. These (and the other estimated expected cell frequencies under the independence assumption) are the last numbers below the observed cell frequencies in the MINITAB output of Figure 12.2(b). Intuitively, these estimated expected cell frequencies tell us what the contingency table looks like if fund type and level of client satisfaction are independent.

To test the null hypothesis of independence, we will compute a chi-square statistic that compares the observed cell frequencies with the estimated expected cell frequencies calculated assuming independence. Letting fij denote the observed cell frequency for cell ij, we compute

χ2 = Σ (fij − Êij)²/Êij

where the sum is taken over all cells of the contingency table. If the value of the chi-square statistic is large, this indicates that the observed cell frequencies differ substantially from the expected cell frequencies calculated by assuming independence. Therefore, the larger the value of chi-square, the more doubt is cast on the null hypothesis of independence.
To find an appropriate rejection point, we let r denote the number of rows in the contingency table and we let c denote the number of columns. Then, it can be shown that, when the null hypothesis of independence is true, the sampling distribution of χ2 is approximately a χ2 distribution with (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom. If we test H0 at the .05 level of significance, we reject H0 if and only if

χ2 > χ²_.05

Since Table A.17 (page 878) tells us that the χ²_.05 point corresponding to (r − 1)(c − 1) = 4 degrees of freedom equals 9.48773, we have χ2 = 46.438 > 9.48773, and we reject H0 at the .05 level of significance. We conclude that fund type and level of client satisfaction are not independent.

In the following box we summarize how to carry out a chi-square test for independence:

A Chi-Square Test for Independence

Suppose that each of n randomly selected elements is classified on two dimensions, and suppose that the result of the two-way classification is a contingency table having r rows and c columns. Let
fij = the cell frequency corresponding to row i and column j of the contingency table (that is, the number of elements classified in row i and column j)
ri = the row total for row i in the contingency table
cj = the column total for column j in the contingency table
Êij = ri cj / n = the estimated expected number of elements that would be classified in row i and column j of the contingency table if the two classifications are statistically independent
If we wish to test
H0: the two classifications are statistically independent versus
Ha: the two classifications are statistically dependent
we define the test statistic

χ2 = Σ (fij − Êij)²/Êij

where the sum is taken over all cells of the contingency table. Also, define the p-value related to χ2 to be the area under the curve of the chi-square distribution having (r − 1)(c − 1) degrees of freedom to the right of χ2. Then, we can reject H0 in favor of Ha at level of significance α if either of the following equivalent conditions holds:
1 χ2 > χ²_α
2 p-value < α
Here the point χ²_α is based on (r − 1)(c − 1) degrees of freedom.

This test is based on the fact that it can be shown that, when the null hypothesis of independence is true, the sampling distribution of χ2 is approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom, if the sample size n is large. It is generally agreed that n should be considered large if all of the estimated expected cell frequencies (Êij values) are at least 5. Moore and McCabe (1993) indicate that it is reasonable to use the chi-square approximation if the number of cells (rc) exceeds 4, the average of the Êij values is at least 5, and the smallest Êij value is at least 1. Notice that in Figure 12.2(b) all of the estimated expected cell frequencies are greater than 5.

EXAMPLE 12.4: The Client Satisfaction Case

Again consider the MegaStat and MINITAB outputs of Figure 12.2, which give the contingency table of fund type versus level of client satisfaction. Both outputs give the chi-square statistic (= 46.438) for testing the null hypothesis of independence, as well as the related p-value. We see that this p-value is less than .001. It follows, therefore, that we can reject H0: fund type and level of client satisfaction are independent at the .05 level of significance, since the p-value is less than .05.

In order to study the nature of the dependency between the classifications in a contingency table, it is often useful to plot the row and/or column percentages. As an example, Figure 12.3 gives plots of the row percentages in the contingency table of Figure 12.2(a).
For instance, looking at the column in this contingency table corresponding to a high level of satisfaction, the contingency table tells us that 40.00 percent of the surveyed clients report a high level of satisfaction. If fund type and level of satisfaction really are independent, then we would expect roughly 40 percent of the clients in each of the three categories—bond fund participants, stock fund participants, and tax-deferred annuity holders—to report a high level of satisfaction. That is, we would expect the row percentages in the “high satisfaction” column to be roughly 40 percent in each row. However, Figure 12.3(a) gives a plot of the percentages of clients reporting a high level of satisfaction for each investment type (that is, the figure plots the three row percentages in the column corresponding to “high satisfaction”). We see that these percentages vary considerably. Noting that the dashed line in the figure is the 40 percent reporting a high level of satisfaction for the overall group, we see that the percentage of stock fund participants reporting high satisfaction is 80 percent. This is far above the 40 percent we would expect if independence exists. On the other hand, the percentage of tax-deferred annuity holders reporting high satisfaction is only 2.5 percent—way below the expected 40 percent if independence exists. In a similar fashion, Figures 12.3(b) and (c) plot the row percentages for the medium and low satisfaction columns in the contingency table. These plots indicate that stock fund participants report medium and low levels of satisfaction less frequently than the overall group of clients, and that tax-deferred annuity participants report medium and low levels of satisfaction more frequently than the overall group of clients. Figure 12.3: Plots of Row Percentages versus Investment Type for the Contingency Table in Figure 12.2(a) To conclude this section, we note that the chi-square test for independence can be used to test the equality of several population proportions. We will show how this is done in Exercise 12.21. Exercises for Section 12.2 CONCEPTS 12.15 What is the purpose behind summarizing data in the form of a two-way contingency table? 12.16 When performing a chi-square test for independence, explain how the “cell frequencies under the independence assumption” are calculated. For what purpose are these frequencies calculated? METHODS AND APPLICATIONS 12.17 A marketing research firm wishes to study the relationship between wine consumption and whether a person likes to watch professional tennis on television. One hundred randomly selected people are asked whether they drink wine and whether they watch tennis. The following results are obtained: WineCons a For each row and column total, calculate the corresponding row or column percentage. b For each cell, calculate the corresponding cell, row, and column percentages. c Test the hypothesis that whether people drink wine is independent of whether people watch tennis. Set α = .05. d Given the results of the chi-square test, does it make sense to advertise wine during a televised tennis match (assuming that the ratings for the tennis match are high enough)? Explain. 12.18 In recent years major efforts have been made to standardize accounting practices in different countries; this is called harmonization. In an article in Accounting and Business Research, Emmanuel N. Emenyonu and Sidney J. Gray (1992) studied the extent to which accounting practices in France, Germany, and the UK are harmonized. 
DeprMeth a Depreciation method is one of the accounting practices studied by Emenyonu and Gray. Three methods were considered—the straight-line method (S), the declining balance method (D), and a combination of D & S (sometimes European firms start with the declining balance method and then switch over to the straight-line method when the figure derived from straight line exceeds that from declining balance). The data in Table 12.5 summarize the depreciation methods used by a sample of 78 French, German, and U.K. firms. Use these data and the MegaStat output to test the hypothesis that depreciation method is independent of a firm’s location (country) at the .05 level of significance. Table 12.5: Depreciation Methods Used by a Sample of 78 Firms DeprMeth b Perform a graphical analysis to study the relationship between depreciation method and country. What conclusions can be made about the nature of the relationship? 12.19 In the book Business Research Methods (5th ed.), Donald R. Cooper and C. William Emory discuss studying the relationship between on-the-job accidents and smoking. Cooper and Emory describe the study as follows: Accident Suppose a manager implementing a smoke-free workplace policy is interested in whether smoking affects worker accidents. Since the company has complete reports of on-the-job accidents, she draws a sample of names of workers who were involved in accidents during the last year. A similar sample from among workers who had no reported accidents in the last year is drawn. She interviews members of both groups to determine if they are smokers or not. The sample results are given in Table 12.6. Table 12.6: A Contingency Table of the Results of the Accidents Study Accident a For each row and column total in Table 12.6, find the corresponding row/column percentage. b For each cell in Table 12.6, find the corresponding cell, row, and column percentages. c Use the MINITAB output in Figure 12.4 to test the hypothesis that the incidence of on-the-job accidents is independent of smoking habits. Set α = .01. Figure 12.4: MINITAB Output of a Chi-Square Test for Independence in the Accident Study d Is there a difference in on-the-job accident occurrences between smokers and nonsmokers? Explain. 12.20 In the book Essentials of Marketing Research, William R. Dillon, Thomas J. Madden, and Neil A. Firtle discuss the relationship between delivery time and computer-assisted ordering. A sample of 40 firms shows that 16 use computer-assisted ordering, while 24 do not. Furthermore, past data are used to categorize each firm’s delivery times as below the industry average, equal to the industry average, or above the industry average. The results obtained are given in Table 12.7. Table 12.7: A Contingency Table Relating Delivery Time and Computer-Assisted Ordering DelTime a Test the hypothesis that delivery time performance is independent of whether computer-assisted ordering is used. What do you conclude by setting α = .05? DelTime b Verify that a chi-square test is appropriate. c Is there a difference between delivery-time performance between firms using computer-assisted ordering and those not using computer-assisted ordering? d Carry out graphical analysis to investigate the relationship between delivery-time performance and computer-assisted ordering. Describe the relationship. 12.21 A television station wishes to study the relationship between viewership of its 11 p.m. news program and viewer age (18 years or less, 19 to 35, 36 to 54, 55 or older). 
A sample of 250 television viewers in each age group is randomly selected, and the number who watch the station’s 11 p.m. news is found for each sample. The results are given in Table 12.8. TVView Table 12.8: A Summary of the Results of a TV Viewership Study TVView a Let p1, p2, p3, and p4 be the proportions of all viewers in each age group who watch the station’s 11 p.m. news. If these proportions are equal, then whether a viewer watches the station’s 11 p.m. news is independent of the viewer’s age group. Therefore, we can test the null hypothesis H0 that p1, p2, p3, and p4 are equal by carrying out a chi-square test for independence. Perform this test by setting α = .05. b Compute a 95 percent confidence interval for the difference between p1 and p4. Chapter Summary In this chapter we presented two hypothesis tests that employ the chi-square distribution. In Section 12.1 we discussed a chi-square test of goodness of fit. Here we considered a situation in which we study how count data are distributed among various categories. In particular, we considered a multinomial experiment in which randomly selected items are classified into several groups, and we saw how to perform a goodness of fit test for the multinomial probabilities associated with these groups. We also explained how to perform a goodness of fit test for normality. In Section 12.2 we presented a chi-square test for independence. Here we classify count data on two dimensions, and we summarize the cross-classification in the form of a contingency table. We use the cross-classified data to test whether the two classifications are statistically independent, which is really a way to see whether the classifications are related. We also learned that we can use graphical analysis to investigate the nature of the relationship between the classifications. Glossary of Terms chi-square test for independence: A test to determine whether two classifications are independent. (page 500) contingency table: A table that summarizes data that have been classified on two dimensions or scales. (page 496) goodness of fit test for multinomial probabilities: A test to determine whether multinomial probabilities are equal to a specific set of values. (page 490) goodness of fit test for normality: A test to determine if a sample has been randomly selected from a normally distributed population. (page 493) homogeneity (test for): A test of the null hypothesis that all multinomial probabilities are equal. (page 490) multinomial experiment: An experiment that concerns count data that are classified into more than two categories. (page 487) Important Formulas and Tests A goodness of fit test for multinomial probabilities: page 490 A goodness of fit test for a normal distribution: page 493 A test for homogeneity: page 490 A chi-square test for independence: page 500 Supplementary Exercises 12.22 A large supermarket conducted a consumer preference study by recording the brand of wheat bread purchased by customers in its stores. The supermarket carries four brands of wheat bread, and the brand preferences of a random sample of 200 purchasers are given in the following table: BreadPref Test the null hypothesis that the four brands are equally preferred by setting α equal to .05. Find a 95 percent confidence interval for the proportion of all purchasers who prefer Brand B. 12.23 An occupant traffic study was carried out to aid in the remodeling of a large building on a university campus. 
The building has five entrances, and the choice of entrance was recorded for a random sample of 300 persons entering the building. The results obtained are given in the following table: EntrPref Test the null hypothesis that the five entrances are equally used by setting a equal to .05. Find a 95 percent confidence interval for the proportion of all people who use Entrance III. 12.24 In a 1993 article in Accounting and Business Research, Meier, Alam, and Pearson studied auditor lobbying on several proposed U.S. accounting standards that affect banks and savings and loan associations. As part of this study, the authors investigated auditors’ positions regarding proposed changes in accounting standards that would increase client firms’ reported earnings. It was hypothesized that auditors would favor such proposed changes because their clients’ managers would receive higher compensation (salary, bonuses, and so on) when client earnings were reported to be higher. Table 12.9 summarizes auditor and client positions (in favor or opposed) regarding proposed changes in accounting standards that would increase client firms’ reported earnings. Here the auditor and client positions are cross-classified versus the size of the client firm. AuditPos Table 12.9: Auditor and Client Positions Regarding Earnings-Increasing Changes in Accounting Standards AuditPos a Test to determine whether auditor positions regarding earnings-increasing changes in accounting standards depend on the size of the client firm. Use α = .05. b Test to determine whether client positions regarding earnings-increasing changes in accounting standards depend on the size of the client firm. Use α = .05. c Carry out a graphical analysis to investigate a possible relationship between (1) auditor positions and the size of the client firm and (2) client positions and the size of the client firm. d Does the relationship between position and the size of the client firm seem to be similar for both auditors and clients? Explain. 12.25 In the book Business Research Methods (5th ed.), Donald R. Cooper and C. William Emory discuss a market researcher for an automaker who is studying consumer preferences for styling features of larger sedans. Buyers, who were classified as “first-time” buyers or “repeat” buyers, were asked to express their preference for one of two types of styling—European styling or Japanese styling. Of 40 first-time buyers, 8 preferred European styling and 32 preferred Japanese styling. Of 60 repeat buyers, 40 preferred European styling, and 20 preferred Japanese styling. a Set up a contingency table for these data. b Test the hypothesis that buyer status (repeat versus first-time) and styling preference are independent at the .05 level of significance. What do you conclude? c Carry out a graphical analysis to investigate the nature of any relationship between buyer status and styling preference. Describe the relationship. 12.26 Again consider the situation of Exercise 12.24. Table 12.10 summarizes auditor positions regarding proposed changes in accounting standards that would decrease client firms’ reported earnings. Determine whether the relationship between auditor position and the size of the client firm is the same for earnings-decreasing changes in accounting standards as it is for earnings-increasing changes in accounting standards. Justify your answer using both a statistical test and a graphical analysis. 
AuditPos2 Table 12.10: Auditor Positions Regarding Earnings-Decreasing Changes in Accounting Standards AuditPos2 12.27 The manager of a chain of three discount drug stores wishes to investigate the level of discount coupon redemption at its stores. All three stores have the same sales volume. Therefore, the manager will randomly sample 200 customers at each store with regard to coupon usage. The survey results are given in Table 12.11. Test the hypothesis that redemption level and location are independent with α = .01. Use the MINITAB output in Figure 12.5. Coupon Table 12.11: Results of the Coupon Redemption Study Coupon Figure 12.5: MINITAB Output of a Chi-Square Test for Independence in the Coupon Redemption Study 12.28 THE VIDEO GAME SATISFACTION RATING CASE Consider the sample of 65 customer satisfaction ratings given in Table 12.12. Carry out a chi-square goodness of fit test of normality for the population of all customer satisfaction ratings. Recall that we previously calculated x = 42.95 and s = 2.6424 for the 65 ratings. VideoGame Table 12.12: A Sample of 65 Customer Satisfaction Ratings VideoGame 12.29: Internet Exercise A report on the 1995 National Health Risk Behavior Survey, conducted by the Centers for Disease Control and Prevention, can be found at the CDC website [http://www.cdc.gov: Data & Statistics: Youth Risk Behavior Surveillance System : Data Products : 1995 National College Health Risk Behavior Survey or, directly, go to http://www.cdc.gov/nocdphp/dash/MMWRFile/ss4606.htm]. Among the issues addressed in the survey was whether the subjects had, in the prior 30 days, ridden with a driver who had been drinking alcohol. Does the proportion of students exhibiting this selected risk behavior vary by ethnic group? The report includes tables summarizing the “Ridden Drinking” risk behavior by ethnic group (Table 3) and the ethnic composition (Table 1) for a sample of n = 4,609 college students. The “Ridden Drinking” and ethnic group information is extracted from Tables 1 and 3 and is displayed as proportions or probabilities in the leftmost panel of the table below. Note that the values in the body of the leftmost panel are given as conditional probabilities, the probabilities of exhibiting the “Ridden Drinking” risk behavior, given ethnic group. These conditional probabilities can be multiplied by the appropriate marginal probabilities to compute the joint probabilities for all the risk behavior by ethnic group combinations to obtain the summaries in the center panel. Finally, the joint probabilities are multiplied by the sample size to obtain projected counts for the number of students in each “Ridden Drinking” by ethnic group combination. The “Other” ethnic group was omitted from the Table 3 summaries and is thus not included in this analysis. Is there sufficient evidence to conclude that the proportion of college students exhibiting the “Ridden Drinking” behavior varies by ethnic group? Conduct a chi-square test for independence using the projected count data provided in the rightmost panel of the summary table. (Data are available in MINITAB and Excel files, YouthRisk.mtw and YouthRisk.xls.) Test at the 0.01 level of significance and report an approximate p-value for your test. Be sure to clearly state your hypotheses and conclusion. YouthRisk Appendix 12.1: Chi-Square Tests Using MINITAB The instruction blocks in this section each begin by describing the entry of data into the MINITAB Data window. 
Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.1 for further information about entering data, saving data, and printing results when using MINITAB. Chi-square test for goodness of fit in Figure 12.1 on page 489 (data file: MicroWav.MTW): • Enter the microwave oven data from Tables 12.1 and 12.2 on page 487—observed frequencies in column C1 with variable name Frequency and market shares (entered as decimal fractions) in column C2 with variable name MarketShr. To compute the chi-square statistic: To compute the p-value for the test: We first compute the probability of obtaining a value of the chi-square statistic that is less than or equal to the computed value (=8.77857): • Select Calc : Probability Distributions : Chi-Square. • In the Chi-Square Distribution dialog box, click on “Cumulative probability.” • Enter 3 in the “Degrees of freedom” box. • Click the “Input constant” option and enter k1 into the corresponding box. • Enter k2 into the “Optional storage” box. • Click OK in the Chi-Square Distribution dialog box. This computes the needed probability and stores its value as a constant k2. • Select Calc : Calculator. • In the Calculator dialog box, enter PValue into the “Store result in variable” box. • In the Expression window, enter the formula 1 − k2, and click OK to compute the p-value related to the chi-square statistic. To display the p-value: Crosstabulation table and chi-square test of independence for the client satisfaction data as in Figure 12.2(b) on page 497 (data file: Invest.MTW): • Follow the instructions for constructing a cross-tabulation table of fund type versus level of client satisfaction as given in Appendix 2.1. • After entering the categorical variables into the “Cross Tabulation and Chi-Square” dialog box, click on the Chi-Square… button. • In the “Cross Tabulation—Chi-Square” dialog box, place checkmarks in the “Chi-Square analysis” and “Expected cell counts” check boxes and click OK. • Click OK in the “Cross Tabulation and Chi-Square” dialog box to obtain results in the Session window. The chi-square statistic can also be calculated from summary data by entering the cell counts from Table 12.2(b) and by selecting “Chi-Square Test (Table in Worksheet)” from the Stat : Tables sub-menu. Appendix 12.2: Chi-Square Tests Using Excel The instruction blocks in this section each begin by describing the entry of data into an Excel spreadsheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results when using Excel. Chi-square goodness of fit test in Exercise 12.10 on page 495 (data file: Invoice2.xlsx): • In the first row of the spreadsheet, enter the following column headings in order—Percent, Expected, Number, and ChiSqContribution. • Beginning in cell A2, enter the “percentage of invoice figures” from Exercise 12.10 as decimal fractions into column A. • Compute expected values. Enter the formula =500*A2 into cell B2 and press enter. Copy this formula through cell B6 by double-clicking the drag handle (in the lower right corner) of cell B2. • Enter the “number of invoice figures” from Exercise 12.10 into cells C2 through C6. • Compute cell Chi-square contributions. 
In cell D2, enter the formula =(C2 – B2)^2/B2 and press enter. Copy this formula through cell D6 by double-clicking the drag handle (in the lower right corner) of cell D2. • Compute the Chi-square statistic in cell D8. Use the mouse to select the range of cells D2.D8 and click the ∑ button on the Excel ribbon. • Click on an empty cell, say cell A15, and select the Insert Function button on the Excel ribbon. • In the Insert Function dialog box, select Statistical from the “Or select a category:” menu, select CHIDIST from the “Select a function:” menu, and click OK. • In the “CHIDIST Function Arguments” dialog box, enter D8 into the “X” box and 3 into the “Deg_freedom” box. • Click OK in the “CHIDIST Function Arguments” dialog box to produce the p-value related to the chi-square statistic in cell A15. Contingency table and chi-square test of independence similar to Figure 12.2(b) on page 497 (data file: Invest.xlsx): • Follow the instructions given in Appendix 2.2 for using a PivotTable to construct a crosstabulation table of fund type versus level of customer satisfaction and place the table in a new worksheet. To compute a table of expected values: • In cell B9, type the formula =$E4*B$7/$E$7 (be very careful to include the $ in all the correct places) and press the enter key (to obtain the expected value 12 in cell B9). • Click on cell B9 and use the mouse to point the cursor to the drag handle (in the lower right corner) of the cell. The cursor will change to a black cross. Using the black cross, drag the handle right to cell D9 and release the mouse button to fill cells C9.D9. With B9.D9 still selected, use the black cross to drag the handle down to cell D11. Release the mouse button to fill cells B10.D11. • To add marginal totals, select the range B9.E12 and click the ∑ button on the Excel ribbon. To compute the Chi-square statistic: • In cell B15, type the formula = (B4 – B9)^2/B9 and press the enter key to obtain the cell contribution 0.75 in cell B15. • Click on cell B15 and (using the procedure described above) use the “black cross cursor” to drag the cell handle right to cell D15 and then down to cell D17 (obtaining the cell contributions in cells B15.D17). • To add marginal totals, select the range B15.E18 and click the ∑ button on the Excel ribbon. • The Chi-square statistic is in cell E18 (=46.4375). To compute the p-value for the Chi-square test of independence: • Click on an empty cell, say E20. • Select the Insert Function button on the Excel ribbon. • In the Insert Function dialog box, select Statistical from the “Or select a category:” menu, select CHIDIST from the “Select a function:” menu, and click OK. • In the “CHIDIST Function Arguments” dialog box, enter E18 (the cell location of the chi-square statistic) into the “X” window and 4 into the “Deg_freedom” window. • Click OK in the “CHIDIST Function Arguments” dialog box to produce the p-value related to the chi-square statistic in cell E20. Appendix 12.3: Chi-Square Tests Using MegaStat The instructions in this section begin by describing the entry of data into an Excel worksheet. Alternatively, the data may be loaded directly from the data disk included with the text. The appropriate data file name is given at the top of each instruction block. Please refer to Appendix 1.2 for further information about entering data, saving data, and printing results in Excel. Please refer to Appendix 1.3 for more information about using MegaStat. 
Contingency table and chi-square test of independence in Figure 12.2(a) on page 497 (data file: Invest.xlsx): Chi-square goodness of fit test for the scanner panel data in Exercise 12.8 on page 494 (data file: ScanPan.xlsx): • Enter the scanner panel data in Exercise 12.8 (page 494) as shown in the screen with the number of purchases for each brand in column C and with the market share for each brand (expressed as a percentage) in column D. Note that the total number of purchases for all brands equals 19,115 (which is in cell C11). • In cell E4, type the cell formula =D4*19115 and press enter to compute the expected frequency for the Jiff—18 ounce brand/size combination. Copy this cell formula (by double clicking the drag handle in the lower right corner of cell E4) to compute the expected frequencies for each of the other brands in cells E5 through E10. • Select Add-Ins : MegaStat : Chi-square/Crosstab : Goodness of Fit Test. • In the “Goodness of Fit Test” dialog box, click in the “Observed values Input range” window and enter the range C4.C10. Enter this range by dragging with the mouse—the autoexpand feature cannot be used in the “Goodness of Fit Test” dialog box. • Click in the “Expected values Input range” window, and enter the range E4.E10. Again, enter this range by dragging with the mouse. • Click OK in the “Goodness of Fit Test” dialog box. Chi-square test for independence with contingency table input data in the depreciation situation of Exercise 12.18 on page 502 (data file: DeprMeth.xlsx): • Enter the depreciation method contingency table data in Table 12.5 on page 502 as shown in the screen—depreciation methods in rows and countries in columns. • Select Add-Ins : MegaStat : Chi-square/Crosstab : Contingency Table. • In the “Contingency Table Test for Independence” dialog box, click in the Input Range window and (by dragging the mouse) enter the range A4.D7. Note that the entered range may contain row and column labels, but the range should not include the “total row” or “total column.” • In the list of Output Options, check the Chi-square checkbox to obtain the results of the chi-square test for independence. • If desired, row, column, and cell percentages can be obtained by placing checkmarks in the “% of row,” “% of column,” and “% of total” checkboxes in the list of Output Options. Here we have elected to not request these percentages. • Click OK in the “Contingency Table Test for Independence” dialog box. (Bowerman 486) Bowerman, Bruce L. Business Statistics in Practice, 5th Edition. McGraw-Hill Learning Solutions, 022008. .
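A Python alternative to the MINITAB, Excel, and MegaStat steps above is sketched below (SciPy is an assumption; it is not used in the text). The full cell counts of Table 12.4 are not reproduced in this extract, so the observed matrix shown here is illustrative only: it agrees with the row totals (30, 30, 40), the column totals (40, 40, 20), the individual cells quoted in the chapter (15 bond–high, 24 annuity–medium, 1 annuity–high), and the chi-square value of about 46.44 reported above, but the remaining cells are assumed.

```python
# Sketch of a chi-square test for independence for a fund type x satisfaction
# contingency table, assuming SciPy. The cell counts are illustrative (see the
# note above); replace them with the actual counts from Table 12.4.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [15, 12,  3],   # bond fund:            high, medium, low
    [24,  4,  2],   # stock fund
    [ 1, 24, 15],   # tax-deferred annuity
])

chi_sq, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-square = {chi_sq:.4f}")   # about 46.44
print(f"p-value    = {p_value:.6f}")  # well below .001
print(f"df         = {dof}")          # (3-1)(3-1) = 4
print("expected cell frequencies under independence:")
print(expected)                       # each entry equals (row total)(column total)/n
```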

 

T-test (1 of 2)

Difference testing is used primarily to determine whether there is a detectable difference between products, services, people, or situations. These tests are often conducted in business situations to:

· Ensure a change in formulation or production introduces no significant change in the end product or service.

· Substantiate a claim of a new or improved product or service.

· Confirm that a new ingredient/supplier does not affect the perceived attributes of the product or service.

· Track changes during shelf life of a product or the length of time of a service.

Differences Between Two Independent Sample Means:

Coke vs. Pepsi. Independent-sample t-tests are used to compare the means of two independently sampled groups (e.g., do those drinking Coke differ from those drinking Pepsi on a performance variable, or on the number of cans consumed in one week?). The individuals are randomly assigned to the Coke and Pepsi groups. Using a significance level of .05 (a corresponding confidence level of 95%), the researcher concludes that the two groups differ significantly in their means (average consumption of Coke and Pepsi over a one-week period) if the calculated t value meets or exceeds the required critical value. If the t value does not meet the required critical value, the researcher simply concludes that no statistically significant difference was detected, and no further analysis is required. Presented below is a more usable situation.

Using the raw data and formula above, the t-test value, when calculated properly, is 2.43. Always remember that S denotes the standard deviation and that the mean is often shown by the capital letter M rather than a bar over a capital X. Using the appropriate t table in your textbook, find the critical value of t at the .05 significance level; the value you should find is 1.761.
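The same comparison can be sketched in a few lines of Python (NumPy and SciPy are assumptions; the module itself works from the t table). The module's raw data are not reproduced in this extract, so the two samples below are hypothetical; note, however, that scipy.stats.t.ppf(0.95, 14) returns the 1.761 critical value quoted above, the one-tailed .05 value for 14 degrees of freedom.

```python
# Sketch of an independent-samples t test, assuming NumPy and SciPy. The two
# samples below are hypothetical stand-ins (e.g., cans consumed per week by
# Coke vs. Pepsi drinkers); substitute the module's raw data to reproduce 2.43.
import numpy as np
from scipy import stats

coke  = np.array([12, 15, 9, 14, 11, 13, 16, 10])   # hypothetical values
pepsi = np.array([10, 11, 8, 12, 9, 10, 13, 9])     # hypothetical values

t_stat, p_two_tailed = stats.ttest_ind(coke, pepsi, equal_var=True)
df = len(coke) + len(pepsi) - 2                      # 14 degrees of freedom here

t_crit = stats.t.ppf(0.95, df)                       # one-tailed .05 critical value
print(f"t = {t_stat:.3f}, critical t(.05, df={df}) = {t_crit:.3f}")
print("significant difference" if t_stat >= t_crit
      else "no significant difference detected")
```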

 T-test (2 of 2)

Differences Between Two Means of Correlated Samples

Correlated (paired) t-test procedures are used to determine whether a measurement variable differs significantly between two related sets of observations, such as pre-test and post-test measurements on the same group. Oftentimes, when there is a statistically significant pre/post relationship, the business manager can use the first measurement values to predict the second in future situations without having to run a post-test.

Example: Using the same data presented above, let us assume that we have not two independent groups but the same group measured under two different conditions—a noisy production environment and a noise-free production environment.

Using the raw data and formula above, the t-test value, when calculated properly, is 3.087. Using the appropriate t table in your textbook, find the critical value of t at the .05 significance level; the value you should find is 1.895.
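A paired-test sketch in Python follows (NumPy and SciPy are assumptions). The module's data are again not reproduced here, so the pre/post measurements are hypothetical; the 1.895 critical value quoted above is the one-tailed .05 value for 7 degrees of freedom, which scipy.stats.t.ppf(0.95, 7) returns.

```python
# Sketch of a correlated (paired) t test, assuming NumPy and SciPy. The values
# are hypothetical: the same 8 workers measured under noisy and noise-free
# conditions; substitute the module's data to reproduce the 3.087 value.
import numpy as np
from scipy import stats

noisy      = np.array([42, 39, 45, 40, 38, 44, 41, 37])   # hypothetical values
noise_free = np.array([45, 41, 46, 44, 40, 47, 42, 40])   # hypothetical values

t_stat, p_two_tailed = stats.ttest_rel(noise_free, noisy)
df = len(noisy) - 1                       # 8 pairs give 7 degrees of freedom

t_crit = stats.t.ppf(0.95, df)            # one-tailed .05 critical value: 1.895
print(f"t = {t_stat:.3f}, critical t(.05, df={df}) = {t_crit:.3f}")
```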

 

t Test for Differences in Two Means

Data
Hypothesized Difference: 0
Level of Significance: 0.05

Population 1 Sample (Female HR Directors)
Sample Size: 10
Sample Mean: 62,200
Sample Standard Deviation: 9,330.95

Population 2 Sample (Male HR Directors)
Sample Size: 10
Sample Mean: 63,700
Sample Standard Deviation: 6,912.95

Intermediate Calculations
Population 1 Sample Degrees of Freedom: 9
Population 2 Sample Degrees of Freedom: 9
Total Degrees of Freedom: 18
Pooled Variance: 67,427,752.8025
Difference in Sample Means: -1,500
t Test Statistic: -0.408466946

Two-Tail Test
Lower Critical Value: -2.1009220369
Upper Critical Value: 2.1009220369
p-Value: 0.687749195
Decision: Do not reject the null hypothesis

Output

Descriptive statistics

                              FEMALE          MALE
count                         10              10
mean                          62,200.00       63,700.00
sample variance               87,066,666.67   47,788,888.89
sample standard deviation     9,330.95        6,912.95

Salary data

FEMALE    MALE
50000     58000
75000     69000
72000     73000
67000     67000
54000     55000
58000     63000
52000     53000
68000     70000
71000     69000
55000     60000

Mean:     62200            63700
S.D.:     9330.952077182   6912.9508090893
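The summary statistics above can be reproduced directly from these salary figures; here is a minimal NumPy sketch (NumPy is an assumption; the original analysis was done in Excel).

```python
# Reproducing the descriptive statistics above from the raw salary data,
# assuming NumPy. Note that ddof=1 gives the *sample* variance and standard
# deviation, matching the Excel output.
import numpy as np

female = np.array([50000, 75000, 72000, 67000, 54000, 58000, 52000, 68000, 71000, 55000])
male   = np.array([58000, 69000, 73000, 67000, 55000, 63000, 53000, 70000, 69000, 60000])

for label, x in [("FEMALE", female), ("MALE", male)]:
    print(f"{label}: n = {x.size}, mean = {x.mean():,.2f}, "
          f"sample variance = {x.var(ddof=1):,.2f}, "
          f"sample std dev = {x.std(ddof=1):,.2f}")
# FEMALE: mean 62,200.00, sample variance 87,066,666.67, std dev 9,330.95
# MALE:   mean 63,700.00, sample variance 47,788,888.89, std dev 6,912.95
```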


CALCULATE T TEST

Calculate the “t” value for independent groups for the following data using the formula provided in the attached Word document. Using the raw measurement data presented, determine whether or not there exists a statistically significant difference between the salaries of female and male human resource managers using the appropriate t-test. Develop a testable hypothesis, confidence level, and degrees of freedom. Report the required “t” critical values based on the degrees of freedom. Show calculations.

Answer

The null hypothesis tested is

H0: There is no significant difference between the average salaries of female and male human resource managers. (µ1 = µ2)

The alternative hypothesis is

H1: There is a significant difference between the average salaries of female and male human resource managers. (µ1 ≠ µ2)

The test statistic used is

t = (M1 - M2) / S_DM

where

S_DM = sqrt{ [ ((N1 - 1)s1^2 + (N2 - 1)s2^2) / (N1 + N2 - 2) ] x (1/N1 + 1/N2) }

Here M1 = 62,200, M2 = 63,700, s1 = 9,330.95, s2 = 6,912.95, and N1 = N2 = 10 (see the Excel sheet). Then

S_DM = sqrt{ [ ((10 - 1)(9330.95)^2 + (10 - 1)(6912.95)^2) / (10 + 10 - 2) ] x (1/10 + 1/10) } = 3672.267768

Therefore the test statistic is

t = (62,200 - 63,700) / 3672.267768 = -0.408466946
Degrees of freedom = N1 + N2 – 2 = 10 + 10 – 2 = 18
Let the significance level be 0.05.

Rejection criterion: Reject the null hypothesis if the absolute value of the calculated t exceeds the critical value of t at the 0.05 significance level (two-tailed). 
The critical values can be obtained from the Student's t table with 18 d.f. at the 0.05 significance level.
Upper critical value = 2.101
Lower critical value = -2.101

[Distribution plot: Student's t density with df = 18, showing two-tailed rejection regions of area 0.025 in each tail beyond the critical values -2.100 and +2.100.]
Conclusion: Fail to reject the null hypothesis. The sample does not provide enough evidence to support the claim that there is a significant difference between the salaries of female and male human resource managers.
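As a check on the hand calculation, the same pooled-variance t test can be run in Python (SciPy is an assumption; the assignment itself uses Excel and the t table). The salary data are those listed in the worksheet above.

```python
# Verifying the hand calculation above with SciPy's pooled-variance t test.
import numpy as np
from scipy import stats

female = np.array([50000, 75000, 72000, 67000, 54000, 58000, 52000, 68000, 71000, 55000])
male   = np.array([58000, 69000, 73000, 67000, 55000, 63000, 53000, 70000, 69000, 60000])

t_stat, p_value = stats.ttest_ind(female, male, equal_var=True)
t_crit = stats.t.ppf(0.975, df=18)        # two-tailed .05 critical value, about 2.101

print(f"t = {t_stat:.6f}")                # -0.408467, matching the calculation above
print(f"p-value = {p_value:.6f}")         # 0.687749, well above .05
print(f"critical values = +/-{t_crit:.3f}")
print("Fail to reject H0" if abs(t_stat) < t_crit else "Reject H0")
```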