Final Paper: Research Proposal
Review the Example Research Proposal provided in the course materials. Design a research study on the topic of the study selected in Week One and critiqued in Week Three. Your design should seek to resolve the limitations you identified in the study you critiqued. Your paper must address all of the components required in the “Methods” section of a research proposal:
- State the research question and/or hypothesis.
- Specify the approach (qualitative or quantitative), research design, sampling strategy, data collection procedures, and data analysis techniques to be used.
- If the design is quantitative, also describe the variables, measures, and statistical tests you would use.
- Analyze ethical issues that may arise and explain how you would handle these issues.
Your Final Paper must be six to eight pages in length (excluding title and reference pages) and formatted according to APA style as outlined in the Ashford Writing Center. Utilize a minimum of six peer-reviewed sources that were published within the last 10 years, in addition to the textbook, that are documented in APA style as outlined in the Ashford Writing Center. The sources should consist of the following:
- One source should be the article you critiqued in the Week Three assignment.
- At least two sources should be about the research methodology you have chosen for your study.
- At least one source should be on ethical issues in research.
- The remaining sources may be about anything pertinent to your study.
In accordance with APA style, all references listed must be cited in the body of the paper.
Required Sections and Subsections (use these headings in your paper)
- Introduction – Introduce the research topic, explain why it is important, and present your research question and/or hypothesis.
- Literature Review – Summarize the current state of knowledge on your topic, making reference to the findings of previous research studies (including the one you critiqued in Week Three). Briefly analyze and critique these studies and mention the research methods that have previously been used to study the topic. State whether your proposed study is a replication of a previous study or a new approach using methods that have not been used before. Be sure to properly cite all of your sources in APA style.
- Methods
Design – Indicate whether your proposed study is qualitative or quantitative in approach. Identify the specific research design, using one of the designs we have studied in Weeks Three through Five, and indicate whether it is experimental or non-experimental. Evaluate your chosen design and explain why you believe this design is appropriate for the topic and how it will provide the information you need to answer the research question. Cite sources on research methodology to support your choices.
Participants – Identify and describe the sampling strategy you would use to recruit participants for your study. Estimate the number of participants you would need and explain why your sampling method is appropriate for your research design and approach.
Procedure/Measures – Apply the scientific method by describing the steps you would use in carrying out your study. Indicate whether you will use any kind of test, questionnaire, or measurement instrument. If using an existing published instrument, provide a brief description and cite your source. If you are creating a questionnaire, survey, or test, describe the types of information you will gather and explain how you would establish the validity and reliability. If you are not using such an instrument, describe how you would collect the data.
Data Analysis – Describe the statistical techniques (if quantitative) or the analysis procedure (if qualitative) you plan to use to analyze the data. Cite at least one source on the chosen analysis technique (from your Week Two assignment).
Ethical Issues – Analyze the impact of ethical concerns on your proposed study, such as confidentiality, deception, informed consent, potential harm to participants, conflict of interest, and IRB approval. After analyzing the ethical issues that apply to your research proposal, indicate what you would do to handle these concerns.
- Conclusion – Briefly summarize the major points from your paper and reiterate why your proposed study is needed.
Writing the Final Paper
The Final Paper:
- Must be six to eight double-spaced pages in length, and formatted according to APA style as outlined in the Ashford Writing Center.
- Must include a title page with the following:
Title of paper
Student’s name
Course name and number
Instructor’s name
Date submitted
- Must begin with an introductory paragraph that has a succinct thesis statement.
- Must address the topic of the paper with critical thought.
- Must end with a conclusion that reaffirms your thesis.
- Must use at least six peer-reviewed sources that were published within the last 10 years, in addition to the textbook.
- Must document all sources in APA style, as outlined in the Ashford Writing Center.
- Must include a separate reference page, formatted according to APA style as outlined in the Ashford Writing Center.
CANVAS – PSY326 Week Five
Welcome to Week 5 of Psychology 326, Research Methods at Ashford University. In this final
week of the course, you will learn about what is considered the gold standard of research
designs: the experiment. Although this category of designs involves the highest level of control
by the researcher, it is not the answer to every research question or problem. Sometimes a
descriptive or correlational design is best for the situation.
Experimental designs are good for helping establish causation. Yet the most notorious ethical
breaches have occurred in the context of experimental research. The potential for harming
participants, whether accidental or deliberate, is greatest when some kind of manipulation in the
form of a treatment or intervention is given to participants.
That’s why this type of research design is subject to the highest level of scrutiny by IRBs and
the government agencies that regulate them. In your discussion post this week, be prepared to
talk about how and why a quasi-experimental design (an "almost experiment") might solve
some ethical problems that would come up in a true experiment.
The Week 5 quiz covers Chapter 5 of the textbook. The quiz will include questions about
experimental and quasi-experimental research designs and threats to the internal and external
validity of experiments. The final paper is a research proposal. It is not a report of either real or
hypothetical completed research, but a detailed plan for a new research study on the topic of the
study you critiqued in Week 3.
The assignment instructions provide details about what must be included in your paper, including
section headings, and the information that needs to be covered in each section. Notice that there
is no results section. This is because the research proposal is not about a study that’s already been
done. It’s the kind of document you would send to the IRB when requesting permission to begin
a study.
You need to include anything the IRB might ask about, such as how you will safeguard the
confidentiality of your participants, whether any deception will be involved, and if so why it is
necessary, how you will protect the participants from any kind of harm, and plans for debriefing.
If you choose a non-experimental research design, some of these items won’t apply to you. The
proposal does have a conclusion section, but this is where you summarize the main points of the
research plan and reiterate why you believe the study is important and should be allowed to be
done.
Don’t try to draw any conclusions about the results of the study. As always, don’t forget to make
use of the resources in the Ashford library, especially the Research Methods Research Guide, to
find information and peer-reviewed sources to support your discussion post and your final
research proposal. If you have any questions or concerns, contact your instructor. Have a great
week.
PSY326 Research Methods Week 5 Guidance
Welcome to the final week! View the video on the Week 5 overview screen for an introduction to the topics and assignments. Read Chapter 5 of your textbook. An article by Dr. Anthony Onwuegbuzie about threats to validity is recommended reading (see the course guide). To get the article, go to the Ashford Library, find the ERIC database and search for “Onwuegbuzie” as the author and “internal and external validity” as either the title or subject.
After completing this instructional unit, you will:
· Evaluate the key components of the experimental design.
· Examine threats to both internal and external validity in experimental research.
In this week’s discussion, you will describe different ways that experimental and quasi-experimental studies can be designed, explain the differences between them, and discuss the ways that an independent variable can be manipulated in an experiment. Be sure to include an example of either an experiment or a quasi-experiment, and explain why the chosen design is appropriate for the situation in the example.
Remember that all discussions should cite at least two scholarly sources, so be sure to search the Ashford Library resources (including the Research Methods research guide) for journal articles that extend the information given in the textbook on experimental research. All references should be cited in APA format. See the Ashford Writing Center, under Learning Resources in the left navigation panel, for examples of correct APA style.
This week’s quiz will cover the concepts related to experimental and quasi-experimental research designs discussed in Chapter 5 of the textbook. The quiz is due on Sunday.
In the final paper due this week, you will prepare a research proposal for a new study on the topic of the study you critiqued in Week 3. What would you do to confirm or extend the research done in that study? Would you try a different research design or use the same one? Carefully review the final paper instructions and the feedback from your Week 3 assignment to help you prepare the final paper. Use the section headings from the assignment instructions to organize the paper and ensure that you have included all of the required information. Note that the section headings for the final paper are slightly different from the headings for the critique assignment.
The experiment has historically been considered the “gold standard” of research methods, if the purpose of the study is to establish causation. This is because the elements of control imposed in a well-conducted experiment make it possible to determine that the independent variable (the hypothesized cause) is the only factor that is influencing the dependent variable (the observed effect). Correlational designs can establish a relationship between variables, but they cannot prove a cause and effect relationship. Experimental research has traditionally been associated with quantitative research methods because in order to compare results and say that something is better or higher than something else, numbers have to be used in the comparison.
Experiments use all of the qualities of the scientific method: objectivity, precise measurements, control of other possible influencing factors, careful logical reasoning, and replication, as described in your textbook (Newman, 2016). To be considered a true experiment, a research study must have all three of these characteristics: (1) manipulation of an independent variable; (2) random assignment of participants to groups or conditions; and (3) control of extraneous variables. A quasi-experiment is a study that has some, but not all, of these characteristics. Variables which are presumed causes but cannot be manipulated by the researcher are sometimes called “quasi-independent” variables. If quasi-independent variables such as gender or race are of interest in a study, they must be used in conjunction with an independent variable that can be manipulated, such as an aspect of the environment, in order for the study to count as a true experiment. Without a manipulated independent variable, the study would be a quasi-experiment. Think about situations when it is either impossible or unethical to randomly assign people to conditions or to manipulate a variable. In these situations, a quasi-experimental design may be called for.
Random assignment means that after you have recruited a sample of participants, you use a random process to divide them into two groups, usually based on the values of the independent variable. Randomly assigning participants to groups guards against possible bias and helps assure that the groups will be equal on unknown or extraneous factors. The two groups are called treatment (or experimental) and control.
The treatment group receives some amount of the treatment and the control group receives either no treatment or a standard treatment – this is also called manipulation of the independent variable. Everything about the two groups should be the same except for their condition on the independent variable. After the treatment period is over, the researcher measures the dependent variable for all participants and compares the scores for the two groups. If the treatment group has a significantly different score than the control group, you can be fairly certain that the treatment was what made the difference.
For example, suppose you want to find out if a new method of teaching science is really better than the way it is currently being done. You would get a random sample of appropriate students and randomly assign them to groups. (If you can randomly assign students to groups, this would be an experiment, but if you have to use existing classes as the groups, it would be considered a quasi-experiment.) You have to watch out for possible problems, called threats to validity, such as students from different groups talking to each other about how they are being taught (this is called diffusion of treatment). Random assignment prevents bias (i.e., assigning your favorite students to a particular group) and it also helps you make sure that you don’t have all of the students who were already better at science in the same group. The groups should be as equal as possible at the beginning of the study.
In this example, the independent variable is the teaching method, which is assigned by the researcher. The dependent variable would be the score on a science test that all of the participants would take at the end of the study. If all goes well and the groups are really random and have an equal average on unknown factors that might affect knowledge or ability in the subject, you will be able to rely on the results of the comparison of test scores. So, if the treatment group’s average score is much higher than the control group’s average score, you can be reasonably confident that the new teaching method works better than the old method. If the average scores for the two groups are about the same, then you can conclude that both methods work just as well. Of course, if the control group has a higher average score than the treatment group, you won’t want to switch to the new teaching method!
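The random-assignment logic in this example can be sketched in a short simulation. This is purely illustrative and not part of the assignment: the participant pool, group sizes, and test scores below are invented, and the function name `random_assignment` is my own; in a real study you would follow the mean comparison with an appropriate significance test (e.g., an independent-samples t-test).

```python
import random
import statistics

def random_assignment(participants, seed=None):
    """Randomly split a recruited sample into treatment and control groups.

    Shuffling before splitting guards against bias (e.g., putting your
    favorite or strongest students in one group) and helps equalize the
    groups on unknown, extraneous factors.
    """
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]  # (treatment group, control group)

# Hypothetical end-of-study science test scores (the dependent variable),
# measured after the treatment group received the new teaching method
# (the manipulated independent variable) and the control group received
# the standard method.
treatment_scores = [82, 88, 79, 91, 85, 87]
control_scores = [74, 80, 71, 78, 76, 73]

# Compare the group means; a large difference suggests (pending a
# significance test) that the treatment made the difference.
diff = statistics.mean(treatment_scores) - statistics.mean(control_scores)
print(f"Mean difference (treatment - control): {diff:.1f}")
```

If you had to use two existing classes as the groups instead of calling `random_assignment`, the same comparison would still run, but the study would count as a quasi-experiment rather than a true experiment.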
There are two aspects of experimental validity – internal and external. Internal validity has to do with being able to be sure that the independent variable in your experiment was really the cause of any observed change in the dependent variable. Without internal validity, you may as well not bother doing the experiment, because the results cannot be trusted.
External validity, on the other hand, is desirable, but not essential in the same way that internal validity is. External validity concerns whether the results of the research can be generalized to people (or animals) other than the participants in the research sample.
Both aspects of validity need to be protected from threats, a few of which are described in Chapter 5 of the textbook. The Onwuegbuzie (2000) article contains a much more complete and detailed description of threats to internal and external validity. Controlling extraneous variables (things that might influence the dependent variable but are not part of the experiment) will help resolve many internal validity threats, but the more controls that are instituted to increase internal validity the less likely it is that external validity will be strong. Researchers must find a balance between internal and external validity.
If you have any questions about this week’s readings or assignments, email your instructor or post your question on the “Ask Your Instructor” forum. Remember, use the forum only for questions that may concern the whole class. For personal issues, use email.
References
Newman, M. (2016). Research methods in psychology (2nd ed.). San Diego, CA: Bridgepoint Education, Inc.
Onwuegbuzie, A. J. (2000). Expanding the framework of internal and external validity in quantitative research. Retrieved from http://eric.ed.gov/
REVIEW ARTICLE
Exploring positive pathways to care for members of
the UK Armed Forces receiving treatment for PTSD:
a qualitative study
Dominic Murphy1*, Elizabeth Hunt1, Olga Luzon2 and Neil Greenberg1
1King’s Centre for Military Health Research, King’s College London, London, UK; 2Department of Clinical Psychology, Royal Holloway University, London, UK
Objective: To examine the factors which facilitate UK military personnel with post-traumatic stress disorder
(PTSD) to engage in help-seeking behaviours.
Methods: The study recruited active service personnel who were attending mental health services, employed a
qualitative design, used semi-structured interview schedules to collect data, and explored these data using
interpretative phenomenological analysis (IPA).
Results: Five themes emerged about how participants were able to access help; having to reach a crisis point
before accepting the need for help, overcoming feelings of shame, the importance of having an internal locus
of control, finding a psychological explanation for their symptoms and having strong social support.
Conclusions: This study reported that for military personnel who accessed mental health services, there were a
number of factors that supported them to do so. In particular, factors that combated internal stigma, such as
being supported to develop an internal locus of control, appeared to be critical in supporting military
personnel to engage in help-seeking behaviour.
Keywords: Military health; PTSD; depression; pathways; stigma; barriers
*Correspondence to: Dominic Murphy, KCMHR, Weston Education Centre, Cutcombe Road, SE5 9PR
London, UK, Email: dominicmurphy100@gmail.com
For the abstract or full text in other languages, please see Supplementary files under Article Tools online
Received: 17 June 2013; Revised: 4 October 2013; Accepted: 20 November 2013; Published: 17 February 2014
Since 2002, the UK and US militaries have conducted highly challenging operations in
Afghanistan and Iraq. These military operations have been the focus of a number of
large-scale epidemiological research studies, which have investigated the psychological
health of US and UK service personnel. Studies in the
United States have observed rates of post-traumatic stress
disorder (PTSD) in deployed personnel to be between
8 and 18% (Hoge et al., 2004; Smith et al., 2008). Further,
13% of participants met criteria for alcohol problems
and 18% for symptoms of anxiety and depression, with a
very high co-morbidity rate between these disorders and
PTSD (Riddle et al., 2007; Smith et al., 2008). This
increase in the rate of PTSD following deployment has
been replicated prospectively (Vasterling et al., 2006).
However, in the UK, the effects of the conflict upon the
mental health of service personnel have been quite
different.
The most extensive UK epidemiological study of
service personnel since 2003 has been carried out at
King’s College London. This study is based on a
randomly selected representative sample of the UK
military, and in 2006, this study reported rates of PTSD
to be 4% and symptoms of common mental health
problems (including anxiety and depression) to be 20%
(Hotopf et al., 2006); higher rates of PTSD (6%) were
found in combat troops and reserve forces. These rates
remained reasonably constant at the second wave of data
collection in 2010 (Fear et al., 2010). However, figures
released by the Ministry of Defence (MoD) demonstrate
substantially lower rates of personnel accessing services
for these problems, between 4–4.5% and 0.8–1.2%,
respectively, over the past 3 years (Defence Analytical
Services Agency, 2011). This is supported by research
that reported that only 23% of UK service personnel who
meet criteria for a mental health diagnosis are receiving
any support from mental health services (Iversen et al.,
2010). Of those who engaged in help-seeking, 77% were
getting treatment, with 56% receiving medication, 51%
psychological therapy and 3% inpatient
treatment.
European Journal of Psychotraumatology 2014, 5: 21759 – http://dx.doi.org/10.3402/ejpt.v5.21759
© 2014 Dominic Murphy et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 Unported (CC-BY 4.0) License (http://creativecommons.org/licenses/by/4.0/).
A study within the UK Armed Forces followed up, 3 years later,
service personnel who had been involved in a 6-year
longitudinal study (Iversen et al., 2005b).
The study observed that most ex-service personnel do
well once they leave. However, those who had a mental
health problem when they left the Armed Forces were
substantially more likely to be suffering from a mental
health problem and be unemployed 3 years after leaving
(Iversen et al., 2005b). In addition, having a mental
health problem predicted leaving the Armed Forces and
mental health status remained constant after leaving
(Iversen et al., 2005b).
As documented above, only a modest number of
military personnel experiencing mental health difficulties
are able to access treatment, and little is known about the
treatment experiences of military personnel who do
access services (Iversen et al., 2009). What we do know
is that many ex-service personnel are able to get treatment
from the NHS, which provides a range of specialist
services. Previous research has identified a number of
barriers that may explain the reluctance to access services
(Britt, Wright, & Moore, 2012; Gould et al., 2010; Iversen
et al., 2011; Kim, Thomas, Wilk, Castro, & Hoge, 2010).
These barriers broadly fit within three categories: internal
stigma (including self-stigma), external stigma (including
public stigma and mistrust in services), and access factors
(including lack of knowledge of available services).
Several trials have been conducted to improve the number
of people seeking treatment by aiming to reduce stigma.
A review of these trials concluded that there has been
little evidence of the efficacy of these interventions
(Mulligan, Fear, Jones, Wessely, & Greenberg, 2011).
The current study aims to investigate the specific
pathways to accessing mental health services for members
of the UK Armed Forces: in particular, to elucidate
factors that support individuals to access services and,
where barriers exist, how these are overcome. This is in
line with the agenda of military occupational mental
health services that have prioritised the importance of
supporting individuals to access services at the earliest
opportunity.
Methods
Setting & design
This study utilised a sample of UK service personnel who
are accessing defence mental health services. Two military
departments of community mental health (DCMHs)
located in the south east of England were selected as
they were geographically close to the investigating team;
DCMHs provide services to all military personnel. The
MoD and RHUL ethics committees granted ethical
approval for this study.
A qualitative methodology was adopted for this study
due to the exploratory nature of the research questions
under investigation. The aim of the research questions
was to understand the lived experiences of participants
during their pathways to accessing mental health services,
and interpretative phenomenological analysis (IPA) has
been argued to be the most appropriate qualitative
analytic approach to do this (Smith, Flowers, & Larkin,
2009).
Participants
A sample size of between 8 and 10 participants was
decided upon as informed by the selection of IPA (Smith
& Osborn, 2008). An ad hoc sampling strategy was used
for this study. The lead author (D. M.) met clinicians at
the DCMHs and explained the inclusion and exclusion
criteria. Clinicians were then requested to ask the clients
who met these criteria whether they wished to participate
in the study. Inclusion criteria for selection into the study
included having a diagnosis of either PTSD or depression
and currently receiving treatment. Individuals were not
selected if they were in the process of being medically
discharged from the military due to disciplinary reasons
(this exclusion criterion was requested by the MoD ethics
committee and the authors do not have access to the
reasons why service personnel were being discharged), or
if there was a clinical reason that meant it would not be
appropriate for the individual to take part in the study. In
general, these clinical reasons were if clients were new to
the service. Clinicians were concerned that the study may
be seen as an additional source of stress at a time when
clients were first engaging in treatment and could have
potentially created a barrier to their engagement in
treatment.
Materials
A semi-structured interview schedule was used. Broadly,
the aim of the interview schedule was to understand the
different pathways that participants took to access
services, including which factors enabled them to do so,
and how they overcame potential internal and external
barriers. The interview schedule was piloted with three
individuals who were accessing defence mental health
services. The aim of this was to ensure that the questions
were understandable and to check whether additional
questions needed to be added. Following this, the interview
schedule was refined taking into account feedback
from a number of pilot interviews. This included advice
about removing a number of questions and clarifying the
stems of several questions.
Participants were also asked to complete two measures
to record symptoms of mental illness. The Post Traumatic
Checklist (PCL-C) is a self-report 17-item measure of the
17 DSM-IV symptoms of PTSD (Weathers & Ford, 1996).
The PCL-C has been previously validated against a
clinical interview, which recommended using a cut-off
of 50 or more (Blanchard, Jones-Alexander, Buckley, &
Forneris, 1996). The Patient Health Questionnaire (PHQ-9)
is a self-report measure that is based directly upon the
DSM-IV criteria for depression and includes nine items.
The PHQ-9 is scored from 0 to 27, and scores give an
indication of symptom severity; scores between 15 and 19
indicate moderate to severe depression and a score of 20
or above indicates major depression (Kroenke & Spitzer,
2002). Participants were also asked a number of questions
about their demographic characteristics.
Procedure
Recruitment was carried out between March 2012 and
June 2012. The DCMH staff were approached, and the
inclusion and exclusion criteria for the study were
discussed and a list of potential participants was drawn up.
After initial consent had been granted for their details to be
passed on from their treating clinician, potential
participants were contacted to discuss the study, seek consent for
them to be recruited, and find a suitable date and time to
conduct the interview.
Analysis
The first stage of data analysis was to collate the
demographic characteristics and data collected through the
standardised measures (PCL-C and PHQ-9). The second
stage involved analysing the qualitative data in accordance
with published guidelines for conducting IPA (Smith &
Osborn, 2008; Willig, 2008). In brief, this involved working
through a number of different stages. The first stage was to
become familiar with the first participant’s transcript. The
second stage was to make initial notations for ideas
and themes in the text. The notations remained close to
the participant’s words. The third stage was to develop
emerging themes by re-reading the initial notations and
assigning labels. The aim of these labels was to capture the
essence of what the participant had described. The fourth
stage was to search for connections between emerging
themes. The list of labels was scrutinised and emergent
themes that appeared to be connected to each other were
grouped together under super-ordinate themes. Super-
ordinate themes were broader in scope than emergent
themes and contained a number of associated sub-themes.
This process was then repeated for the next participant’s
transcript. Once analysis had been completed for each
transcript, a final master list of super-ordinate and
sub-themes was generated. During this stage, differences and
similarities between cases were noted. At this stage, themes
between transcripts were grouped together and re-labelled
where appropriate.
Results
Sample
Recruitment was carried out at two DCMHs. The sample
consisted of 8 participants, with four from each DCMH.
For the purposes of the study, participants were assigned
pseudonyms to protect their anonymity.
Data were collected on participants’ socio-demographic
characteristics to situate the sample; these are described in
Table 1. The majority of the sample were male (six out of
eight), in a relationship (7/8), had children (6/8), were
Other Ranks and not officers (5/8), were British (7/8) and
reported their ethnicity to be white (8/8). The ages of
participants ranged from early 20s to mid-50s, with the
majority of participants aged between mid-20s and mid-30s.
The lengths of service varied from 4 to 31 years, with
the mean length of service approximately 13 years. Nearly
50% of the sample was in the Royal Navy and 50% was in
the Army.
Rates of mental health symptoms are reported in Table 2. The
results indicate that three of the participants reported
clinically significant levels of distress at the time of the
interview, as measured on both the PHQ-9 and PCL-C.
In addition, two further participants’ scores approached
the cut-offs that defined case criteria on both of the
measures. One of the inclusion criteria for the study was
that participants had a diagnosis of PTSD or major
depression. The observed variation in rates of distress
may be indicative of participants being at different stages
of treatment at the time the interviews were conducted.
Table 1. Socio-demographic characteristics of the sample
Participant Sex Age Relationship status Children Nationality Ethnicity Service Rank (officer or in ranks) Years in military
P1 Male 42 Divorced Yes British White Army Officer 23
P2 Male 51 Married Yes British White Navy Officer 31
P3 Male 34 Married Yes British White Navy Officer 14
P4 Male 30 Married Yes British White Navy Ranks 11
P5 Female 27 Partner No British White Navy Ranks 10
P6 Female 22 Partner No British White Army Ranks 4
P7 Male 31 Married Yes British White Army Ranks 4
P8 Male 35 Married Yes New Zealand White Army Ranks 6
Exploring positive pathways to care for members of the UK
Citation: European Journal of Psychotraumatology 2014, 5: 21759 – http://dx.doi.org/10.3402/ejpt.v5.21759 3
(page number not for citation purpose)
http://www.eurojnlofpsychotraumatol.net/index.php/ejpt/article/view/21759
http://dx.doi.org/10.3402/ejpt.v5.21759
Results of qualitative analysis
Five super-ordinate themes emerged from the data. Each
of these super-ordinate themes contained a number of
sub-themes; these are presented in Table 3.
Theme one: recognising something
was wrong
A theme that emerged was that participants perceived it
had been difficult for them to recognise they were
experiencing mental health difficulties. This appeared to
result in participants ignoring early warning signs of
mental health difficulties and trying to carry on until it
was impossible for them to do so any longer.
Reaching a crisis point. The participants perceived
having reached a ‘‘crisis point’’ which meant they could
not ignore the mental health difficulties they were
experiencing any longer. What constituted a crisis point
differed between participants and was related to factors
in their environments.
P7: I can remember just being in such a state, I
mean, I was seriously disturbed, so there was so
many things that I felt, panic, terror, depression. I’d
be, go and find a quiet spot and just break down
and cry.
Difficulties experienced as physical symptoms. The participants recalled that they first experienced physical rather than psychological symptoms.
P1: So lots of things came together at that time.
My body was clearly screaming at me, I mean
there were lots, all through the years actually I had
lots and lots of not fully explained medical pro-
blems, which we now think were directly related to
PTSD.
Theme two: overcoming internal stigma
One of the super-ordinate themes that emerged from the transcripts related to how individuals perceived overcoming internal stigma attached to experiencing mental health difficulties. Broadly, this fell into two areas: overcoming feelings of shame about experiencing mental health difficulties and the effect on self-esteem of being prescribed psychiatric medication.
Shame. Participants spoke about feeling concerned that they would experience stigma, in particular being perceived as ‘‘weak’’ by their peers. However, it appeared that for the majority these fears were not realised; rather, it was internal stigma they were experiencing.
Interviewer: So it sounds like you maybe had some
of those fears about stigma but they weren’t
realised.
P1: But actually they didn’t, they weren’t real, they
didn’t, it’s not manifested itself. I think people are
much more aware now of it. I think the problem was
with me rather than with everybody else, it was the
anticipation of stigma, maybe that says more about
me than other
people.
Table 2. PHQ-9 and PCL-C scores for sample
Participant PHQ-9 score1 Met criteria for PHQ-9 case PCL-C score2 Met criteria for PCL-C case
P1 13 No 41 No
P2 4 No 8 No
P3 0 No 8 No
P4 23 Major depression 80 Yes
P5 4 No 28 No
P6 12 No 40 No
P7 21 Major depression 71 Yes
P8 17 Moderate to severe depression 63 Yes
1PHQ-9 scored from 0 to 27: scores of 15–19 indicate moderate to severe depression and a score of 20 or above indicates major depression.
2PCL-C scored from 17 to 85: scores above 50 indicate meeting criteria for post-traumatic stress reactions.
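The cut-offs in the table notes can be expressed as a small classification helper (a sketch based only on the thresholds stated in the notes above; the PHQ-9's full set of severity bands is not reproduced here):

```python
def phq9_band(score: int) -> str:
    """Classify a PHQ-9 total (0-27) using the cut-offs in the table note."""
    if score >= 20:
        return "major depression"
    if 15 <= score <= 19:
        return "moderate to severe depression"
    return "below the cut-offs described here"

def pcl_c_case(score: int) -> bool:
    """PCL-C total (17-85): scores above 50 meet criteria for
    post-traumatic stress reactions."""
    return score > 50

# Example: P4's scores from Table 2.
print(phq9_band(23))   # major depression
print(pcl_c_case(80))  # True
```

Applied to Table 2, these rules reproduce the case classifications shown for P4, P7, and P8.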
Table 3. Master list of super-ordinate and sub-themes
Recognising something was wrong: reaching a crisis point; difficulties experienced as physical symptoms
Overcoming internal stigma: shame; stigma related to psychiatric medication
Finding an explanation: trusted witness to difficulties; psychological explanation; getting a diagnosis
Not being alone: normalisation; safe space; sense of hope; acceptance; understanding
Control: autonomy; communication
Dominic Murphy et al.
Stigma related to psychiatric medication. Participants highlighted the link between being offered medication and internal stigma related to suffering from a mental health difficulty. They discussed their ambivalence towards medication: on the one hand believing that medication might help them, but on the other describing how taking medication meant there was something wrong with them. Medication seemed to be symbolic of having a mental illness that could no longer be ignored.
P5: I kept saying, ‘‘I’m not going on medication’’
but I knew I had to, I knew I needed to in the end.
My mum, she’s always been on antidepressants and
I thought, I always said I’d never, ever wanna be like
that.
Theme three: finding an explanation
Participants highlighted the importance of being able to find an explanation for their difficulties. Understanding and accepting that their difficulties had a psychological component supported participants to seek help. How participants came to find this explanation differed greatly.
Trusted witness to difficulties. Participants perceived the importance of having a trusted witness to their difficulties who could point out something was seriously wrong. This supported participants to accept that their difficulties were serious and that they needed to seek help.
P6: Yeah the first time round, I’ve got a very close
friend in the Paras, he’s a Liaison Officer. He
noticed that I was very down and I spoke differently,
very slowly and I just wasn’t really interested in what
he was saying and that’s not really me. I’m quite an
enthusiastic outgoing person and I changed quite a
lot the first time.
Psychological explanation. Participants described how
beneficial it was to be given a psychological explanation
for their difficulties. This may have been because it helped
them realise that their difficulties had a reason or a
function.
P2: Yeah, so I have to, like when I do anything I
have to sort of, I have to understand the mechanics
of it, so I asked the psychiatrist how does this
actually work? But if I understand the process is
find it really helpful.
Getting a diagnosis. Participants spoke about how
receiving a diagnosis was a crucial step for them in their
journey to seek help because it put a label on the
difficulties that they were experiencing.
P8: I think I was only officially told that, you know,
I think they said I had chronic PTSD and yeah it
was my nurse that told me and I don’t know and
then she told me, you know, she explained ‘‘These
symptoms that you’re having . . .’’ And obviously
there was quite a few ‘‘Is all the signs.’’
P8: I was like ‘‘Jesus it must be that.’’ Then, I don’t
know it just made me really interested, I really
wanted, cause I knew what it was then and I was like
‘‘Right I can fix myself here surely.’’
Theme four: not being alone
Another theme that emerged related to factors that, by stopping participants from feeling alone, supported them to seek, or continue, treatment for the difficulties they were experiencing.
Normalisation. Participants spoke about the positive
experience of learning that the difficulties they were
experiencing were similar to those experienced by other
people.
P4: But it’s just looking into it, because when you
look into it you realise, hang on, they’re talking
about people going through this, this, this and this,
but that’s the same as me, so you start thinking, well
I’m not the only person here.
Safe space. What appeared common across the transcripts was that having a safe space allowed participants the opportunity to take a step back and realise something was wrong; this then provided them with the motivation to seek help.
P4: I was sick on shore for two weeks. During that
time it gave me time to actually rest in a secure
environment because I was at home, I had my family
around me. It was a secure environment. I didn’t
have to look over my shoulder. And it gave me a lot
of thinking time. I talked things through with my
wife and thought, something’s wrong here.
Sense of hope. Hope that things could improve was a theme that emerged in seven of the transcripts. Most of the participants recalled that hope was connected to feeling that treatment was available to help them overcome their difficulties.
P1: There was part of me that was relieved, but
there’s always part of me that, nobody’s harder on
me than I am and, but there was also huge relief. It
was, I realised that finally we may be able to do
something about this.
Acceptance. Participants spoke about the fear of not
being accepted by significant people in their lives because
of their mental health difficulties. However, it seems that
often these fears were based on internal beliefs and not
realised.
P5: I don’t even know why I was worried because I
know that they wouldn’t have ever judged me but at
the time that’s how I was feeling that they were
gonna judge me.
Understanding. Participants talked about how important it had been for them that other people understood the mental health difficulties they were experiencing. Participants spoke about how this had helped them not feel alone, as they could share their experiences with someone who understood them.
P3: If I needed to talk to somebody about it there
was always somebody that was there to talk about
it. My wife really wanted to know, she’d phone me
after every session to see how it had gone. And
there’s a lot to take away from my sessions to share
with her. And so it’s a journey we’ve been through
together.
Theme five: control
Participants perceived that their mental health difficulties
had made them feel as if they were subject to an external
locus of control. In contrast, many of the participants
spoke of how helpful it had been for them when engaging
in help-seeking behaviour to feel an internal locus of
control about their treatment options.
Autonomy. Crucial to having a sense of control was having autonomy over their treatment plans. P3 explained how he felt supported by his line manager because they handed him control. This may be a very different experience compared to other aspects of military life, where service personnel typically have less control over their day-to-day tasks.
P3: it was a case of, well what do you want rather
than them finding me something to do, what do you
want to do? So I was lucky in that respect.
Communication. Interviewed participants were worried
about how they might be viewed by their friends or
colleagues. They had mixed views about whether it was
better to share their experiences or not.
P1 talked about how it had been a useful process for
him to share his experiences with his line manager.
P1: Yeah, and once the PTSD thing had been
diagnosed, actually I was given a printout of the
initial session. And actually what I found the best
way was actually I showed it to my boss, I said this
is medically in confidence, but I said I want, I can’t
really explain it but read this, and he read that bit,
and from then on they couldn’t do enough, it was
just.
In contrast, other participants decided that it would
not be helpful to tell their colleagues.
P2: Not many people knew about it because I just
walked out of this meeting and I went for a beer
with an air force guy, a mate, and he just said take
some time off, and that’s what I did. And of course
they didn’t know that I then went and sought help.
So there wasn’t some sort of big showdown, which
you then had to confront going back to work.
Discussion
The study explored which factors enabled serving members of the UK Armed Forces experiencing mental health difficulties to access care, and how they overcame common barriers to do so. To the best of the authors’ knowledge, this approach to looking at stigma and barriers to care has not been undertaken before with the UK military.
We found that all of the participants spoke about
having to reach a crisis point before they sought help.
What was common between the crises was individuals
reaching the point where ‘‘something had to be done’’;
that is to say that the individual could not continue living
their life as they were. Many of the participants spoke
about a military culture that promotes the value of
‘‘cracking on despite a problem.’’ Whilst this may be
advantageous in many aspects of military life, the
participants spoke about how it led them to experience
very serious difficulties before they would accept that
they had a problem.
The majority of participants spoke about the presence of physical symptoms prior to psychological symptoms. It appears that participants expressed their psychological distress through somatic symptoms. It has previously been observed in military populations that physical health difficulties are viewed as more acceptable than mental health ones and that personnel are more likely to attend appointments for the former rather than the latter (Rona, Jones, French, Hooper, & Wessely, 2004). This finding is mirrored when looking between cultures that have different explanations for mental illness, which can lead to either the somatic or psychological expression of symptoms. For example, Chinese people have been observed to be more likely to express symptoms of depression somatically than North Americans (Ryder et al., 2008).
Overcoming feelings of shame about experiencing mental illness was a common theme reported by participants. Many of the participants linked accessing mental health services to their feelings of shame because this meant they had a ‘‘problem.’’ In addition, accessing services meant that their peers would also know that they had a ‘‘problem.’’ These two processes map on to Corrigan’s theory of internal and external stigma (Corrigan, 2004). Participants spoke about how, over the course of engaging with services, they were able to overcome their internal stigma beliefs. For many, this process was related to realising that their negative beliefs about mental illness conflicted with the positive changes in their lives they witnessed due to seeking help. Similarly, what seemed to help the participants overcome their external stigma beliefs was the realisation that their fears of rejection from their peers were not actualised.
Three key factors that facilitated participants to engage in help-seeking behaviour emerged. The first of these was
being supported to develop an internal locus of control (Hiroto, 1974). Developing an internal locus of control contrasted with how the participants described their lives prior to seeking help, which for the majority consisted of feeling as if there was an external locus of control. A relationship between an external locus of control and anxiety and depression has been documented by other researchers (Vuger-Kovačić, Gregurek, Kovačić, Vuger, & Kalenić, 2007). Furthermore, lower levels of anxiety and depression have been observed in individuals who report an internal, rather than an external, locus of control (Jaswel & Dewan, 1997).
The second theme that participants reported as having facilitated their access to services was gaining a psychological understanding of their mental illness. This is supported by previous literature within civilian populations, which observed that having a psychological understanding predicted help-seeking behaviour (Deane, Skogstad, & Williams, 1999). Whilst the mechanisms for this relationship are unknown, from the current study it can be hypothesised that a psychological explanation was more culturally acceptable for members of the armed forces than a biological explanation, which is associated with more stigma. Indeed, many of the participants spoke about how gaining a psychological explanation helped allay their concerns about being ‘‘mad’’ and having something ‘‘wrong with them.’’
Being well supported by their social networks was the final theme described by participants as having facilitated them to access mental health services. This finding is supported by previous research within civilian populations, which documented that individuals with mental illness who reported better social support were more likely to engage in help-seeking behaviours (Briones et al., 1990).
There are a number of limitations to this study. When interpreting these results, it is important to acknowledge that there may have been a bias towards recruiting participants with lower levels of psychological distress. There was some evidence to support this in the scores reported on the measures of psychological distress. This needs to be interpreted carefully, as there may have been a bias for therapists to exclude potential clients if they deemed them to be suffering from high levels of psychological distress, or to suggest only potential participants who they deemed had shown significant improvement. Alternatively, it could have been that only participants who had benefitted from treatment were put forward, in which case their positive experience of treatment may have influenced their recall of the factors that helped them engage in treatment by framing this decision in a potentially more positive light. It is regrettable that the authors do not have access to information on stage of treatment, which may have allowed for further exploration of this. Whilst there are good clinical reasons for making these decisions, they could present limitations to the findings of the current study because the individuals identified as most at risk of not being able to access services are those with higher levels of psychological distress (Iversen et al., 2005a).
Conclusions
The results of this study suggest that there are three key areas that support individuals to seek help. The first of these comprised factors that helped individuals recognise that they were experiencing difficulties and realise that these difficulties had a psychological component.
The second were factors that helped an individual feel they were no longer alone in dealing with their difficulties. For example, this included feeling accepted and supported by their social network. The final area that supported individuals to seek help was feeling
empowered to do so by having an internal locus of
control. In PTSD, feelings of helplessness and powerlessness are extremely debilitating. Clinically, factors that promote an internal locus of control are very important for reducing these feelings. The participants spoke about how factors that promoted an internal locus of control helped them overcome feelings of internal stigma. It is interesting to reflect that the factors that promoted an internal locus of control could also have acted to reduce the distress caused by symptoms of PTSD by helping to tackle feelings of helplessness, isolation and powerlessness. Understanding the relevance of these three factors should help military commanders to plan effective stigma-reduction programmes.
Conflict of interest and funding
There is no conflict of interest in the present study for any
of the authors.
References
Blanchard, E. B., Jones-Alexander, J., Buckley, T. C., & Forneris, C. A. (1996). Psychometric properties of the PTSD Checklist (PCL). Behaviour Research & Therapy, 34, 669–673.
Briones, D. F., Heller, P. L., Chalfant, H. P., Roberts, A. E., Guirre-Hauchbaum, S. F., & Farr, J. (1990). Socioeconomic status, ethnicity, psychological distress, and readiness to utilize a mental health facility. American Journal of Psychiatry, 147(10), 1333–1340.
Britt, T. W., Wright, K. M., & Moore, D. (2012). Leadership as a predictor of stigma and practical barriers toward receiving mental health treatment: A multilevel approach. Psychological Services, 9, 26–37.
Corrigan, P. W. (2004). How stigma interferes with mental health care. American Psychologist, 59, 614–625.
Deane, F. P., Skogstad, P., & Williams, M. W. (1999). Impact of attitudes, ethnicity and quality of prior therapy on New Zealand male prisoners’ intentions to seek professional psychological help. International Journal for the Advancement of Counselling, 21, 55–67.
Defence Analytical Services Agency. (2011). Rates of mental health disorders in UK armed forces (2007–2011). London: Ministry of Defence.
Fear, N. T., Jones, M., Murphy, D., Hull, L., Iversen, A., Coker, B., et al. (2010). What are the consequences of deployment to Iraq and Afghanistan on the mental health of the UK armed forces? A cohort study. Lancet, 375, 1783–1797.
Gould, M., Adler, A., Zamorski, M., Castro, C., Hanily, N., Steele, N., et al. (2010). Do stigma and other perceived barriers to mental health care differ across Armed Forces? Journal of the Royal Society of Medicine, 103, 148–156.
Hiroto, D. S. (1974). Learned helplessness and the locus of control. Journal of Experimental Psychology, 102, 187–193.
Hoge, C. W., Castro, C. A., Messer, S. C., McGurk, D., Cotting, D. I., & Koffman, R. L. (2004). Combat duty in Iraq and Afghanistan, mental health problems, and barriers to care. The New England Journal of Medicine, 351, 13–22.
Hotopf, M., Hull, L., Fear, N. T., Browne, T., Horn, O., Iversen, A., et al. (2006). The health of UK military personnel who deployed to the 2003 Iraq war: A cohort study. Lancet, 367, 1731–1741.
Iversen, A., Dyson, C., Smith, N., Greenberg, N., Walwyn, R., Unwin, C., et al. (2005a). ‘‘Goodbye and good luck’’: The mental health needs and treatment experiences of British ex-service personnel. British Journal of Psychiatry, 186, 480–486.
Iversen, A., Nikolaou, V., Greenberg, N., Unwin, C., Hull, L., Hotopf, M., et al. (2005b). What happens to British veterans when they leave the armed forces? European Journal of Public Health, 15, 175–184.
Iversen, A. C., Van, S. L., Hughes, J. H., Browne, T., Greenberg, N., Hotopf, M., et al. (2010). Help-seeking and receipt of treatment among UK service personnel. British Journal of Psychiatry, 197, 149–155.
Iversen, A. C., Van, S. L., Hughes, J. H., Browne, T., Hull, L., Hall, J., et al. (2009). The prevalence of common mental disorders and PTSD in the UK military: Using data from a clinical interview-based study. BMC Psychiatry, 9, 68.
Iversen, A. C., Van, S. L., Hughes, J. H., Greenberg, N., Hotopf, M., Rona, R. J., et al. (2011). The stigma of mental health problems and other barriers to care in the UK Armed Forces. BMC Health Services Research, 11, 31.
Jaswel, S., & Dewan, A. (1997). The relationship between locus of control and depression. Journal of Personality and Clinical Studies, 13, 25–27.
Kim, P., Thomas, J., Wilk, J., Castro, C., & Hoge, C. (2010). Stigma, barriers to care, and use of mental health services among active duty and National Guard soldiers after combat. Psychiatric Services, 61, 582–588.
Kroenke, K., & Spitzer, R. (2002). The PHQ-9: A new depression diagnostic and severity measure. Psychiatric Annals, 32, 509–515.
Mulligan, K., Fear, N. T., Jones, N., Wessely, S., & Greenberg, N. (2011). Psycho-educational interventions designed to prevent deployment-related psychological ill-health in Armed Forces personnel: A review. Psychological Medicine, 41, 673–686.
Riddle, J. R., Smith, T. C., Smith, B., Corbeil, T. E., Engel, C. C., Wells, T. S., et al. (2007). Millennium cohort: The 2001–2003 baseline prevalence of mental disorders in the U.S. military. Journal of Clinical Epidemiology, 60, 192–201.
Rona, R. J., Jones, M., French, C., Hooper, R., & Wessely, S. (2004). Screening for physical and psychological illness in the British Armed Forces: I: The acceptability of the programme. Journal of Medical Screening, 11, 148–152.
Ryder, A. G., Xiongzhao, Z., Yang, J., Shuqiao, Y., Jinyao, Y., Heine, S. J., et al. (2008). The cultural shaping of depression: Somatic symptoms in China, psychological symptoms in North America? Journal of Abnormal Psychology, 117, 300–313.
Smith, J. A., Flowers, P., & Larkin, M. (2009). Interpretative phenomenological analysis: Theory, method and research. London: Sage.
Smith, J. A., & Osborn, M. (2008). Interpretative phenomenological analysis. In J. A. Smith (Ed.), Qualitative psychology: A practical guide to research methods (pp. 53–80). London: Sage.
Smith, T. C., Ryan, M. A. K., Wingard, D. L., Slymen, D. J., Sallis, J. F., & Kritz-Silverstein, D. (2008). New onset and persistent symptoms of post-traumatic stress disorder self reported after deployment and combat exposures: Prospective population based US military cohort study. British Medical Journal, 336, 366–371.
Vasterling, J. J., Proctor, S. P., Amoroso, P., Kane, R., Heeren, T., & White, R. F. (2006). Neuropsychological outcomes of army personnel following deployment to the Iraq War. JAMA, 296, 519–529.
Vuger-Kovačić, D., Gregurek, R., Kovačić, D., Vuger, T., & Kalenić, B. (2007). Relation between anxiety, depression and locus of control of patients with multiple sclerosis. Multiple Sclerosis, 13, 1065–1067.
Weathers, F. W., & Ford, J. (1996). Psychometric review of the PTSD checklist. In B. H. Stamm (Ed.), Measurement of stress, trauma and adaptation (pp. 250–251). Lutherville: Sidran Press.
Willig, C. (2008). Introducing qualitative research in psychology. Milton Keynes: Open University Press.
DOCUMENT RESUME
ED 448205 TM 032235
AUTHOR Onwuegbuzie, Anthony J.
TITLE Expanding the Framework of Internal and External Validity in Quantitative Research.
PUB DATE 2000-11-21
NOTE 62p.; Paper presented at the Annual Meeting of the Association for the Advancement of Educational Research (AAER) (Ponte Vedra, FL, November 2000).
PUB TYPE Opinion Papers (120) -- Reports - Descriptive (141) -- Speeches/Meeting Papers (150)
EDRS PRICE MF01/PC03 Plus Postage.
DESCRIPTORS Models; *Qualitative Research; Research Design; *Validity
ABSTRACT
An experiment is deemed to be valid, inasmuch as valid
cause-effect relationships are established, if the results are due only to
the manipulated independent variable (possess internal validity) and are
generalizable to groups, environments, and contexts outside of the
experimental settings (possess external validity). Consequently, all
experimental studies should be assessed for internal and external
validity.
. Undoubtedly, the seminal work of Donald Campbell and Julian Stanley provides
the most authoritative source regarding threats to internal and external
validity. Since their conceptualization, many researchers have argued that
these threats to internal and external validity not only should be examined
for experimental designs but are also pertinent for other quantitative
research designs. Unfortunately, with respect to nonexperimental quantitative
research designs, it appears that Campbell and Stanley’s sources of internal
and external validity do not represent the realm of pertinent threats to the
validity of studies. The purpose of this paper is to provide a rationale for
assessing threats to internal validity and external validity in all
quantitative research studies, regardless of the research design. In
addition, a more comprehensive framework of dimensions and subdimensions of
internal and external validity is presented than has been undertaken
previously. Different ways of expanding the discussion about threats to
internal and external validity are presented. (Contains 1 figure and 58
references.) (Author/SLD)
Reproductions supplied by EDRS are the best that can be made
from the original document.
Framework for Internal and External Validity 1
Running head: FRAMEWORK FOR INTERNAL AND EXTERNAL VALIDITY
Expanding the Framework of Internal and External Validity in Quantitative Research
Anthony J. Onwuegbuzie
Valdosta State University
Paper presented at the annual meeting of the Association for the Advancement of Educational Research (AAER), Ponte Vedra, Florida, November 21, 2000.
Abstract
An experiment is deemed to be valid, inasmuch as valid cause-effect relationships
are established, if results obtained are due only to the manipulated independent variable
(i.e., possess internal validity) and are generalizable to groups, environments, and contexts
outside of the experimental settings (i.e., possess external validity). Consequently, all
experimental studies should be assessed for internal and external validity. Undoubtedly
the seminal work of Donald Campbell and Julian Stanley provides the most authoritative
source regarding threats to internal and external validity. Since their conceptualization,
many researchers have argued that these threats to internal and external validity not only
should be examined for experimental designs, but are also pertinent for other quantitative
research designs. Unfortunately, with respect to non-experimental quantitative research
designs, it appears that Campbell and Stanley’s sources of internal and external validity
do not represent the realm of pertinent threats to the validity of studies.
Thus, the purpose of the present paper is to provide a rationale for assessing
threats to internal validity and external validity in all quantitative research studies,
regardless of the research design. Additionally, a more comprehensive framework of
dimensions and sub-dimensions of internal and external validity is presented than has
been undertaken previously. Finally, different ways of expanding the discussion about
threats to internal and external validity are presented.
Expanding the Framework of Internal and External Validity in Quantitative Research
Recently, the Committee on Professional Ethics of the American Statistical
Association (ASA) addressed the following eight general topic areas relating to ethical
guidelines for statistical practice: (a) professionalism; (b) responsibilities for funders,
clients, and employers; (c) responsibilities in publications and testimony; (d) responsibilities
to research subjects; (e) responsibilities to research team colleagues; (f) responsibilities
to other statisticians or statistical practitioners; (g) responsibilities regarding allegations of
misconduct; and (h) responsibilities of employers, including organizations, individuals,
attorneys, or other clients utilizing statistical practitioners. With respect to responsibilities
in publications and testimony, the Committee stated the following:
(6) Account for all data considered in a study and explain sample(s) actually used.
(7) Report the sources and assessed adequacy of the data.
(8) Clearly and fully report the steps taken to guard validity.
(9) Where appropriate, address potential confounding variables not included in the
study. (ASA, 1999, p. 4)
Although the ASA Committee on Professional Ethics did not directly refer to these
concepts, it would appear that these recommendations are related to internal and external
validity.
At the same time that the ASA Committee was presenting its guidelines, the American
Psychological Association (APA) Board of Scientific Affairs, which had convened a committee
called the Task Force on Statistical Inference, was providing recommendations for the use
of statistical methods (Wilkinson & the Task Force on Statistical Inference, 1999). Useful
recommendations were furnished by the Task Force in the areas of design, population,
sample, assignment (i.e., random assignment and nonrandom assignment), measurement
(i.e., variables, instruments, procedure, and power and sample size), results
(complications), analysis (i.e., choosing a minimally sufficient analysis, computer programs,
assumptions, hypothesis tests, effect sizes, interval estimates, multiplicities, causality,
tables and figures), and discussion (i.e., interpretation and conclusions).
Although the APA Task Force stated that “This report is concerned with the use of
statistical methods only and is not meant as an assessment of research methods in
general” (Wilkinson & the Task Force on Statistical Inference, 1999, p. 2), it is somewhat
surprising that internal and external validity were mentioned directly only once. Specifically,
when discussing the reporting of instruments, the task force declared:
There are many methods for constructing instruments and psychometrically
validating scores from such measures. Traditional true-score theory and item-
response test theory provide appropriate frameworks for assessing reliability and
internal validity. Signal detection theory and various coefficients of association can
be used to assess external validity. [emphasis added] (p. 5)
The APA Task Force also stated (a) “In the absence of randomization, we should do our
best to investigate sensitivity to various untestable assumptions” (p. 4); (b) “Describe any
anticipated sources of attrition due to noncompliance, dropout, death, or other factors” (p.
6); (c) “Describe the specific methods used to deal with experimenter bias, especially if you
collected the data yourself” (p. 4); (d) “When you interpret effects, think of credibility,
generalizability, and robustness” (p. 16); (e) “Are the design and analytic methods robust
enough to support strong conclusions?” (p. 16); and (f) “Remember, however, that
acknowledging limitations is for the purpose of qualifying results and avoiding pitfalls in
future research” (p. 16). It could be argued that these six statements pertain to validity.
However, the fact that internal and external validity were not directly mentioned by the ASA
Committee on Professional Ethics, as well as the fact that these concepts were mentioned
only once by the APA Task Force and were not directly referenced in the “Discussion”
section of the report, is a cause for concern, bearing in mind that the issue of internal and
external validity not only is regarded by instructors of research methodology, statistics, and
measurement as being the most important in their fields, but that it also receives the most
extensive coverage in their classes (Mundfrom, Shaw, Thomas, Young, & Moore, 1998).
In experimental research, the researcher manipulates at least one independent
variable (i.e., the hypothesized cause), attempts to control potentially extraneous (i.e.,
confounding) variables, and then measures the effect(s) on one or more dependent
variables. According to quantitative research methodologists, experimental research is the
only type of research in which hypotheses concerning cause-and-effect relationships can
be validly tested. As such, proponents of experimental research believe that this design
represents the apex of research. An experiment is deemed to be valid, inasmuch as valid
cause-effect relationships are established, if results obtained are due only to the
manipulated independent variable (i.e., possess internal validity) and are generalizable to
groups, environments, and contexts outside of the experimental settings (i.e., possess
external validity). Consequently, according to this conceptualization, all experimental
studies should be assessed for internal and external validity.
A definition of internal validity and external validity can be found in any standard
research methodology textbook. For example, Gay and Airasian (2000, p. 345) describe
internal validity as “the condition that observed differences on the dependent variable are
a direct result of the independent variable, not some other variable.” As such, internal
validity is threatened when plausible rival explanations cannot be eliminated. Johnson and
Christensen (2000, p. 200) define external validity as “the extent to which the results of a
study can be generalized to and across populations, settings, and times.” Even if a
particular finding has high internal validity, this does not mean that it can be generalized
outside the study context.
Undoubtedly the seminal works of Donald Campbell and Julian Stanley (Campbell,
1957; Campbell & Stanley, 1963) provide the most authoritative source regarding threats
to internal and external validity. Campbell and Stanley identified the following eight threats
to internal validity: history, maturation, testing, instrumentation, statistical regression,
differential selection of participants, mortality, and interaction effects (e.g., selection-
maturation interaction) (Gay & Airasian, 2000). Additionally, building on the work of
Campbell and Stanley, Smith and Glass (1987) classified threats to external validity into
the following three areas: population validity (i.e., selection-treatment interaction),
ecological validity (i.e., experimenter effects, multiple-treatment interference, reactive
arrangements, time and treatment interaction, history and treatment interaction), and
external validity of operations (i.e., specificity of variables, pretest sensitization).
Although experimental research designs are utilized frequently in the physical
sciences, this type of design is not as commonly used in social science research in general
and educational research in particular due to the focus on the social world as opposed to
the physical world. Nevertheless, since Campbell and Stanley’s conceptualization, some
researchers (e.g., Huck & Sandler, 1979; McMillan, 2000) have argued that threats to
internal and external validity not only should be evaluated for experimental designs, but are
also pertinent for other types of quantitative research (e.g., descriptive, correlational,
causal-comparative, quasi-experimental). Unfortunately, with respect to non-experimental
quantitative research designs, it appears that the above sources of internal and external
validity do not represent the full range of pertinent threats to the validity of studies.
Thus, the purpose of the present paper is to provide a rationale for assessing
threats to internal validity and external validity in all quantitative research studies,
regardless of the research design. After providing this rationale, the discussion will focus
on providing additional sources of internal and external validity. In particular, a more
comprehensive framework of dimensions and sub-dimensions of internal and external
validity will be presented than has been undertaken previously. Brief heuristic examples
will be given for each of these new dimensions and sub-dimensions. Finally, different ways
of expanding the discussion about threats to internal and external validity will be presented.
UTILITY OF DELINEATING THREATS TO INTERNAL AND EXTERNAL VALIDITY
Despite the recommendations of the ASA Committee on Professional Ethics (ASA,
1999) and the APA Task Force (Wilkinson & the Task Force on Statistical Inference, 1999),
few researchers provide a commentary on threats to internal and external validity
in the discussion section of their articles. Onwuegbuzie (2000a) reviewed the prevalence
of discussion of threats to internal and external validity in empirical research reports
published in several reputable journals over the last few years, including the American
Educational Research Journal (AERJ)–a flagship journal. With respect to the AERJ,
Onwuegbuzie found that although 5 (31.3%) of the 16 quantitative-based research articles
published in 1998 contained a general statement in the discussion section that the findings
had limited generalizability, only 1 study utilized the term “external validity.” The picture
regarding internal validity was even more disturbing, with none of the 16 articles published
that year containing a discussion of any threats to internal validity. Moreover, in almost all
of these investigations, implications of the findings were discussed as if no rival hypotheses
existed. In many instances, this may give the impression that confirmation bias took place,
in which theory confirmation was utilized instead of theory testing (Greenwald, Pratkanis,
Leippe, & Baumgardner, 1986).
As stated by Onwuegbuzie (2000a), authors’ general failure to discuss threats to
validity likely stems from a fear that to do so would expose any weaknesses in their
research, which, in turn, might lead to their manuscripts being rejected by journal
reviewers. Yet, it is clear that every single study in the field of education has threats to
internal and external validity. For example, instrumentation can never be fully eliminated
as a potential threat to internal validity because outcome measures can never yield scores
that are perfectly reliable or valid. Thus, whether or not instrumentation is acknowledged
in a research report does not prevent it from being a validity threat. With respect to
external validity, all samples, whether random or non-random, are subject to sampling error.
Thus, threats to population and ecological validity arise in virtually all
educational studies.
The fact that the majority of empirical investigations do not contain a discussion of
threats to internal and external validity also probably stems from a misperception on the
part of some researchers that such threats are only relevant in experimental studies. For
other researchers, failure to mention sources of invalidity may arise from an
uncompromising positivistic stance. As noted by Onwuegbuzie (2000b), pure positivists
contend that statistical techniques are objective; however, they overlook many subjective
decisions that are made throughout the research process (e.g., using a 5% level of
significance). Further, the lack of random sampling prevalent in educational research,
which limits generalizability, as well as the fact that a variable explaining as little as 2% of
the variance of an outcome measure can be considered non-trivial, makes it clear that all
empirical research in the field of education is subject to considerable error. This should
prevent researchers from being as adamant about the existence of positivism in the social
sciences as in the physical sciences (Onwuegbuzie, 2000b).
Moreover, discussing threats to internal and external validity has at least three
advantages. First and foremost, providing information about sources of invalidity allows the
reader to place the researchers’ findings in their proper context. Indeed, failure to discuss
the limitations of a study may provide the reader with the false impression that no external
replications are needed. Yet, replications are the essence of research (Onwuegbuzie &
Daniel, 2000; Thompson, 1994a). Second, identifying threats to internal and external
validity helps to provide directions for future research. That is, replication studies can be
designed to minimize one or more of these validity threats identified by the researcher(s).
Third, once discussion of internal and external validity becomes commonplace in
research reports, validity meta analyses could be conducted to determine the most
prevalent threats to internal and external validity for a given research hypothesis. These
validity meta analyses would provide an effective supplement to traditional meta analyses.
In fact, validity meta analyses could lead to thematic effect sizes being computed for the
percentage of occasions in which a particular threat to internal or external validity is
identified in replication studies (Onwuegbuzie, 2000c). For example, a narrative that
combines traditional meta analyses and validity meta analyses could take the following
form:
Across studies, students who received Intervention A performed on standardized
achievement tests, on average, nearly two-thirds of a standard deviation (Cohen’s
(1988) Mean d = .65) higher than did those who received Intervention B. This
represents a moderate-to-large effect. However, these findings are tempered by the
fact that in these investigations, several threats to internal validity were noted.
Specifically, across these studies, statistical regression was the most frequently
identified threat to internal validity (prevalence rate/effect size = 33%), followed by
mortality (effect size = 22%). With respect to external validity, population validity
was the most frequently cited threat (effect size = 42%), followed by reactive
arrangements (effect size = 15%)….
Such validity meta analyses would help to promote the use of external replications and to
minimize the view held by some researchers that a single carefully-designed study could
serve as a panacea for solving educational problems (Onwuegbuzie, 2000c).
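The bookkeeping behind such a combined narrative can be sketched in a few lines of Python. The study records below are invented solely for illustration: each hypothetical replication study is coded with its effect size (Cohen's d) and the validity threats its authors identified, and the "thematic effect size" for a threat is simply the percentage of studies flagging it.

```python
from statistics import mean

# Hypothetical coding of five replication studies; all values are
# illustrative, not drawn from any real meta-analysis.
studies = [
    {"d": 0.70, "threats": {"statistical regression", "mortality"}},
    {"d": 0.55, "threats": {"statistical regression"}},
    {"d": 0.62, "threats": {"statistical regression", "population validity"}},
    {"d": 0.71, "threats": {"mortality", "population validity"}},
    {"d": 0.67, "threats": set()},
]

# Traditional meta-analytic summary: the (unweighted) mean effect size.
mean_d = mean(s["d"] for s in studies)

# Validity meta-analysis: the thematic effect size for each threat is the
# percentage of studies in which that threat was identified.
all_threats = set().union(*(s["threats"] for s in studies))
prevalence = {
    t: 100 * sum(t in s["threats"] for s in studies) / len(studies)
    for t in sorted(all_threats)
}

print(f"Mean d = {mean_d:.2f}")
for threat, pct in prevalence.items():
    print(f"{threat}: {pct:.0f}%")
```

A real validity meta-analysis would of course weight effect sizes by sample size and code threats from published reports; this sketch only shows how prevalence rates and a traditional mean effect size could be reported side by side.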
FRAMEWORK FOR IDENTIFYING THREATS TO INTERNAL AND EXTERNAL VALIDITY
As noted by McMillan (2000), threats to internal and external validity typically are
presented with respect to experimental research designs. Consequently, most authors of
research methodology textbooks tend to present only the original categories of validity
threats conceptualized by Campbell and Stanley (1963). Unfortunately, this framework
does not represent the range of validity threats. Thus, it is clear that in order to promote
the discussion of threats to internal and external validity in all empirical reports, regardless
of research design used, Campbell and Stanley’s framework needs to be expanded.
Without such an expansion, for example, many threats to internal validity that fall outside
the original categories will continue to be labeled simply as history.
Surprisingly, despite Huck and Sandler’s (1979) recommendation that researchers
extend the classic list of eight threats to internal validity identified by Campbell and
Stanley, an extensive review of the literature revealed only two articles representing a
notable attempt to expand Campbell and Stanley’s framework. Specifically, Huck and
Sandler (1979) presented 20 categories of threats to validity, which they termed rival
hypotheses. Unfortunately, using this label gives the impression that threats to internal and
external validity are only pertinent for empirical studies in which hypotheses are tested.
Yet, these threats also are pertinent when descriptive research designs are utilized. For
example, in research in which no inferences are made (e.g., descriptive survey research),
instrumentation typically is a threat to internal validity inasmuch as if the survey instrument
does not lead to valid responses, then descriptive statistics that arise from the survey
responses, however simple, will be invalid.
Thus, Huck and Sandler’s (1979) list, although extremely useful, falls short of providing a
framework that is applicable to all empirical research.
More recently, McMillan (2000) presented a list of 54 threats to validity. Moreover,
McMillan re-named internal validity as internal credibility, which he defined as “rival
explanations to the propositions made on the basis of the data” (p. 2). McMillan further
subdivided his threats to internal credibility into three categories, which he labeled (a)
statistical conclusion, (b) relationship conclusion, and (c) causal conclusion. According to
this theorist, statistical conclusion threats are threats that are statistically based (e.g., small
effect size); relationship conclusion threats are mostly related to correlational and quasi-
experimental research designs; and causal conclusion threats mostly pertain to experimental
research designs. McMillan (2000) also renamed external validity as generalizability in an
attempt to provide a more “conceptually clear and straightforward” definition (p. 3). He
divided the threats to generalizability into two categories: population validity and ecological validity.
In short, McMillan produced a 2 (experimental vs. nonexperimental) x 3 (statistical
conclusion, relationship conclusion, causal conclusion) matrix for internal credibility, and
a 2 (experimental vs. nonexperimental) x 2 (population validity vs. ecological validity) matrix
for generalizability. Perhaps the most useful aspect of McMillan’s re-conceptualization of
Campbell and Stanley’s threats to internal and external validity is the fact that threats were
categorized as falling into either an experimental or non-experimental design. However,
as is the case for Huck and Sandler’s (1979) conceptualization, McMillan’s two matrices
are still not as integrative with respect to quantitative research designs as perhaps they
could be.
Thus, what follows is a re-conceptualization of Campbell and Stanley’s (1963)
threats to internal and external validity, which further builds on the work of Huck and
Sandler (1979) and McMillan (2000). Interestingly, threats to internal validity and external
validity can be renamed as threats to internal replication and external replication,
respectively. An internal replication threat represents the extent to which the results of a
study would re-occur if the study was replicated using exactly the same sample, setting,
context, and time. If the independent variable truly was responsible for changes in the
dependent variable, with no plausible rival hypotheses, then conducting an internal
replication of the study would yield exactly the same results. On the other hand, an
external replication threat refers to the degree that the findings of a study would replicate
across different populations of persons, settings, contexts, and times. If the sample was
truly generalizable, then external replications across different samples would produce the
same findings. However, rather than labeling these threats internal replication and external
replication, as was undertaken by Huck and Sandler (i.e., rival hypotheses) and McMillan
(i.e., internal credibility and generalizability), for the purposes of the present re-
conceptualization, the terms internal validity and external validity were retained. It was
believed that keeping the original labels would reduce the chances of confusion especially
among graduate students and, at the same time, increase the opportunity that this latest
framework will be diffused (Rogers, 1995). Further, the reader will notice that rather than
use the term experimental group, which connotes experimental designs, the term
intervention group has been used, which more accurately reflects the school context,
whereby interventions typically are implemented in a non-randomized manner.
Threats to internal and external validity can be viewed as occurring at one or more
of the three major stages of the inquiry process, namely: research design/data collection,
data analysis, and data interpretation. Unlike the case for qualitative research, in
quantitative research, these stages typically represent three distinct time points in the
research process. Figure 1 presents a concept map of the major dimensions of threats to
internal and external validity at the three major stages of the research process. What
follows is a brief discussion of each of the threats to validity dimensions and their
subdimensions.
Insert Figure 1 about here
Research Design/Data Collection
Threats to Internal Validity
As illustrated in Figure 1, the following 22 threats to internal validity occur at the
research design/data collection stage. These threats include Campbell and Stanley’s
(1963) 8 threats to internal validity, plus an additional 14 threats.
History. This threat to internal validity refers to the occurrence of events or
conditions that are unrelated to the treatment but that occur at some point during the study
to produce changes in the outcome measure. The longer an inquiry lasts, the more likely
that history will pose a threat to validity. History can stem from either internal or extraneous
events. With respect to the latter, suppose that counselors and teachers in a high school
conducted a series of workshops for all students that promoted multiculturalism and diversity.
However, suppose that between the time that the series of workshops ended and the
post-intervention outcome measure (e.g., attitudes toward ethnic integration) was administered,
a racial incident took place in a nearby school that received widespread media coverage.
Such an occurrence could easily reduce the effectiveness of the workshop and thus
threaten internal validity by providing rival explanations of subsequent
findings.
Maturation. Maturation pertains to the processes that operate within a study
participant due, at least in part, to the passage of time. These processes lead to physical,
mental, emotional, and intellectual changes such as aging, boredom, fatigue, motivation,
and learning, that can be incorrectly attributed to the independent variable. Maturation is
particularly a concern for younger study participants, such as Kindergartners.
Testing. Testing, also known as pretesting or pretest sensitization, refers to
changes that may occur in participants’ scores obtained on the second administration or
post-intervention measure as a result of having taken the pre-intervention instrument. In
other words, being administered a pre-intervention instrument may improve scores on the
post-intervention measure regardless of whether any intervention takes place in between.
Testing is more likely to prevail when (a) cognitive measures are utilized that involve the
recall of factual information and (b) the time between administrations is short. When
cognitive tests are administered, a pre-intervention measure may lead to increased scores
on the post-intervention measure because the participants are more familiar with the
testing format and condition, have developed a strategy for increasing performance, are
less anxious about the test on the second occasion, or can remember some of their prior
responses and thus make subsequent adjustments. With attitudes and measures of
personality and other affective variables, being administered a pre-intervention measure
may induce participants subsequently to reflect about the questions and issues raised
during the pre-intervention administration and to supply similar or different responses to
the post-intervention measure as a result of this reflection.
Instrumentation. The instrumentation threat to internal validity occurs when scores
yielded from a measure lack the appropriate level of consistency (i.e., low reliability) or
are not valid, as a result of inadequate content-, criterion-, and/or
construct-related validity. Instrumentation can occur in many ways, including when (a) the
post-intervention measure is not parallel (e.g., different level of difficulty) to the pre-
intervention measure (i.e., the test has low equivalent-forms reliability); (b) the pre-
intervention instrument leads to unstable scores regardless of whether or not an
intervention takes place (i.e., has low test-retest reliability); (c) at least one of the measures
utilized does not generate reliable scores (i.e., low internal-consistency reliability); and (d)
the data are collected through observation, and the observing or scoring is not consistent
from one situation to the next within an observer (i.e., low intra-rater reliability) or is not
consistent among two or more data collectors/analysts (i.e., low inte
r-
rater reliability).
Statistical regression. Statistical regression typically occurs when participants are
selected on the basis of their extremely low or extremely high scores on some pre-
intervention measure. This phenomenon refers to the tendency for extreme scores to
regress, or move toward, the mean on subsequent measures. Interestingly, many
educational researchers study special groups of individuals such as at-risk children with
learning difficulties or disabilities. These special populations usually have been identified
because of their extreme scores on some outcome measure. A researcher often cannot
be certain whether any post-intervention differences observed for these individuals are real
or whether they represent statistical artifacts. According to Campbell and Kenny (1999),
regression toward the mean is an artifact that can be due to extreme group selection,
matching, statistical equating, change scores, time-series studies, and longitudinal studies.
Thus, statistical regression is a common threat to internal validity in educational research.
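The mechanics of regression toward the mean are easy to demonstrate by simulation. In the Python sketch below (illustrative numbers throughout), students selected for extreme pretest scores drift back toward the population mean on a second testing even though no intervention of any kind occurs: the drift is purely an artifact of measurement error.

```python
import random
from statistics import mean

random.seed(42)

# Simulate a population where each student has a stable "true score" and
# each test administration adds independent measurement error.
N = 10_000
true_scores = [random.gauss(50, 10) for _ in range(N)]
pretest = [t + random.gauss(0, 8) for t in true_scores]
posttest = [t + random.gauss(0, 8) for t in true_scores]

# Select the "extreme" group: students in roughly the top decile at pretest.
cutoff = sorted(pretest)[int(0.9 * N)]
extreme = [i for i in range(N) if pretest[i] >= cutoff]

pre_mean = mean(pretest[i] for i in extreme)
post_mean = mean(posttest[i] for i in extreme)

# The extreme group's posttest mean falls back toward the population mean
# of 50 even though no treatment was applied: statistical regression.
print(f"pretest mean of top decile:  {pre_mean:.1f}")
print(f"posttest mean of top decile: {post_mean:.1f}")
```

A researcher comparing such a group's pretest and posttest means could easily mistake this artifactual decline (or, for a low-scoring group, gain) for an intervention effect.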
Differential selection of participants. Differential selection of participants, also known
as selection bias, refers to substantive differences between two or more of the comparison
groups prior to the implementation of the intervention. This threat to internal validity, which
clearly becomes realized at the data collection stage, most often occurs when already-
formed (i.e., non-randomized) groups are compared. Group differences may occur with
respect to cognitive, affective, personality, or demographic variables. Unfortunately, it is
more difficult to conduct controlled, randomized studies in natural educational settings;
as a result, differential selection of participants is a common threat to internal validity. Thus,
investigators always should strive to assess the equivalency of groups by comparing
groups with respect to as many variables as possible. Indeed, such equivalency checks
should be undertaken even when randomization takes place, because although
randomization increases the chances of group equivalency on important variables, it does
not guarantee this equality. That is, regardless of the research design, when groups are
compared, selection bias always exists to some degree. The greater this bias, the greater
the threat to internal validity.
Mortality. Mortality, also known as attrition, refers to the situation in which
participants who have been selected to participate in a research study either fail to take
part at all or do not participate in every phase of the investigation (i.e., drop out of the
study). However, a loss of participants, per se, does not necessarily produce a bias. This
bias occurs when participant attrition leads to differences between the groups that cannot
be attributed to the intervention. Mortality-induced discrepancy among groups often
eventuates when there is a differential loss of participants from the various treatment
conditions, such that an inequity develops or is exacerbated on variables other than the
independent variable. Mortality often is a threat to internal validity when studying at-risk
students who tend to have lower levels of persistence, when volunteers are utilized in an
inquiry, or when a researcher is comparing a new intervention to an existing method.
Because dropping out of a study is often the result of relatively low levels of
motivation, persistence, and the like, greater attrition in the control group may
attenuate any true differences between the control and intervention groups due to the fact
that the control group members who remain are closer to the intervention group members
with respect to these affective variables. Conversely, when a greater attrition rate occurs
in the intervention group, group differences measured at the end of the study period may
be artificially inflated because individuals who remain in the inquiry represent more
motivated or persistent members. Both these scenarios provide rival explanations of
observed findings. In any case, the researcher should never assume that mortality occurs
in a random manner and should, whenever possible, (a) design a study that minimizes the
chances of attrition; (b) compare individuals who withdraw from the investigation to those
who remain, with respect to as many available cognitive, affective, personality, and
demographic variables as possible; and (c) attempt to determine the precise reason for
withdrawal for each person.
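A minimal simulation (hypothetical numbers throughout) shows how differential attrition alone can manufacture a group difference when no true intervention effect exists: if the least motivated members drop out of the intervention group while the control group stays intact, the completers' outcomes are inflated purely by selection.

```python
import random
from statistics import mean

random.seed(7)

def make_group(n=2000):
    """Build a group whose outcome depends on motivation plus noise;
    crucially, outcome does not depend on any treatment."""
    group = []
    for _ in range(n):
        motivation = random.gauss(0, 1)
        outcome = 50 + 5 * motivation + random.gauss(0, 5)
        group.append({"motivation": motivation, "outcome": outcome})
    return group

control, intervention = make_group(), make_group()

# Differential attrition: the least motivated 30% of the intervention
# group withdraws; the control group loses no one.
intervention.sort(key=lambda s: s["motivation"])
completers = intervention[int(0.3 * len(intervention)):]

observed_diff = (mean(s["outcome"] for s in completers)
                 - mean(s["outcome"] for s in control))
print(f"observed group difference: {observed_diff:.1f} points")
```

The apparent advantage for the intervention group is entirely a mortality artifact, which is why comparing withdrawers to remainers on available cognitive, affective, personality, and demographic variables is so important.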
Selection interaction effects. Many of the threats to internal validity presented above
also can interact with the differential selection of participants to produce an effect that
resembles the intervention effect. For example, a selection by mortality threat can occur
if one group has a higher rate of attrition than do the other groups, such that discrepancies
between groups create factors unrelated to the intervention that are greater as a result of
differential attrition than prior to the start of the investigation. Similarly, a selection by
history threat would occur if individuals in the groups experience different history events,
and that these events differentially affected their responses to the intervention. A selection
by maturation interaction would occur when one group has a higher rate of maturation than
do the other groups (even if no pretest differences prevailed), and that this higher rate
accounts for at least a portion of the observed effect. This type of interaction is common
when volunteers are compared to non-volunteers.
Implementation bias. Although not a threat recognized by Campbell and Stanley
(1963), McMillan (2000), or Huck and Sandler (1979), implementation bias is a common
and serious threat to internal validity in many educational intervention studies. Indeed, it
is likely that implementation bias is the most frequent and pervasive threat to internal
validity at the data collection stage in intervention studies. Implementation bias often stems
from differential selection of teachers who apply the innovation to the intervention groups.
In particular, as the number of instructors involved in an instructional innovation increases,
so does the likelihood that at least some of the teachers will not implement the initiative to
its fullest extent. Such lack of adherence to protocol on the part of some teachers might
stem from lack of motivation, time, training, or resources; inadequate knowledge or ability;
poor self-efficacy; implementation anxiety; stubbornness; or poor attitudes. Whatever the
source, implementation bias leads to the protocol designed for the intervention not being
followed in the intended manner (i.e., protocol bias). For example, poor attitudes of some
of the teachers toward an innovation may lead to the intervention protocol being violated,
which then carries over to their students, resulting in attenuated effect sizes. A particularly
common component of the implementation threat is related to time. Many studies assess
an innovation after one year or even less, which often is an insufficient time frame in which
to observe positive gains. Differences in teaching experience between teachers in the
intervention and non-intervention groups are another way in which implementation bias
may pose a threat to internal validity.
Sample augmentation bias. Sample augmentation bias is another threat to internal
validity that does not appear to have been mentioned formally in the literature. This form
of bias, which essentially is the opposite of mortality, prevails when one or more individuals
join the intervention or non-intervention groups after the study has begun. In the school
context, this typically
happens when students (a) move away from a school that is involved in the study, (b)
move to a school involved in the research from a school that was not involved in the
investigation, or (c) move from an intervention school to a non-intervention school. In each
of these cases, not all students receive the intervention for the complete duration of the
study. Thus, sample augmentation bias can either increase or attenuate the effect size.
Behavior bias. Behavior bias, which also has not been presented formally in the
literature, occurs when an individual has a strong personal bias in favor of or against the
intervention prior to the beginning of the study. Such a bias would lead to a protocol bias that threatens
internal validity. Behavior bias is most often a threat when participants are exposed to all
levels of a treatment.
Order bias. When multiple interventions are being compared in a research study,
such that all participants are exposed to and measured under each and every intervention
condition, an order effect can pose a threat to internal validity when the effect of the
order of the intervention conditions cannot be separated from the effect of the intervention
conditions. For example, any observed differences between the intervention conditions
may actually be the result of a practice effect or a fatigue effect. Further, individuals may
succumb to the primacy or recency effect. Thus, in these types of studies, researchers
should vary the order in which the interventions are presented, preferably in a random
manner (i.e., counterbalancing).
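Counterbalancing of this kind can be sketched programmatically. In the hypothetical example below (condition names and participant labels are invented for illustration), every possible ordering of three conditions is used equally often across participants:

```python
# Sketch of complete counterbalancing: each participant receives the
# intervention conditions in one of the possible orderings, and every
# ordering is used equally often, so practice, fatigue, primacy, and
# recency effects are spread across conditions rather than confounded
# with any single condition.
import itertools
import random

conditions = ["lecture", "cooperative", "computer-based"]  # hypothetical

# Complete counterbalancing: enumerate every possible ordering.
all_orders = list(itertools.permutations(conditions))
print(f"{len(all_orders)} possible orders for {len(conditions)} conditions")

# Assign orders by cycling through the full set, with the
# participant-to-order pairing randomized.
participants = [f"P{i:02d}" for i in range(1, 13)]
random.seed(7)
random.shuffle(participants)
schedule = {p: all_orders[i % len(all_orders)] for i, p in enumerate(participants)}
for p in sorted(schedule):
    print(p, "->", schedule[p])
```

With 12 participants and 6 possible orders, each order is administered to exactly two participants, which is the balance the counterbalancing logic above is meant to guarantee.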
Observational bias. Observational bias occurs when the data collectors have
obtained an insufficient sampling of the behavior(s) of interest. This lack of adequate
sampling of behaviors happens if either persistent observation or prolonged engagement
does not occur (Lincoln & Guba, 1985).
Researcher bias. Researcher bias may occur during the data collection stage when
the researcher has a personal bias in favor of one technique over another. This bias may
be subconsciously transferred to the participants in such a way that their behavior is
affected. In addition to influencing the behavior of participants, the researcher could affect
study procedures or even contaminate data collection techniques. Researcher bias
particularly is a threat to internal validity when the researcher also serves as the person
implementing the intervention. For example, a teacher-researcher investigating the
effectiveness of a new instructional technique that he or she has developed and believes
to be superior to existing strategies may unintentionally influence the outcome of the
investigation.
Researcher bias can be either active or passive. Passive sources include
personality traits or attributes of the researcher (e.g., gender, ethnicity, age, type of
clothing worn), whereas active sources may include mannerisms and statements made by
the researcher that provide an indication of the researcher’s preferences. Another form of
researcher bias is when the researcher’s prior knowledge of the participants differentially
affects the participants’ behavior. In any case, the optimal approach to minimize this threat
to internal validity is to let other trained individuals, rather than the researcher, work directly
with the study participants, and perhaps even collect the data.
Matching bias. A researcher may use matching techniques to select a series of
groups of individuals (e.g., pairs) who are similar with respect to one or more
characteristics, and then assign each person within each group to one of the treatment
conditions. Alternatively, once participants have been selected for one of the treatment
conditions, a researcher may find matches for each member of this condition and assign
these matched individuals to the other treatment group(s). Unfortunately, this poses a
threat to internal validity in much the same way as does the mortality threat. Specifically,
because those individuals from the sampling frame for whom a match cannot be found are
excluded from the study, any difference between those selected and those excluded may
lead to a statistical artifact. Indeed, even though matching eliminates the possibility that
the independent variable will be confounded with group differences on the matching
variable(s), one or more of the variables not used to match the groups may be more
related to the observed findings than is the independent variable.
Treatment replication error. Treatment replication error occurs when researchers
collect data that do not reflect the correct unit of analysis. The most common form of
treatment replication error is when an intervention is administered once to each group of
participants or to a few classes or other existing groups, yet only individual outcome data
are collected (McMillan, 1999). As eloquently noted by McMillan (1999), such practice
seriously violates the assumption that each replication of the intervention for each and
every participant is independent of the replications of the intervention for all other
participants. If there is one administration of the intervention to a group, whatever
peculiarities prevail as a result of that administration are confounded with the
intervention. Moreover, systematic error likely ensues (McMillan, 1999). Further, individuals
within a group likely influence one another in a group context when being measured by the
outcome measure(s). Such confounding provides rival explanations to any subsequent
observed finding, thereby threatening internal validity at the data collection stage. This
confounding is even more severe when the intervention is administered to groups over a
long period of time because the number of confounding variables increases as a function
of time (McMillan, 1999). Both McMillan (1999) and Onwuegbuzie and Collins (2000) noted
that the majority of the research in the area of cooperative learning is flawed because of
this treatment replication error.
Disturbingly, treatment replication errors also occur in the presence of randomization
of participants to groups, specifically when the intervention is assigned to and undertaken
in groups, that is, when each participant does not respond independently of other
participants (McMillan, 1999). Thus, when there are only a limited number of independent
replications of the intervention, researchers should collect data at the group level for
subsequent analysis.
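A minimal sketch of this unit-of-analysis principle, using hypothetical class scores and only the Python standard library:

```python
# Sketch of analyzing at the correct unit: when an intervention is
# delivered once to each intact class, the class (not the student) is the
# independent replication, so outcomes are first aggregated to class means.
from statistics import mean

# Hypothetical student scores nested within classes (class -> scores).
intervention_classes = {"A": [72, 75, 78], "B": [80, 82, 84]}
control_classes = {"C": [70, 71, 72], "D": [74, 76, 75]}

# Aggregate to the class level: one mean per class.
intervention_means = [mean(s) for s in intervention_classes.values()]
control_means = [mean(s) for s in control_classes.values()]

# Any subsequent test (e.g., a t-test) would compare these class means,
# with n equal to the number of classes, not the number of students.
print("intervention class means:", intervention_means)
print("control class means:", control_means)
```

The design consequence, of course, is a much smaller effective sample size, which is one reason researchers are tempted (incorrectly) to analyze at the student level.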
Evaluation anxiety. There is little doubt that in the field of education, achievement
is the most common outcome measure. Unfortunately, evaluation anxiety, which is
experienced by many students, has the potential to threaten internal validity by introducing
systematic error into the measurement. This threat to internal validity stemming from
evaluation anxiety occurs at all levels of the educational process. For example,
Onwuegbuzie and Seaman (1995) found that graduate students with high levels of
statistics test anxiety who were randomly assigned to a statistics examination that was
administered under timed conditions tended to have lower levels of performance than did
their high-anxious counterparts who were administered the same test under untimed
conditions. These researchers concluded that when timed examinations are administered,
the subsequent results may be more reflective of anxiety level than of actual ability or
learning that has taken place as the result of the intervention. Similarly, at the elementary
and secondary school level, Hill and Wigfield (1984) have suggested that examination
scores of students with high levels of test anxiety, obtained under timed examination
conditions, may represent an invalid lower-bound estimate of their actual ability or aptitude.
Thus, researchers should be cognizant of the potential confounding role of the testing
environment at the research design/data
collection stage.
Multiple-treatment interference. Multiple-treatment interference occurs when the
same research participants are exposed to more than one intervention. Multiple-treatment
interference has been conceptualized exclusively as a threat to external validity (e.g.,
Campbell & Stanley, 1963). However, this interference also threatens internal validity.
Specifically, when individuals receive multiple interventions, carryover effects from an
earlier intervention may make it difficult to assess the effectiveness of a later treatment,
thereby providing rival explanations of the findings. Thus, a sufficient washout period is
needed for the effects of the previous intervention to dissipate, if this is possible. Typically,
the less time that elapses between the administration of the interventions, the greater the
threat to internal validity. Therefore, when designing studies in which participants receive
multiple interventions, researchers should seek to maximize the washout period, as well
as to counterbalance the administration of the interventions.
Reactive arrangements. Reactive arrangements, also known as reactivity or
participant effects, refer to a number of facets related to the way in which a study is
undertaken and the reactions of the participants involved. In other words, reactive
arrangements pertain to changes in individuals’ responses that can occur as a direct result
of being aware that one is participating in a research investigation. For example, the mere
presence of observers or equipment during a study may so alter the typical responses of
students that rival explanations for the findings prevail, which, in turn, threaten internal
validity. In virtually all research methodology textbooks, reactive arrangements are labeled
solely as a threat to external validity. Yet, reactive arrangements also provide a threat to
internal validity by confounding the findings and providing rival explanations.
Reactive arrangements comprise the following five major components: (a) the
Hawthorne effect, (b) the John Henry effect, (c) resentful demoralization, (d) the novelty
effect, and (e) the placebo effect. The Hawthorne effect represents the situation when
individuals interpret their receiving an intervention as being given special attention. As
such, the participants’ reaction to their perceived special treatment is confounded with the
effects of the intervention. The Hawthorne effect tends to increase the effect size because
individuals who perceive they are receiving preferential treatment are more likely to
participate actively in the intervention condition.
The John Henry effect, or compensatory rivalry, occurs when, on being informed that
they will be in the control group, individuals selected for this condition decide to compete
with the new innovation by expending extra effort during the investigation period. Thus, the
John Henry effect tends to reduce the effect size by artificially increasing the performance
of the control group. Resentful demoralization is similar to the John Henry effect inasmuch
as it involves the reaction of the control group members. However, instead of knowledge
of being in the control group increasing their performance levels, they become resentful
about not receiving the intervention, interpret this as a sign of being ignored or
disregarded, and become demoralized. This loss of morale consequently leads to a
reduction in effort expended and subsequent decrements in performance or other
outcomes. Thus, resentful demoralization tends to increase the effect size.
The novelty effect refers to increased motivation, interest, or participation on the
part of study participants merely because they are undertaking a different or novel task.
The novelty effect is a threat to internal validity because it competes with the effects of the
intervention as an explanation to observed findings. Unlike the Hawthorne, John Henry,
and resentful demoralization effects, in which the direction of the effect size can be
predicted, the novelty effect can either increase or decrease the effect size. For example,
a novel intervention may increase interest levels and, consequently, motivation and
participation levels, which, in turn, may be accompanied by increases in levels of
performance. This sequence of events would tend to increase the effect size pertaining to
the intervention effect. On the other hand, if a novel stimulus is introduced into the
environment that is not part of the intervention but is used to collect data (e.g., a video
camera), then participants can become distracted, thereby reducing their performance
levels. This latter example would reduce the effect size. Encouragingly, the novelty effect
often can be minimized by conducting the study for a period of time sufficient to allow the
novelty of the intervention to subside.
Finally, the placebo effect, a term borrowed from the medical field, represents a
psychological effect in which individuals in the control group attain more favorable
outcomes (e.g., more positive attitudes, higher performance levels) simply because they
believe that they are in the intervention group. This phenomenon not only reduces the
effect size but can negate it entirely and, thus, seriously affects internal validity.
Treatment diffusion. Treatment diffusion, also known as the seepage effect, occurs
when different intervention groups communicate with each other, such that some of the
treatment seeps out into another intervention group. Interest in each other’s treatments
may lead to groups borrowing aspects from each other so that the study no longer has two
or more distinctly different interventions, but overlapping interventions. In other words, the
interventions are no longer independent among groups, and the integrity of each treatment
is diffused. Treatment diffusion is quite common in the school setting where siblings may
be in different classes and, consequently, in different intervention groups. Typically, it is
the more desirable intervention that seeps out, or is diffused, into the other conditions. In
this case, treatment diffusion leads to a protocol bias for the control groups. Thus,
treatment diffusion has a tendency to reduce the effect size. However, treatment diffusion
can be minimized by having strict intervention protocols and then monitoring the
implementation of the interventions.
Time x treatment interaction. A time by treatment interaction occurs if individuals in
one group are exposed to an intervention for a longer period of time than are individuals
receiving another intervention in such a way that this differentially affects group members’
responses to the intervention. Alternatively, although participants in different groups may
receive their respective intervention for the same period of time, a threat to validity may
prevail if one of these interventions needs a longer period of time for any positive effects
to be realized. For example, suppose that a researcher wanted to compare the academic
performance of students experiencing a 4×4-block scheduling model, in which students
take four subjects for 90 minutes per day for the duration of a semester, to a block-8
scheduling model, in which students take the first four subjects for two days, the other four
subjects for another two days, and all eight subjects on the fifth day of the week. Thus,
students in the 4×4-block scheduling model are exposed to four subjects per semester
for a total of eight subjects per year, whereas students in the block-8 scheduling model are
taught eight subjects per semester. If the researcher were to compare the academic
performance after six months, although students in both groups would have experienced
the interventions for the same period of time, a time by treatment interaction threat to
internal validity likely would prevail inasmuch as students in the 4×4-block scheduling
would have received more exposure to four subjects but less exposure to the other four
subjects.
Another way in which time by treatment interaction can affect internal validity
pertains to the amount of time that elapses between administration of the pretest and
posttest. Specifically, an intervention effect based on the administration of a posttest
immediately following the end of the intervention phase may not yield the same effect if a
delayed posttest is administered some time after the end of the intervention phase.
History x treatment interaction. A history by treatment interaction occurs if the
groups being compared experience different history events and these events differentially
affect group members’ responses to the intervention. For example, suppose
a new intervention is being compared to an existing one. However, if during the course of
the study, another innovation is introduced to the school(s) receiving the new intervention,
it would be impossible to separate the effects of the new intervention from the effects of
the subsequent innovation. Unfortunately, it is common for schools to be exposed to
additional interventions while one intervention is taking place. The difference between this
particular component of history by treatment interaction threat to internal validity and the
multiple-interference threat is that, whereas the researcher has no control over the former,
the latter (i.e., multiple-treatment interference threat) is a function of the research design.
Threats to External Validity
The following 12 threats to external validity occur at the research design/data
collection stage.
Population validity. Population validity refers to the extent to which findings are
generalizable from the sample of individuals on which a study was conducted to the larger
target population of individuals, as well as across different subpopulations within the larger
target population. Utilizing large and random samples tends to increase the population
validity of results. Unfortunately, population validity is a threat in virtually all educational
studies because (a) all members of the target population rarely are available for selection
in a study, and (b) random samples are difficult to obtain due to practical considerations
such as time, money, resources, and logistics. With respect to the first consideration, most
researchers are forced to select a sample from the accessible population representing the
group of participants who are available for participation in the inquiry. Unfortunately, it
cannot be assumed that the accessible population is representative of the target
population. The degree of representativeness depends on how large the accessible
population is relative to the target population. With respect to the second consideration,
even if a random sample is taken, this does not guarantee that the sample will be
representative of either the accessible or the target population. As such, population
validity is a threat in nearly all studies, necessitating external replications, regardless of the
level of internal validity attained in a particular study.
Ecological validity. Ecological validity refers to the extent to which findings from a
study can be generalized across settings, conditions, variables, and contexts. For example,
if findings can be generalized from one school to another, from one school district to
another school district, or from one state to another, then the study possesses ecological
validity. As such, ecological validity represents the extent to which findings from a study
are independent of the setting or location in which the investigation took place. Because
schools and school districts often differ substantially with respect to variables such as
ethnicity, socioeconomic status, and academic achievement, ecological validity is a threat
in most studies.
Temporal validity. Temporal validity refers to the extent to which research findings
can be generalized across time. In other words, temporal validity pertains to the extent that
results are invariant across time. Although temporal validity is rarely discussed as a threat
to external validity by educational researchers, it is a common threat in the educational
context because most studies are conducted at one period of time (e.g., cross-sectional
studies). Thus, failure to consider the role of time at the research design/data collection
stage can threaten the external validity of the study.
Multiple-treatment interference. As noted above, multiple-treatment interference
occurs when the same research participants are exposed to more than one intervention.
Multiple treatment interference also may occur when individuals who have already
participated in a study are selected for inclusion in another, seemingly unrelated, study.
It is a threat to external validity inasmuch as it is a sequencing effect that reduces a
researcher’s ability to generalize findings to the accessible or target population because
generalization typically is limited to the particular sequence of interventions that was
administered.
Researcher bias. Researcher bias, also known as experimenter effect, has been
defined above in the threats to internal validity section. Researcher bias also poses a
threat to external validity because the findings may depend, in part, on the characteristics
and values of the researcher: the more unique the researcher’s characteristics and values
that interfere with the data collected, the less generalizable the findings.
Reactive arrangements. Reactive arrangements, as described above in the section
on internal validity, are more traditionally viewed as a threat to external validity. The five
components of reactive arrangements reduce external validity because, in their presence,
findings pertaining to the intervention become a function of which of these components
prevail. Thus, it is not clear whether, for example, an intervention effect in the presence of
the novelty effect would be the same if the novelty effect had not prevailed, thereby
threatening the generalizability of the results.
Order bias. As is the case for reactive arrangements, order bias is a threat to
external validity because in its presence, observed findings would depend on the order in
which the multiple interventions are administered. As such, findings resulting from a
particular order of administration could not be confidently generalized to situations in which
the sequence of interventions is different.
Matching bias. Matching bias is a threat to external validity to the extent that findings
from the matched participants could not be generalized to the results that would have
occurred among individuals in the accessible population for whom a match could not be
found–that is, those in the sampling frame who were not selected for the study.
Specificity of variables. Specificity of variables is a threat to external validity in
almost every study. Specificity of variables refers to the fact that any given inquiry is
undertaken (a) with a specific type of individual, (b) at a specific time, (c) at a specific
location, (d) under a specific set of circumstances, (e) based on a specific operational
definition of the independent variable, (f) using specific dependent variables, and (g) using
specific instruments to measure all the variables. The more unique the participants, time,
context, conditions, and variables, the less generalizable the findings will be. In order to
counter threats to external validity associated with specificity of variables, the researcher
must operationally define variables in a way that has meaning outside of the study setting
and exercise extreme caution in generalizing findings.
Treatment diffusion. Treatment diffusion threatens external validity inasmuch as the
extent to which the intervention is diffused to other treatment conditions threatens the
researcher’s ability to generalize the findings. As with internal validity, treatment diffusion
can threaten external validity by contaminating one of the treatment conditions in a unique
way that cannot be replicated.
Pretest x treatment interaction. Pretest by treatment interaction refers to situations
in which the administration of a pretest increases or decreases the participants’
responsiveness or sensitivity to the intervention, thereby rendering the observed findings
of the pretested group unrepresentative of the effects of the independent variable for the
unpretested population from which the study participants were selected. In this case, a
researcher can generalize the findings to pretested groups but not to unpretested groups.
The seriousness of the pretest by treatment interaction threat to external validity is
dependent on the characteristics of the research participants, the duration of the study, and
the nature of the independent and dependent variables. For example, the shorter the
study, the more the pre-intervention measures may influence the participants’ post-
intervention responses. Additionally, research utilizing self-report measures such as
attitudinal scales is more susceptible to the pretest by treatment threat.
Selection x treatment interaction. Selection by treatment interaction is similar to the
differential selection of participants threat to internal validity inasmuch as it stems from
important pre-intervention differences between intervention groups, differences that
emerge because the intervention groups are not representative of the same underlying
population. Thus, it would not be possible to generalize the results from one group to
another group. Although selection-treatment interaction tends to be more common when
participants are not randomized to intervention groups, this threat to external validity still
prevails when randomization takes place. This is because randomization does not render
the group representative of the target population.
Data Analysis
Threats to Internal Validity
As illustrated in Figure 1, the following 21 threats to internal validity occur at the data
analysis stage.
Statistical regression. As noted by Campbell and Kenny (1999), statistical
regression can occur at the data analysis stage when researchers attempt to statistically
equate groups, analyze change scores, or analyze longitudinal data. Most comparisons
made in educational research involve intact groups that may have pre-existing differences.
Unfortunately, these differences often threaten the internal validity of the findings (Gay &
Airasian, 2000). Thus, in an attempt to minimize this threat, some analysts utilize analysis
of covariance (ANCOVA) techniques that attempt to control statistically for pre-existing
differences between the groups being studied (Onwuegbuzie & Daniel, 2000).
Unfortunately, most of these published works have used ANCOVA inappropriately because
one or more of its assumptions either have not been checked or have not been met (Glass,
Peckham, & Sanders, 1972). According to Campbell and Kenny (1999), covariates always
contain measurement error, which, if large, leads to a regression artifact. Further, it is virtually
impossible to measure and to control for all influential covariates. For example, in
comparing Black and White students, many analysts attempt to adjust for socioeconomic
status or other covariates. However, in almost every case, such an adjustment represents
an under-adjustment. As illustrated by Campbell and Kenny (1999), White students
generally score higher on covariates than do Black students. Because these covariates are
positively correlated with many educational outcomes (e.g., academic achievement),
controlling for these covariates only partially adjusts for ethnic differences. Additionally,
when making such comparisons, scores of each group typically regress to different means.
Thus, statistical equating predicts more regression toward the mean than actually occurs
(Lund, 1989).
For compensatory programs, in which the control group(s) outscore the intervention
group(s) on pre-intervention measures, the bias resulting from statistical equating tends
to lead to negative bias. Conversely, for anticompensatory programs, whereby intervention
participants outscore the control participants, statistical equating tends to produce positive
bias (Campbell & Kenny, 1999). As such, statistical regression may mask the benefits of
an effective program. Conversely, negative effects of a program can become obscured as
a result of statistical regression. Simply put, statistical equating is unlikely to produce
unbiased estimates of the intervention effect. Thus, researchers should be cognizant of this
potential for bias when performing statistical adjustments.
Onwuegbuzie and Daniel (2000) discussed other problems associated with the use of
ANCOVA techniques. In particular, they noted the importance of the homogeneity of
regression slopes assumption. According to these theorists, to the extent that the individual
regression slopes are different, the part correlation of the covariate-adjusted dependent
variable with the independent variable will more closely mirror a partial correlation, and the
pooled regression slope will not provide an adequate representation of some or all of the
groups. In this case, the ANCOVA will introduce bias into the data instead of providing a
“correction” for the confounding variable (Loftin & Madison, 1991). Ironically, as noted by
Henson (1998), ANCOVA typically is appropriate when used with randomly assigned
groups; however, it is typically not justified when groups are not randomly assigned.
Another argument against the use of ANCOVA is that after using a covariate to
adjust the dependent variable, it is not clear whether the residual scores are interpretable
(Thompson, 1992). Disturbingly, some researchers utilize ANCOVA as a substitute for not
incorporating a true experimental design, believing that methodological designs and
statistical analyses are synonymous (Henson, 1998; Thompson, 1994b). In many cases,
statistical equating creates the illusion of equivalence but not the reality. Indeed, the
problems with statistical adjusting have prompted Campbell and Kenny (1999) to declare: "The
failure to understand the likely direction of bias when statistical equating is used is one of
the most serious difficulties in contemporary data analysis" (p. 85).
A popular statistical technique is to measure the effect of an intervention by
comparing pre-intervention and post-intervention scores, using analyses such as the
dependent (matched-pairs) t-test. Unfortunately, this type of analysis is affected by
regression to the mean, which tends to reduce the effect size (Campbell &
Kenny, 1999).
Also, as stated by Campbell and Kenny (1999), “in longitudinal studies with many periodic
waves of measurement, anchoring the analysis…at any one time…is likely to produce an
ever-increasing pseudo effect as the time interval increases” (p. 139). Thus, analysis of
both change scores and longitudinal data can threaten internal validity.
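Regression toward the mean in a pre/post design can be demonstrated with a hypothetical simulation in which no intervention occurs at all; the selection threshold and score model below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

true_score = rng.normal(0, 1, n)
pre = true_score + rng.normal(0, 1, n)    # pretest = true score + measurement error
post = true_score + rng.normal(0, 1, n)   # posttest; NO intervention was applied

# Select an "extreme" group on the pretest, as pre/post analyses often do.
extreme = pre > 2.0
print(round(pre[extreme].mean(), 2))   # high pretest mean
print(round(post[extreme].mean(), 2))  # posttest mean regresses toward 0
```

Any pre-to-post decline here is pure regression artifact; in a real compensatory evaluation it would be confounded with the intervention effect.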
Restricted range. Lacking the knowledge that virtually all parametric analyses
represent the general linear model, many researchers inappropriately categorize variables
in non-experimental designs using ANOVA, in an attempt to justify making causal
inferences, when all that occurs typically is a discarding of relevant variance (Cliff, 1987;
Onwuegbuzie & Daniel, 2000; Pedhazur, 1982; Prosser, 1990). For example, Cohen
(1983) calculated that the Pearson product-moment correlation between a variable and its
dichotomized version (i.e., divided at the mean) was .798, which suggests that the cost of
dichotomization is approximately a 20% reduction in the correlation coefficient. In other words,
an artificially dichotomized variable accounts for only 63.7% (i.e., .798²) as much variance
as does the original continuous variable. It follows that with factorial ANOVAs, when
artificial categorization occurs, even more power is sacrificed. Thus, restricting the range
of scores by categorizing data tends to pose a threat to internal validity at the data analysis
stage by reducing the size of the effect.
Thus, as stated by Kerlinger (1986), researchers should avoid artificially
categorizing continuous variables, unless compelled to do so as a result of the distribution
of the data (e.g., bimodal). Indeed, rather than categorizing independent variables, in many
cases, regression techniques should be used, because they have been shown consistently
to be superior to OVA methods (Onwuegbuzie & Daniel, 2000).
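Cohen's (1983) figure is straightforward to verify by simulation; the sketch below dichotomizes a standard normal variable at its mean and recovers the approximate .798 correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(200_000)
d = (x > x.mean()).astype(float)   # artificial mean-split dichotomization

r = np.corrcoef(x, d)[0, 1]
print(round(r, 3))       # close to Cohen's (1983) value of .798
print(round(r ** 2, 3))  # the dichotomy carries only about 64% of the variance
```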
Mortality. In an attempt to analyze groups with equal or approximately equal sample
sizes (i.e., to undertake a “balanced” analysis), some researchers remove some of the
participants’ scores from their final dataset. That is, the size of the largest group(s) is
deliberately reduced to resemble more closely the size of the smaller group(s). Whether
or not cases are removed randomly, this practice poses a threat to internal validity to the
extent that the participants who are removed from the dataset are different from those
who remain. That is, the practice of sub-sampling from a dataset introduces or adds bias
into the analysis, influencing the effect size in an unknown manner.
Non-interaction seeking bias. Many researchers neglect to assess the presence of
interactions when testing hypotheses. By not formally testing for interactions, researchers
may be utilizing a model that does not honor, in the optimal sense, the nature of reality that
they want to study, thereby providing a threat to internal
validity at the data analysis stage.
Type 1 to Type X error. Daniel and Onwuegbuzie (2000) have identified 10 errors
associated with statistical significance testing. These errors were labeled Type I to Type
X. The first four errors are known to all statisticians as Type I (falsely rejecting the null
hypothesis), Type II (incorrectly failing to reject the null hypothesis), Type III (incorrect
inferences about result directionality), and Type IV (incorrectly following up an interaction
effect with a simple effects analysis). The following six additional types of error were
identified by Daniel and Onwuegbuzie (2000): (a) Type V error (internal replication error),
measured via the incidence of Type I or Type II errors detected during internal replication
cycles when using methodologies such as the jackknife procedure; (b) Type VI error
(reliability generalization error), measured via linkages of statistical results to
characteristics of scores on the measures used to generate results (a particularly
problematic type of error when researchers fail to consider differential reliability estimates
for subsamples within a data set); (c) Type VII error (heterogeneity of
variance/regression error), measured via the extent to which data treated via analysis of
variance/covariance are not appropriately screened to determine whether they meet
homogeneity assumptions prior to analysis of group comparison statistics; (d) Type VIII
error (test directionality error), measured as the extent to which researchers express
alternative hypotheses as directional yet assess results with two-tailed tests; (e) Type IX
error (sampling bias error), measured via disparities in results generated from numerous
convenience samples across a multiplicity of similar studies; and (f) Type X error (degrees
of freedom error), measured as the tendency of researchers using certain statistical
procedures (chiefly stepwise procedures) to erroneously compute the degrees of freedom
utilized in these procedures. All of these errors pose a threat to internal validity at the data
analysis stage.
Observational bias. In studies in which observations are made, an initial part of the
data analysis often involves coding the observations. Whenever inter-rater reliability of the
coding scheme is less than 100%, internal validity is threatened. Thus, researchers should
always attempt to assess the inter-rater reliability of any coding of observations. When
inter-rater reliability estimates cannot be obtained because there is only one rater, intra-rater
reliability estimates should be assessed.
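One common index for this purpose is Cohen's kappa, which corrects raw percent agreement for chance agreement. The sketch below uses invented binary codes from two hypothetical raters:

```python
import numpy as np

# Hypothetical codes assigned by two raters to the same 12 observations.
rater_a = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
rater_b = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1])

observed = np.mean(rater_a == rater_b)              # raw percent agreement
p_both_1 = rater_a.mean() * rater_b.mean()          # chance both code 1
p_both_0 = (1 - rater_a.mean()) * (1 - rater_b.mean())  # chance both code 0
expected = p_both_1 + p_both_0

kappa = (observed - expected) / (1 - expected)      # Cohen's kappa
print(round(observed, 2), round(kappa, 2))
```

Note that kappa is noticeably lower than raw agreement, which is why percent agreement alone can overstate the reliability of a coding scheme.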
Researcher bias. Perhaps the biggest form of researcher bias is what has been
termed the halo effect. The halo effect occurs when a researcher is evaluating open-ended
responses, or the like, and allows his or her prior knowledge of the participants to influence
the scores given. This results in findings that are biased. Clearly, this is a threat to internal
validity at the data analysis stage.
Matching bias. Another common data analysis technique is to match groups after
the data on the complete sample have been collected. Unfortunately, the ability of
matching to equate groups is, again, often more of an illusion than a reality (Campbell &
Kenny, 1999). Moreover, bias is introduced as a result of omitting those who were not
matched, providing a threat to internal validity.
Treatment replication error. Using an inappropriate unit of analysis is a common
mistake made by researchers (McMillan, 1999). The treatment replication error threat to
internal validity occurs at the data analysis stage when researchers utilize an incorrect unit
of analysis even though data are available for them to engage in a more appropriate
analysis. For example, in analyzing data pertaining to cooperative learning groups, an
investigator may refrain from analyzing available group scores. That is, even though the
intervention is given to groups of students, the researcher might incorrectly use individual
students as the unit of analysis, instead of utilizing each group as a treatment unit and
analyzing group data. Unfortunately, analyzing individual students’ scores does not take
into account possible confounding factors. Although it is likely that analyzing group data
instead of individual data results in a loss of statistical power due to a reduction in the
number of treatment units, the loss in power typically is compensated for by the fact that
using group data is more free from contamination. Moreover, analyzing individual data
when groups received the intervention violates the independence assumption, thereby
providing a serious threat to internal validity. Usually, independence violations tend to
inflate both the Type I error rate and effect size estimates. Thus, researchers always
should analyze data at the group level for subsequent analysis when there is a limited
number of interventions independently replicated.
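A sketch of the recommended group-level analysis, using invented cooperative-learning data, might look as follows (the group sizes, effects, and test choice are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical data: 6 cooperative-learning groups of 5 students per condition.
# Scores within a group share a group-level effect, so students are not independent.
def make_condition(shift):
    group_effects = rng.normal(shift, 1.0, 6)              # one effect per group
    return [effect + rng.normal(0, 1.0, 5) for effect in group_effects]

treat, control = make_condition(0.5), make_condition(0.0)

# Correct unit of analysis: the group mean, one value per treatment unit.
treat_means = [g.mean() for g in treat]
control_means = [g.mean() for g in control]
t, p = stats.ttest_ind(treat_means, control_means)
print(round(t, 2), round(p, 3))
```

Analyzing the 30 individual scores per condition instead would violate the independence assumption and tend to understate the p-value, as the passage above explains.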
Violated assumptions. Disturbingly, it is clear that many researchers do not
adequately check the underlying assumptions associated with a particular statistical test.
This is evidenced by the paucity of researchers who provide information about the extent
to which assumptions are met (see for example, Keselman et al., 1998; Onwuegbuzie,
1999). Thus, researchers always should check model assumptions. For example, if the
normality assumption is violated, analysts should utilize the non-parametric counterparts.
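A minimal sketch of this screening step, assuming hypothetical skewed score distributions, is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.lognormal(0.0, 1.0, 50)   # heavily skewed, clearly non-normal scores
b = rng.lognormal(0.3, 1.0, 50)

# Screen the normality assumption before choosing the significance test.
_, p_norm_a = stats.shapiro(a)
_, p_norm_b = stats.shapiro(b)

if min(p_norm_a, p_norm_b) < 0.05:
    stat, p = stats.mannwhitneyu(a, b)   # non-parametric counterpart
    test_used = "Mann-Whitney U"
else:
    stat, p = stats.ttest_ind(a, b)
    test_used = "independent t-test"
print(test_used, round(p, 3))
```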
Multicollinearity. Most analysts do not appear to evaluate multicollinearity among the
regression variables (Onwuegbuzie & Daniel, 2000). However, multicollinearity is a more
common threat than researchers acknowledge or appear to realize. For example,
race/ethnicity and socioeconomic status often are confounded with each other in such a
way that the presence of one variable in a model may affect the predictive power of the
other variable. Moreover, multicollinearity leads to inflated or unstable statistical
coefficients, thereby providing rival explanations for the findings. Thus, multicollinearity
should routinely be assessed in multiple regression models.
Mis-specification error. Mis-specification error is perhaps the most hidden threat to
internal validity. This error, which involves omitting one or more important variables from
the final model, often stems from a weak or non-existent theoretical framework for building
a statistical model. This inattention to a theoretical framework leads many researchers to
utilize data-driven techniques such as stepwise multiple regression procedures (i.e.,
forward selection, backward selection, stepwise selection). Indeed, the use of stepwise
regression in educational research is rampant (Huberty, 1994), probably due to its
widespread availability on statistical computer software programs. As a result of this
seeming obsession with stepwise regression, as stated by Cliff (1987, pp. 120-121), “a
large proportion of the published results using this method probably present conclusions
that are not supported by the data.”
Mis-specification error also includes non-interaction seeking bias, discussed above,
in which interactions are not tested. Indeed, this is a particular problem when undertaking
structural equation modeling (SEM) techniques. Many SEM software packages do not facilitate the
statistical testing of interaction terms. Unfortunately, mis-specification error, although likely
common, is extremely difficult to detect, especially if the selected non-optimal model, which
does not include any interaction terms, fits the data adequately.
Threats to External Validity
As illustrated in Figure 1, the following five threats to external validity occur at the
data analysis stage: population validity, researcher bias, specificity of variables, matching
bias, and mis-specification error. All of these threats have been discussed above. Thus,
only a brief mention will be made of
each.
Population validity. Every time a researcher analyzes a subset of her or his dataset,
it is likely that findings emerging from this subset are less generalizable than are those that
would have arisen if the total sample had been used. In other words, any kind of sub-
sampling from the dataset likely decreases population validity. The greater the discrepancy
between those sampled and those not sampled from the full dataset, the greater the threat
to population validity. Additionally, threats to population validity often occur at the data
analysis stage because researchers fail to disaggregate their data, incorrectly assuming
that their findings are invariant across all sub-samples inherent in their study. In fact, when
possible, researchers should utilize condition-seeking methods, whereby they “seek to
discover which, of the many conditions that were confounded together in procedures that
have obtained a finding, are indeed necessary or sufficient” (Greenwald et al., 1986, p.
223).
Researcher bias. Researcher bias, such as the halo effect, not only affects internal
validity at the data analysis stage, but also threatens external validity because the
particular type of bias of the researcher may be so unique as to make the findings
ungeneralizable.
Specificity of variables. As noted above, specificity of variables is one of the most
common threats to external validity at the research design/data collection stage. Indeed,
seven ways in which specificity of variables is a threat to external validity at this stage were
identified above (type of participants, time, location, circumstance, operational definition
of the independent variables, operational definition of the dependent variables, and types
of instruments used). At the data analysis stage, specificity of variables also can be an
external validity threat vis-à-vis the manner in which the independent and dependent
variables are operationalized. For example, in categorizing independent and dependent
variables, many researchers use local norms; that is, they classify participants’ scores
based on the underlying distribution. Because every distribution of scores is sample
specific, the extent to which a variable categorized using local norms can be generalized
outside the sample is questionable. Simply put, the more unique the operationalization of
the variables, the less generalizable will be the findings. In order to counter threats to
external validity associated with the operationalization of variables, when possible, the researcher
should utilize variables in ways that are transferable (e.g., using standardized norms).
Matching bias. Some researchers match individuals in the different intervention
groups just prior to analyzing the data. Matching provides a threat to external validity at this
stage if those not selected for matching from the dataset are in some important way
different than those who are matched, such that the findings from the selected individuals
may not be generalizable to the unselected persons.
Mis-specification error. As discussed above, mis-specification error involves omitting
one or more important variables (e.g., interaction terms) from the analysis. Although a final
model selected may have acceptable internal validity, such omission reduces the external
validity because it is not clear whether the findings would be the same if the omitted
variable(s) had been included.
Data Interpretation
Threats to Internal Validity
As illustrated in Figure 1, the following eight threats to internal validity occur at the
data interpretation stage.
Effect size. As noted by Onwuegbuzie and Daniel (2000), perhaps the most
prevalent error made in quantitative research, which appears across all types of inferential
analyses, involves the incorrect interpretation of statistical significance and the related
failure to report and to interpret confidence intervals and effect sizes (i.e., variance-
accounted-for effect sizes or standardized mean differences) (Daniel, 1998a, 1998b;
Ernest & McLean, 1998; Knapp, 1998; Levin, 1998; McLean & Ernest, 1998; Nix &
Barnette, 1998a, 1998b; Thompson, 1998b). This error, which occurs at the data
interpretation stage, threatens internal validity because it often leads to under-interpretation
of associated p-values when sample sizes are small and the corresponding effect sizes are
large, and an over-interpretation of p-values when sample sizes are large and effect sizes
are small (e.g., Daniel, 1998a). Because of this common confusion between significance
in the probabilistic sense (i.e., statistical significance) and significance in the practical
sense (i.e., effect size), some researchers (e.g., Daniel, 1998a) have recommended that
authors insert the word “statistically” before the word “significant,” when interpreting the
findings of a null hypothesis statistical test. Thus, as stated by the APA Task Force,
researchers should “always present effect sizes for primary outcomes…[and]…reporting
and interpreting effect sizes…is essential to good research” (Wilkinson & the Task Force
on Statistical Inference, 1999, pp. 10-11).
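The divergence between statistical and practical significance is easy to demonstrate. In the invented example below, a negligible true difference becomes "statistically significant" purely because the sample is large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# A trivially small true difference (d ~ 0.04) with a huge sample ...
a = rng.normal(0.00, 1, 50_000)
b = rng.normal(0.04, 1, 50_000)
t, p = stats.ttest_ind(a, b)

# ... is "statistically significant", yet the effect size is negligible.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd   # Cohen's d
print(round(p, 4), round(d, 3))
```

Reporting d alongside p, as the APA Task Force recommends, makes the triviality of the effect immediately visible.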
Confirmation bias. Confirmation bias is the tendency for interpretations and
conclusions based on new data to be overly consistent with preliminary hypotheses
(Greenwald et al., 1986). Unfortunately, confirmation bias is a common threat to internal
validity at the data interpretation stage, and has been identified via expectancy biasing of
student achievement, perseverance of belief in discredited hypotheses, the primacy effects
in impression formation and persuasion, delayed recovery of simple solutions, and
selective retrieval of information that confirms the researcher’s hypotheses, opinions, or
self-concept (Greenwald et al., 1986). Apparently, confirmation bias is more likely to prevail
when the researcher is seeking to test theory than when he or she is attempting to
generate theory, because testing a theory can “dominate research in a way that blinds the
researcher to potentially informative observation” (Greenwald et al., 1986, p. 217). When
hypotheses are not supported, a common practice of researchers is to proceed as if the
theory underlying the hypotheses is still likely to be correct. In proceeding in this manner,
many researchers fail to realize that their research methodology no longer can be
described as theory testing but theory confirming.
Notwithstanding, confirmation bias, per se, does not necessarily pose a threat to
internal validity. It threatens internal validity at the data interpretation stage only when there
exists one or more plausible rival explanations to underlying findings that might be
demonstrated to be superior if given the opportunity. Conversely, when no rival
explanations prevail, confirmation bias helps to provide support for the best or sole
explanation of results (Greenwald et al., 1986). However, because rival explanations
typically permeate educational research studies, researchers should be especially
cognizant of the role of confirmation bias on the internal validity of the results at the data
interpretation stage.
Statistical regression. When a study involves extreme group selection, matching,
statistical equating, change scores, time-series studies, or longitudinal studies, researchers
should be especially careful when interpreting data because, as noted above, findings from
such investigations often reflect some degree of regression toward the mean (Campbell &
Kenny, 1999).
Distorted graphics. Researchers should be especially careful when interpreting
graphs. In particular, when utilizing graphs (e.g., histograms) to check model assumptions,
in a desire to utilize parametric techniques, it is not unusual for researchers to conclude
incorrectly that these assumptions hold. Thus, when possible, graphical checks should be
triangulated by empirical evaluation. For example, in addition to examining histograms,
analysts could examine the skewness and kurtosis coefficients, and even undertake
statistical tests of normality (e.g., the Shapiro-Wilk test; Shapiro & Wilk, 1965; Shapiro,
Wilk, & Chen, 1968).
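A sketch of such triangulation, using an invented mildly skewed sample that might pass a casual histogram inspection, is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
sample = rng.lognormal(0, 0.4, 200)   # mildly skewed hypothetical scores

# Triangulate the visual impression with numerical evidence.
print(round(stats.skew(sample), 2))      # symmetric data would be near 0
print(round(stats.kurtosis(sample), 2))  # excess kurtosis; normal data near 0
w, p = stats.shapiro(sample)             # Shapiro-Wilk test of normality
print(round(p, 4))
```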
Illusory correlation. The illusory correlation represents a tendency to overestimate
the relationship among variables that are only slightly related or not related at all. Often,
the illusory correlation stems from a confirmation bias. The illusory correlation also may
arise from a false consensus bias, in which researchers have the false belief that most
other individuals share their interpretations of a relationship. Such an illusory correlation
poses a serious threat to internal validity at the data interpretation stage.
Crud factor. As noted by Onwuegbuzie and Daniel (in press), as the sample size
increases, so does the probability of rejecting the null hypothesis of no relationship
between two variables. Indeed, theoretically, given a large enough sample size, the null
hypothesis always will be rejected (Cohen, 1994). Hence, it can be argued that “everything
correlates to some extent with everything else” (Meehl, 1990, p. 204). Meehl referred to
this tendency to reject null hypotheses in the face of trivial relationships as the crud factor.
This crud factor leads some researchers to identify and to interpret relationships that are
not real but represent statistical artifacts, posing a threat to internal validity at the data
interpretation stage.
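The crud factor can be illustrated with a hypothetical simulation: a near-zero true correlation is still declared statistically significant once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
n = 1_000_000

x = rng.normal(0, 1, n)
y = 0.01 * x + rng.normal(0, 1, n)   # true correlation ~ .01: essentially "crud"

r, p = stats.pearsonr(x, y)
# The relationship is trivial (r squared ~ 0.0001), yet with a large
# enough sample the null hypothesis is rejected anyway.
print(round(r, 3), p < 0.05)
```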
Positive manifold. Positive manifold refers to the phenomenon that individuals who
perform well on one ability or attitudinal measure tend to perform well on other measures
in the same domain (Neisser, 1998). Thus, researchers should be careful when interpreting
relationships found between two or more sets of cognitive test scores or attitudinal scores.
As noted by Onwuegbuzie and Daniel (in press), particular focus should be directed toward
effect sizes, as opposed to p-values.
Causal error. In interpreting statistically significant relationships, researchers often infer
cause-and-effect relationships, even though such associations can, at best, only be determined
from experimental studies. Causality often can be inferred from scientific experiments when
the selected independent variable(s) are carefully controlled. Then if the dependent
variable is observed to change in a predictable way as the value of the independent
variable changes, the most plausible explanation would be a causal relationship between
the independent and the dependent variables. In the absence of such control and ability
to manipulate the independent variable, the plausibility that at least one more unidentified
variable is mediating the relationship between both variables will remain.
Interestingly, Kenny (1979) distinguished between correlational and causal
inferences, noting that four conditions must exist before a researcher may justifiably claim
that X causes Y: (a) time precedence (X must precede Y in time), (b) functional relationship
(Y should be conditionally distributed across X), (c) nonspuriousness (there must not be
a third variable Z that causes both X and Y, such that when Z is controlled for, the
relationship between X and Y vanishes), and (d) vitality (a logical link between X and Y
that substantiates the likelihood of a causal link, such as would be established via
controlled experimental conditions). However, it is extremely difficult for these four
conditions to be met simultaneously in correlational designs. Consequently, substantiating
causal links in uncontrolled (correlational and intervention) studies is a very difficult task
(Onwuegbuzie & Daniel, in press). Thus, researchers should pay special attention when
interpreting findings stemming from non-experimental research. Unfortunately, some
researchers and policy makers are prone to ignore threats to internal validity when
interpreting relationships among variables.
Threats to External Validity
As illustrated in Figure 1, the following three threats to external validity occur at the
data interpretation stage: population validity, ecological validity, and temporal validity. All
of these threats have been discussed above. Thus, only a brief mention will be made of
each.
Population validity/Ecological validity/Temporal validity. When interpreting findings
stemming from small and/or non-random samples, researchers should be very careful not
to over-generalize their conclusions. Instead, researchers always should compare their
findings to the extant literature as comprehensively as is possible, so that their results can
be placed in a realistic context. Only if findings are consistent across different populations,
locations, settings, times, and contexts can researchers be justified in making
generalizations to the target population. Indeed, researchers and practitioners must refrain
from assuming that one study, conducted without any external replications, can ever
adequately answer a research question. Thus, researchers should focus more on
advocating external replications and on providing directions for future research than on
making definitive conclusions. When interpreting findings, researchers should attempt to
do so via the use of disaggregated data, utilizing the condition-seeking methods, in which
a progression of qualifying conditions are made based on existing findings (Greenwald et
al., 1986). Such condition-seeking methods would generate a progression of research
questions, which, if addressed in future studies, would provide increasingly accurate and
generalizable conclusions. Simply put, researchers should attempt, at best, to make
qualified conclusions.
Summary and Conclusions
The present paper has sought to promote the dialogue about threats to internal and
external validity in educational research in general and empirical research in particular.
First, several rationales were provided for identifying and discussing threats to internal and
external validity not only in experimental studies, but also in all other types of quantitative
research designs (e.g., descriptive, correlational, causal-comparative). Specifically, it was
contended that providing information about sources of invalidity and rival explanations (a)
allows readers better to contextualize the underlying findings, (b) promotes external
replications, (c) provides directions for future research, and (d) advances the conducting
of validity meta-analyses and thematic effect sizes.
Second, the validity frameworks of Campbell and Stanley (1963), Huck and Sandler
(1979), and McMillan (2000) were described. It was noted that these three sets of theorists
are the only ones who appear to have provided a list of internal and external validity
threats. Unfortunately, none of these frameworks identified sources of result invalidity that
were applicable across all types of quantitative research designs. Third, it was asserted
that in order to encourage empirical discussion of internal and external validity threats in
all empirical studies, a framework was needed that is more comprehensive than are the
existing ones, and that seeks to unify all quantitative research designs under one validity
umbrella.
Fourth, threats to internal and external validity were conceptualized as occurring at
the three major stages of the research process, namely, research design/data collection,
data analysis, and data interpretation. Using this conceptualization, and building on the
works of Campbell and Stanley (1963), Huck and Sandler (1979), and McMillan (2000), a
comprehensive model of dimensions of sources of validity was developed. This model was
represented as a 3 (stage of research process) x 2 (internal vs. external validity) matrix
comprising 49 unique dimensions of internal and external validity threats, with many of the
dimensions containing sub-dimensions (cf. Figure 1).
Although this model of sources of validity is comprehensive, it is by no means
exhaustive. Indeed, researchers and practitioners alike are encouraged to find ways to
improve upon this framework. Indeed, the author currently is formally assessing the internal
and external validity of this model by attempting to determine how prevalent each of these
threats are in the extant educational literature.
Nevertheless, it is hoped that this paper makes it clear that every inquiry contains
multiple threats to internal and external validity, and that researchers should exercise
extreme caution when making conclusions based on one or a few studies. Additionally, it
is hoped that this model highlights the importance of assessing sources of invalidity in every
research study and at different stages of the research process. For example, just because
threats to internal and external validity have been minimized at one phase of the research
study does not mean that sources of invalidity do not prevail at the other stages.
Moreover, it is hoped that the present model not only extends the dialogue on
threats to internal and external validity, but also provides a broader guideline for doing so
than has previously been undertaken. However, in order to promote further discussion of
these threats, journal editors must be receptive to this information, and not use it as a
vehicle to justify the rejection of manuscripts. Indeed, journal reviewers and editors should
strongly encourage all researchers to include a discussion of the major rival hypotheses
in their investigations. In order to motivate researchers to undertake this, it must be made
clear to them that such practice would improve the quality of their papers, not diminish it.
Indeed, future revisions of the American Psychological Association Publication Manual
(APA, 1994) should provide strong encouragement for all empirical research reports to
include a discussion of threats to internal and external validity. Additionally, the Manual
should urge researchers to furnish a summary of the major threats to internal and external
validity for some or even all of the studies that are included in their reviews of the related
literature. Unless there is a greater emphasis on validity in research, threats to internal and
external validity will continue to prevail at various stages of the research design, and many
findings will continue to be misinterpreted and over-generalized. Thus, an increased focus
on internal and external validity in all empirical studies can only help the field of educational
research by helping investigators to be more reflective at every stage of the research
process.
References
American Psychological Association. (1994). Publication manual of the American
Psychological Association (4th ed.). Washington, DC: Author.
Campbell, D.T. (1957). Factors relevant to the validity of experiments in social
settings. Psychological Bulletin, 54, 297-312.
Campbell, D.T., & Kenny, D.A. (1999). A primer on regression artifacts. New York:
The Guilford Press.
Campbell, D.T., & Stanley, J.C. (1963). Experimental and quasi-experimental
designs for research. Chicago: Rand McNally.
Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt Brace Jovanovich.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement,
7, 249-253.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York:
John Wiley.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-
1003.
Daniel, L.G. (1998a). Statistical significance testing: A historical overview of misuse
and misinterpretation with implications for editorial policies of educational journals.
Research in the Schools, 5, 23-32.
Daniel, L.G. (1998b). The statistical significance controversy is definitely not over:
A rejoinder to responses by Thompson, Knapp, and Levin. Research in the Schools, 5, 63-
65.
Daniel, L.G., & Onwuegbuzie, A.J. (2000, November). Toward an extended typology
of research errors. Paper presented at the annual conference of the Mid-South Educational
Research Association, Bowling Green, KY.
Ernest, J.M., & McLean, J.E. (1998). Fight the good fight: A response to Thompson,
Knapp, and Levin. Research in the Schools, 5, 59-62.
Gay, L.R., & Airasian, P.W. (2000). Educational research: Competencies for
analysis and application (6th ed.). Englewood Cliffs, NJ: Prentice Hall.
Glass, G.V., Peckham, P.D., & Sanders, J.R. (1972). Consequences of failure to
meet assumptions underlying the fixed effects analyses of variance and covariance.
Review of Educational Research, 42, 237-288.
Greenwald, A.G., Pratkanis, A.R., Leippe, M.R., & Baumgardner, M.H. (1986).
Under what conditions does theory obstruct research progress? Psychological Review, 93,
216-229.
Henson, R.K. (1998, November). ANCOVA with intact groups: Don’t do it! Paper
presented at the annual meeting of the Mid-South Educational Research Association, New
Orleans, LA.
Hill, K.T., & Wigfield, A. (1984). Test anxiety: A major educational problem and what
can be done about it. The Elementary School Journal, 85, 105-126.
Huberty, C.J. (1994). Applied discriminant analysis. New York: Wiley and Sons.
Huck, S.W., & Sandler, H.M. (1979). Rival hypotheses: Alternative interpretations
of data based conclusions. New York: Harper Collins.
Johnson, B., & Christensen, L. (2000). Educational research: Quantitative and
qualitative approaches. Boston, MA: Allyn and Bacon.
Kenny, D. A. (1979). Correlation and causality. New York: John Wiley & Sons.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York:
Holt, Rinehart and Winston.
Keselman, H.J., Huberty, C.J., Lix, L.M., Olejnik, S., Cribbie, R.A., Donahue, B.,
Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D., Keselman, J.C., & Levin, J.R. (1998).
Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA,
and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Knapp, T.R. (1998). Comments on the statistical significance testing articles.
Research in the Schools, 5, 39-42.
Levin, J.R. (1998). What if there were no more bickering about statistical
significance tests? Research in the Schools, 5, 43-54.
Lincoln, Y.S., & Guba, E.G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Loftin, L.B., & Madison, S.Q. (1991). The extreme dangers of covariance
corrections. In B. Thompson (Ed.), Advances in educational research: Substantive findings,
methodological developments (Vol. 1, pp. 133-147). Greenwich, CT: JAI Press.
Lund, T. (1989). The statistical regression phenomenon: II. Application of a
metamodel. Scandinavian Journal of Psychology, 30, 2-11.
McLean, J.E., & Ernest, J.M. (1998). The role of statistical significance testing in
educational research. Research in the Schools, 5, 15-22.
McMillan, J.H. (1999). Unit of analysis in field experiments: Some design
considerations for educational researchers. (ERIC Document Reproduction Service No.
ED 428 135)
McMillan, J.H. (2000, April). Examining categories of rival hypotheses for
educational research. Paper presented at the annual meeting of the American Educational
Research Association, New Orleans, LA.
Meehl, P. (1990). Why summaries of research on psychological theories are often
uninterpretable. Psychological Reports, 66, 195-244.
Miles, M.B., & Huberman, A.M. (1984). Drawing valid meaning from qualitative data:
Toward a shared craft. Educational Researcher, 13, 20-30.
Mundfrom, D.J., Shaw, D.G., Thomas, A., Young, S., & Moore, A.D. (1998, April).
Introductory graduate research courses: An examination of the knowledge base. Paper
presented at the annual meeting of the American Educational Research Association, San
Diego, CA.
Neisser, U. (1998). Rising test scores. In U. Neisser (Ed.), The rising curve (pp. 3-
22). Washington, DC: American Psychological Association.
Nix, T.W., & Barnette, J. J. (1998a). The data analysis dilemma: Ban or abandon.
A review of null hypothesis significance testing. Research in the Schools, 5, 3-14.
Nix, T.W., & Barnette, J. J. (1998b). A review of hypothesis testing revisited:
Rejoinder to Thompson, Knapp, and Levin. Research in the Schools, 5, 55-58.
Onwuegbuzie, A.J. (1999, September). Common analytical and interpretational
errors in educational research. Paper presented at the annual meeting of the European
Educational Research Association (EERA), Lahti, Finland.
Onwuegbuzie, A.J. (2000a). The prevalence of discussion of threats to internal and
external validity in research reports. Manuscript submitted for publication.
Onwuegbuzie, A.J. (2000b, November). Positivists, post-positivists, post-
structuralists, and post-modernists: Why can’t we all get along? Towards a framework for
unifying research paradigms. Paper to be presented at the annual meeting of the
Association for the Advancement of Educational Research (AAER), Ponte Vedra, Florida.
Onwuegbuzie, A.J. (2000c, November). Effect sizes in qualitative research. Paper
to be presented at the annual conference of the Mid-South Educational Research
Association, Bowling Green, KY.
Onwuegbuzie, A.J., & Collins, K.M. (2000). Group heterogeneity and performance
in graduate-level educational research courses: The role of aptitude by treatment
interactions and Matthew effects. Manuscript submitted for publication.
Onwuegbuzie, A.J., & Daniel, L.G. (2000, April). Common analytical and
interpretational errors in educational research. Paper presented at the annual meeting of
the American Educational Research Association, New Orleans, LA.
Onwuegbuzie, A.J., & Daniel, L.G. (in press). Uses and misuses of the correlation
coefficient. Research in the Schools.
Onwuegbuzie, A.J., & Seaman, M. (1995). The effect of time and anxiety on
statistics achievement. Journal of Experimental Psychology, 63, 115-124.
Pedhazur, E.J. (1982). Multiple regression in behavioral research: Explanation and
prediction (2nd ed.). New York: Holt, Rinehart and Winston.
Prosser, B. (1990, January). Beware the dangers of discarding variance. Paper
presented at the annual meeting of the Southwest Educational Research Association,
Austin, TX. (ERIC Reproduction Service No. ED 314 496)
Rogers, E.M. (1995). Diffusion of innovations (4th ed.). New York: The Free Press.
Shapiro, S.S., & Wilk, M.B. (1965). An analysis of variance test for normality
(complete samples). Biometrika, 52, 591-611.
Shapiro, S.S., Wilk, M.B., & Chen, H.J. (1968). A comparative study of various tests
for normality. Journal of the American Statistical Association, 63, 1343-1372.
Smith, M.L., & Glass, G.V. (1987). Research and evaluation in education and the
social sciences. Englewood Cliffs, NJ: Prentice Hall.
The American Statistical Association. (1999). Ethical guidelines for statistical
practice [On-line]. Available: http://www.amstat.org/profession/ethicalstatistics.html
Thompson, B. (1992, April). Misuse of ANCOVA and related “statistical control”
procedures. Reading Psychology: An International Quarterly, 13, iii-xvii.
Thompson, B. (1994a). The pivotal role of replication in psychological research:
Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157-
176.
Thompson, B. (1994b). Common methodological mistakes in dissertations, revisited.
Paper presented at the annual meeting of the American Educational Research Association,
New Orleans, LA (ERIC Document Reproduction Service No. ED 368 771)
Thompson, B. (1998b). Statistical testing and effect size reporting: Portrait of a
possible future. Research in the Schools, 5, 33-38.
Wilkinson, L. & the Task Force on Statistical Inference. (1999). Statistical methods
in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Figure Caption
Figure 1. Major dimensions of threats to internal validity and external validity at the three
major stages of the research process.
[Figure 1: the figure lists the threats to internal validity/internal replication and the threats to external validity/external replication that operate at each of the three stages of the research process: research design/data collection, data analysis, and data interpretation.]
[ERIC Reproduction Release form (document TM032235). Author contact: Anthony J. Onwuegbuzie, Ph.D., Department of Educational Leadership, College of Education, Valdosta State University, Valdosta, Georgia 31698.]
1/12/2018 Imprimir
https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-32,navpoint-33,navpoint-34,navpoint-35,navpoint-36,navpoint-37,navpoint-38… 1/40
5 Experimental Designs—Explaining Behavior
Learning Outcomes
By the end of this chapter, you should be able to:
Use appropriate terminology when discussing experimental designs.
Identify the key features of experiments for making causal statements.
Explain the importance of both internal and external validity in experiments.
Describe the threats to both internal and external validity in experiments.
Outline the most common types of experimental designs.
Describe methods for analyzing experimental data.
Summarize methods for avoiding Type I and Type II error.
One of the oldest debates within psychology concerns the relative contributions of biology and the environment in
shaping our thoughts, feelings, and behaviors. Do we become who we are because it is hard-wired into our DNA, or
because of our early experiences? Do people share their parents’ personality quirks because they carry their
parents’ genes, or because they grew up in their parents’ homes? Researchers can, in fact, address these types of
questions in several ways. A consortium of researchers at the University of Minnesota has spent the past three
decades comparing pairs of identical and fraternal twins, raised in the same versus different households, to tease
apart the contributions of genes and environment. Read more at the research group’s website,
http://mctfr.psych.umn.edu/.
An alternative way to separate genetic and environmental influence is through the use of experimental designs,
which have the primary goal of explaining the causes of behavior. Recall from the design overview in Chapter 2 (2.1)
that experiments can address causal relationships because the experimenter has control over the environment as
well as over the manipulation of variables. One particularly ingenious example comes from the laboratory of Michael
Meaney, a professor of psychiatry and neurology at McGill University. Meaney used female rats as experimental
subjects (Francis, Diorio, Liu, & Meaney, 1999). His earlier research had revealed that the parenting ability of female
rats could be reliably classified based on how attentive they were to their rat pups, as well as how much time they
spent grooming the pups. The question tackled in the 1999 study was whether these behaviors were learned from
the rats’ own mothers or transmitted genetically. To answer this question experimentally, Meaney and colleagues
had to think very carefully about the comparisons they wanted to make. To simply compare the offspring of good
and bad mothers would have been insufficient—this approach could not distinguish between genetic and
environmental pathways.
Instead, Meaney decided to use a technique called cross-fostering, or switching rat pups from one mother to another
as soon as they were born. The technique resulted in four combinations of rats: (1) those born to inattentive
mothers but raised by attentive ones, (2) those born to attentive mothers but raised by inattentive ones, (3) those
born and raised by attentive mothers, and (4) those born and raised by inattentive mothers. Meaney then tested the
rat pups several months later and observed the way they behaved with their own offspring. Meaney’s control over
all aspects of how the rat pups were raised was a critical element; he was able to keep everything the same except
for the combination of their genetics and rearing environment. The setup of this experiment allowed Meaney to
make clear comparisons between the influence of birth mothers and the rearing process. At the end of the study, the
conclusion was crystal clear: Maternal behavior is all about the environment. Those rat pups that ultimately grew up
to be inattentive mothers were those who had been raised by inattentive mothers.
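The cross-fostering logic above amounts to fully crossing two factors. As a minimal sketch (the label strings are illustrative, not taken from the original study), the four combinations can be enumerated like this:

```python
from itertools import product

# Sketch of the cross-fostering design described above. Each rat pup
# has a birth mother and a rearing mother, and each of those can be
# attentive or inattentive; crossing the two factors yields the four
# groups listed in the text.
birth_mother = ["attentive", "inattentive"]
rearing_mother = ["attentive", "inattentive"]

groups = [
    {"born_to": b, "raised_by": r}
    for b, r in product(birth_mother, rearing_mother)
]

for g in groups:
    print(f"born to an {g['born_to']} mother, raised by an {g['raised_by']} one")
```

Comparing all four groups, rather than just offspring of good and bad mothers, is what lets the design separate the genetic pathway (born_to) from the environmental one (raised_by).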
This final chapter is dedicated to experimental designs, in which the primary goal is to explain behavior.
Experimental designs rank highest on the continuum of control (see Figure 5.1) because the experimenter can
manipulate variables, minimize extraneous variables, and assign participants to conditions. The chapter begins with
an overview of the key features of experiments and then explains the importance of both internal and external
validity of experiments. From there, the discussion moves to the process of designing and interpreting experiments
and concludes with a summary of strategies for minimizing error in experiments.
Figure 5.1: Experimental designs on the continuum of control
5.1 Experiment Terminology
Before we dive into the details, it is important to cover the terminology that the chapter will use to describe different
aspects of experimental designs. Much of this will be familiar from Chapter 2, with a few new additions. First, we
will review the basics.
Recall that a variable is any factor that has more than one value. For example, height is a variable because people can
be short, tall, or anywhere in between. Depression is a variable because people can experience a wide range of
symptoms, from mild to severe. The independent variable (IV) is the variable that is manipulated by the
experimenter to test hypotheses about cause. The dependent variable (DV) is the variable that is measured by the
experimenter to assess the effects of the independent variable. For example, in an experiment testing the hypothesis
that fear causes prejudice, fear would be the independent variable and prejudice would be the dependent variable.
To keep these terms straight, it is helpful to think of the main goal of experimental designs. That is, we test
hypotheses about cause by manipulating an independent variable and then looking for changes in a dependent
variable. This means that we think the independent variable causes changes in the dependent variable; for example,
we hypothesize that fear causes changes in prejudice.
When we manipulate an independent variable, we will always have two or more versions of the variable; this is what
distinguishes experiments from, say, structured observational studies. One common way to describe the versions of
the IV is in terms of different groups, or conditions. The most basic experiments have two conditions: The
experimental condition receives a treatment designed to test the hypothesis, while the control condition does
not receive this treatment. In the fear and prejudice example above, the participants who make up the experimental
condition would be made to feel afraid, while the participants who make up the control condition would not. This
setup allows us to test whether introducing fear to one group of participants leads them to express more prejudice
than the other group of participants, who are not made fearful.
Another common way to describe these versions is in terms of levels of the independent variable. Levels describe
the specific set of circumstances created by manipulating a variable. For example, in the fear and prejudice
experiment, the variable of fear would have two levels—afraid and not afraid. We have countless ways to
operationalize fear in this experiment. One option would be to adopt the technique used by the Stanford social
psychologist Stanley Schachter (1959), who led participants to believe they would be exposed to a series of painful
electric shocks. In Schachter’s study, the painful shocks never happened, but they did induce a fearful state as people
anticipated them. So, those at the “afraid” level of the independent variable might be told to expect these shocks,
while those at the “not afraid” level of the independent variable would not be given this expectation.
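The vocabulary of an independent variable and its levels can be made concrete with a small sketch. The fear-and-prejudice names come from the running example, but the data structure and the exact wording of each operationalization are our own illustration:

```python
# An independent variable with two levels; each level is
# operationalized as a set of circumstances created for participants.
independent_variable = {
    "name": "fear",
    "levels": {
        "afraid": "told to expect a series of painful electric shocks",
        "not afraid": "given no expectation of shocks",
    },
}

# The dependent variable is measured rather than manipulated.
dependent_variable = {"name": "prejudice", "measure": "self-reported prejudice score"}

for level, operationalization in independent_variable["levels"].items():
    print(f"level '{level}': {operationalization}")
```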
At this stage, having two sets of vocabulary terms—”levels” and “conditions”—for the same concept may seem odd.
However, with advanced experimental designs using multiple independent variables, there is a subtle difference in
how these terms are used. As the designs become more complex, it is often necessary to expand IVs to include
several groups and multiple variables. At that point, researchers need different terminology to distinguish between
the versions of one variable and the combinations of multiple variables. The chapter will later return to this
complexity, in the section “Experimental Design.”
5.2 Key Features of Experiments
The overview of designs in Chapter 2 described the overall process of experiments in the following way:
Researchers control the environment as much as possible so that all participants have the same experience. The
researchers then manipulate, or change, one key variable, and then measure the outcomes in another key variable.
This section examines this process in more detail. Experiments can be distinguished from all other designs by three
key features: manipulating variables, controlling the environment, and assigning people to groups.
Manipulating Variables
The most crucial element of an experiment is the researcher’s
manipulation, or change, of some key variable. To study the
effects of hunger, for example, a researcher could manipulate
the amount of food given to the participants, or to study the
effects of temperature, the experimenter could raise and
lower the temperature of the thermostat in the laboratory. In
both cases, recall that the researcher needs a way to
operationalize the concepts (hunger and temperature) into
measurable variables. For example, the experimenter could
define “hungry” as being deprived of food for eight hours,
and define a “hot” room as being 90 degrees Fahrenheit.
Because these factors are under the direct control of the
experimenters, they can feel more confident that changing
them contributes to changes in the dependent variables.
Chapter 2 discussed the main shortcoming of correlational
research: These designs do not allow researchers to make
causal statements. Recall from that chapter (as well as from
Chapter 4) that correlational research is designed to predict
one variable from another. One of the examples in Chapter 2
concerned the correlation between income levels and happiness, with the goal of trying to predict happiness levels
based on knowing people’s income level. If we measure these as they occur in the real world, we cannot say for sure
which variable causes the other. However, we could settle this question relatively quickly with the right experiment.
Suppose we bring two groups into the laboratory and give one group $100 and a second group nothing. If the first
group is happier at the end of the study, it would support the idea that money really does buy happiness. Of course,
this experiment is a rather simplistic look at the connection between money and happiness. Even so, because we
manipulate levels of money, this study would bring us closer to making causal statements about the effects of money.
To manipulate variables, it is necessary to have at least two versions of the variable. That is, to study the effects of
money, we need a comparison group that does not receive money. To study the effects of hunger, we would need
both a hungry and a not-hungry group. Having two versions of the variable distinguishes experimental designs from
the structured observations discussed in Chapter 3 (3.4), in which all participants receive the same set of conditions
in the laboratory. Even the most basic experiment must have two sets of conditions, which are often an experimental
group and a control group. However, as this chapter will later explain, experiments can become much more complex.
A study might have one experimental group and two control groups, or five degrees of food deprivation, ranging
from 0 to 12 hours without food. Decisions about the number and nature of these groups will depend on
consideration of both the hypotheses and previous literature.
Researchers have three options for manipulating variables. First,
environmental manipulations involve changing some aspect of the
setting. Environmental manipulations are perhaps the most common in
psychology studies, and they include everything from varying the room
temperature to varying the amount of money people receive. The key is
to change the way that different groups of people experience their time
in the laboratory—it is either hot or cold, and they either receive or do
not receive $100.
Second, instructional manipulations involve changing the way a task is
described to change participants’ mindsets. For example, a researcher
might give the same math test to all participants but describe it to one group as an “intelligence test” and to
another group as a “problem-solving task.” Because an intelligence test is thought to have implications
for life success, the experimenter might expect participants in that group
to be more nervous about their scores.
Finally, an invasive manipulation involves taking measures to change
internal, physiological processes; it is usually conducted in medical
settings. For example, studies of new drugs involve administering the
drug to volunteers to determine whether it has an effect on some
physical or psychological symptom. Alternatively, studies of
cardiovascular health often involve having participants run on a
treadmill to measure how the heart functions under stress.
The rule that we must manipulate a variable has one qualification. In
many experiments, researchers divide participants based on a
preexisting difference (e.g., gender) or personality measures (e.g., self-
esteem or neuroticism) that capture stable individual differences among
people. The idea behind these personality measures is that someone
scoring high on a measure of neuroticism (for example) would be
expected to be more neurotic across situations than someone scoring
lower on the measure. Using this technique allows a researcher to compare how, for example, men and women or
people with high and low self-esteem respond to manipulations.
When researchers use preexisting differences in an experimental context, they are referred to as quasi-
independent variables—”quasi,” or “nearly,” because they are being measured, not manipulated, by the
experimenter, and thus do not meet the criteria for a regular independent variable. In fact, variables used in this way
are things that cannot be manipulated by an experimenter—either for practical or ethical reasons—including
gender, race, age, eye color, religion, and so forth. Instead, these are treated as independent variables in that
participants are divided into groups along these variables (e.g., male versus female; Catholic versus Protestant
versus Muslim).
Because these variables are not manipulated, an experimenter cannot make causal statements about them. For a
study to count as an experiment, these quasi-independent variables would have to be combined with a true
independent variable. This could be as simple as comparing how men and women respond to a new antidepressant
drug—gender would be quasi-independent while drug type would be a true independent variable.
Sometimes the line between true and quasi-experiments can be subtle. Imagine we want to study the effects on
people’s persistence at a second task based on winning versus losing a contest. In a quasi-experimental approach,
we could have two participants play a game, resulting in a natural winner and loser, and then compare how long
each one stuck with the next game. The approach’s limitation is that some preexisting condition might have affected
winning and losing the first game. Perhaps the winners had more self-confidence and patience at the start. However,
we could improve the design to be a true experiment by having participants play a rigged game against a
confederate, thereby causing participants either to win or lose. In this case, we would be manipulating winning and
losing, and preexisting differences would be averaged out across the groups (more on this later in the chapter).
Controlling the Environment
The second important element of experimental designs is the researcher’s high degree of control over the
environment. In addition to manipulating variables, an experimenter has to ensure that the other aspects of the
environment are the same for all participants. For instance, if we were interested in the effects of temperature on
people’s mood, we could manipulate temperature levels in the laboratory so that some people experienced warmer
temperatures and other people cooler temperatures. However, it is equally important to make sure that other
potential influences on mood are the same for both groups. That is, we would want to make sure that the “warm”
and “cool” groups were tested in the same room, around the same time of day, and by similar experimenters.
The overall goal, then, is to control extraneous variables, or variables that add noise to the hypothesis test. In
essence, the more researchers can control extraneous variables, the more con�idence they can have in the results of
the hypothesis test. As the section “Validity and Control” will discuss, these extraneous variables can have different
degrees of impact on a study. Imagine we conduct the study on temperature and mood, and all of our participants
are in a windowless room with a flickering fluorescent light. This environment would likely influence people’s mood
—making everyone a little bit grumpy—but it causes fewer problems for our hypothesis test because it affects
everyone equally. Table 5.1 shows hypothetical data from two variations of this study, using a 10-point scale to
measure mood ratings. In the top row, participants were in a well-lit room; notice that participants in the cooler
room reported being in a better mood (i.e., an 8 versus a 5). In the bottom row, all participants were in the
windowless room with flickering lights. These numbers suggest that people were still in a better mood in the cooler
room (5) than a warm room (2), but the flickering fluorescent light had a constant dampening effect on everyone’s
mood.
Table 5.1: Influence of an extraneous variable

                                       Cool Room   Warm Room
Variation 1: Well-Lit                      8           5
Variation 2: Flickering Fluorescent        5           2
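The arithmetic behind Table 5.1 can be made explicit with a short illustrative snippet; the three-point “flicker penalty” is a hypothetical value chosen to match the table, not a figure from the text:

```python
# Hypothetical mood ratings on a 10-point scale, taken from Table 5.1.
cool_well_lit, warm_well_lit = 8, 5

# A flickering light that affects everyone equally subtracts the same
# constant (here, 3 points) from both groups' mood ratings.
flicker_penalty = 3
cool_flicker = cool_well_lit - flicker_penalty
warm_flicker = warm_well_lit - flicker_penalty

# The temperature effect (cool minus warm) is identical in both variations,
# which is why a constant extraneous influence causes fewer problems.
effect_well_lit = cool_well_lit - warm_well_lit
effect_flicker = cool_flicker - warm_flicker
```

Because the extraneous variable shifts every participant by the same amount, the between-group difference that the hypothesis test depends on is preserved.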
Assigning People to Conditions
The third key feature of experimental designs is that the researcher can assign people to receive different
conditions, or versions, of the independent variable. This is an important piece of the experimental process:
Experimenters not only control the options—warm versus cool room, $100 versus no money, etc.—but they also
control which participants get each option. Whereas a correlational design might assess the relationship between
current mood and choosing the warm room, an experimental design will assign some participants to the warm room
and then measure the effects on their mood. In other words, experimenters are able to make causal statements
because they cause things to happen to a particular group of people.
The most common, and most preferable, way to assign people to conditions is through a process called random
assignment. An experimenter who uses random assignment makes a separate decision for each participant as to
which group he or she will be assigned to before the participant arrives. As the term implies, this decision is made
randomly—by flipping a coin, using a random number table (for an example, see
http://stattrek.com/tables/random.aspx), drawing numbers out of an
envelope, or even simply alternating back and forth between experimental conditions. The overall goal is to try to
balance preexisting differences among people, as Figure 5.2 illustrates. So, for example, some people might generally
be more comfortable in warm rooms, while others might be more comfortable in cold rooms. If each person who
shows up for the study has an equal chance of being in either group, then the groups in the sample should reflect the
same distribution of differences as the population.
Figure 5.2: Random assignment
The 24 participants in our sample consist of a mix of happy and sad people. The goal of
random assignment is to have these differences distributed equally across the experimental
conditions. Thus, the two groups on the right each consist of six happy and six sad people,
and our random assignment was successful.
1/12/2018 Imprimir
https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-32,navpoint-33,navpoint-34,navpoint-35,navpoint-36,navpoint-37,navpoint-38… 7/40
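The per-participant, coin-flip style of random assignment described above can be sketched in a few lines of Python; this is a hypothetical illustration, and the participant labels and condition names are invented:

```python
import random

# 24 hypothetical participants, matching the Figure 5.2 setup.
participants = [f"P{i}" for i in range(1, 25)]

# Random assignment: make an independent, random decision for each
# participant -- the software equivalent of flipping a coin.
assignments = {p: random.choice(["warm room", "cool room"]) for p in participants}

warm_group = [p for p in participants if assignments[p] == "warm room"]
cool_group = [p for p in participants if assignments[p] == "cool room"]
```

Note that independent coin flips do not guarantee equal group sizes; researchers who need equal groups often shuffle a balanced list of condition labels instead, which is equally random but always splits the sample evenly.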
Forming groups through random assignment also has the significant advantage of helping to avoid bias in the
selection and assignment of subjects. For example, it would be a bad idea to assign people to groups based on a first
impression of them because participants might be placed in the cold room if they arrived at the laboratory dressed
in warm clothing. Experimenters who make decisions about condition assignments ahead of time can be more
confident that the independent variable is responsible for changes in the dependent variable.
Worth highlighting here is the difference between random selection and random assignment. Random selection
means that the sample of participants is chosen at random from the population, as with the probability sampling
methods discussed in Chapter 4. However, most psychology experiments use a
convenience sample of individuals who volunteer to complete the study. This means that the sample is often far from
fully random. However, a researcher can still make sure that the study involves random assignment to groups, so that
each condition contains an equal representation of the sample.
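The contrast between random selection and random assignment can be sketched as follows; this is a hypothetical illustration (the population, sample size, and condition names are invented), not an example from the text:

```python
import random

random.seed(7)  # fixed seed so this sketch is reproducible

# A hypothetical population of 100 people.
population = [f"Person{i}" for i in range(1, 101)]

# Random selection: the sample itself is drawn at random from the population.
sample = random.sample(population, 20)

# Random assignment: however the sample was obtained (even a convenience
# sample), shuffle it and split it evenly between the two conditions.
random.shuffle(sample)
warm_group, cool_group = sample[:10], sample[10:]
```

The two operations are independent: a study can use a non-random convenience sample and still achieve fully random assignment, which is the typical situation in laboratory experiments.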
In some cases—most notably, when samples are small—random assignment may not be sufficient to balance an
important characteristic that might affect the results of a particular study. Imagine conducting a study that
compared two strategies for teaching students complex math skills. In this example, it would be especially important
to make sure that both groups contained a mix of individuals with, say, average and above-average intelligence. For
this reason, the experimenter would need to take extra steps to ensure that intelligence was equally distributed
between the groups, which can be accomplished with a variation on random assignment called matched random
assignment. This kind of assignment requires the experimenter to obtain scores on an important matching variable
—in this case, intelligence—rank participants based on the matching variable, and then randomly assign people to
conditions. Figure 5.3 shows how this process would unfold in our math-skills study. First, the researcher gives
participants an IQ test to measure preexisting differences in intelligence. Second, the experimenter ranks
participants based on these scores, from highest to lowest. Third, the experimenter moves down this list in order
and randomly assigns each participant to one of the conditions. This process still contains an element of random
assignment, but adding the extra step of rank ordering ensures a more balanced distribution of intelligence test
scores across the conditions.
Figure 5.3: Matched random assignment
The 20 participants in our sample represent a mix of very high, average, and very low
intelligence test scores (measured 1–100). The goal of matched random assignment is to
ensure that this variation is distributed equally across the two conditions. The experimenter
would first rank participants by intelligence test scores (top box), and then distribute these
participants alternately between the conditions. The end result is that both groups (lower
boxes) contain a good mix of high, average, and low scores.
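The three steps of matched random assignment can also be sketched in code; this is a minimal, hypothetical illustration using the 20-participant setup from Figure 5.3, with invented participant labels and scores:

```python
import random

random.seed(3)  # fixed seed so this sketch is reproducible

# Step 1: obtain scores on the matching variable (here, intelligence
# test scores, 1-100) for 20 hypothetical participants.
scores = {f"P{i}": random.randint(1, 100) for i in range(1, 21)}

# Step 2: rank participants on the matching variable, highest to lowest.
ranked = sorted(scores, key=scores.get, reverse=True)

# Step 3: move down the ranking one matched pair at a time, randomly
# splitting each pair between the two conditions.
condition_a, condition_b = [], []
for i in range(0, len(ranked), 2):
    pair = ranked[i:i + 2]
    random.shuffle(pair)
    condition_a.append(pair[0])
    condition_b.append(pair[1])
```

Because each adjacent pair in the ranking is split between the conditions, both groups end up with a similar spread of high, average, and low scores, while the coin flip within each pair preserves the element of random assignment.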
5.3 Experimental Validity
Chapter 2 discussed the concept of validity, or the degree to which the measures used in a study capture the
constructs that they were designed to capture. That is, a measure of happiness needs to capture differences in
people’s levels of happiness. This section returns to the subject of validity in an experimental context, assessing
whether experimental results demonstrate the causal relationships that researchers think they are demonstrating.
We will discuss two types of validity that are relevant to experimental designs. The first is internal validity, which
assesses the degree to which results can actually be attributed to the independent variables. The second is external
validity, which assesses how well the results generalize to situations beyond the specific conditions laid out in the
experiment. Taken together, internal and external validity provide a way to assess the merits of an experiment.
However, each kind has its own threats and remedies, as the following sections explain.
Internal Validity
To have a high degree of internal validity, experimenters strive for maximum control over extraneous variables. That
is, they try to design experiments so that the independent variable is the only cause of differences between groups.
But, of course, no study is ever perfect, and some degree of error is always present. In many cases, errors are the
result of unavoidable random causes, such as the health or mood of the participants on the day of the experiment. In
other cases, errors are due to factors that are, in fact, within the experimenter’s control. This section focuses on
several of these more manageable threats to internal validity and discusses strategies for reducing their influence.
Experimental Confounds
To avoid threats to the internal validity of an experiment, it is important to control and minimize the influence of
extraneous variables that might add noise to a hypothesis test. In many cases, extraneous variables can be
considered relatively minor nuisances, as when the mood experiment was inadvertently run in a depressing room.
Now, though, suppose we conduct our study on temperature and mood, and due to a lack of careful planning,
accidentally place all of the “warm room” participants in a sunny room, and the “cool room” participants in a
windowless room. We might very well �ind that the warm-room participants are in a much better mood. Still, is this
the result of warm temperatures or the result of exposure to sunshine? Unfortunately, we would be unable to tell the
difference because of a confounding variable (or confound)—a variable that changes systematically with the
independent variable. In this example, room lighting is confounded with room temperature because all of the warm-
room participants are also exposed to sunshine, and all of the cool-room participants are not. This confounding
combination of variables leaves us unable to determine which variable actually has the effect on mood. In other
words, because our groups differ in more than one way, we cannot clearly say that the independent variable of
interest (the room) caused the dependent variable (mood) to change.
This observation may seem oversimplified, but the way to
avoid confounds is to be very careful in designing
experiments. By ensuring that groups are alike in every way
but the experimental condition, an experimenter can
generally prevent confounds. Nevertheless, avoiding
confounds is somewhat easier said than done because they
can come from unexpected places. For example, most studies
involve the use of multiple research assistants who manage
data collection and interact with participants. Some of these
assistants might be more or less friendly than others, so it is
important to make sure each of them interacts with
participants in all conditions. If the friendliest assistant
always ran participants in the warm-room group, for
example, assistant friendliness would be confounded
with room temperature. Consequently, the
experimenter would be unable to
separate the influence of the independent variable (the
room) from that of the confound (the research assistant).
Friendliness of the research assistant is a
variable that can affect the outcome of an
experiment.
Selection Bias
Internal validity can also be threatened when groups differ before the manipulation, a condition known as selection
bias. Selection bias causes problems because these preexisting differences might be the driving factor behind the
results. Imagine someone is investigating a new program that will help people stop smoking. The experimenter
might decide to ask for volunteers who are ready to quit smoking and put them through a six-week program. But by
asking for volunteers—a remarkably common error—the researcher gathers a group of people who are already
somewhat motivated to stop smoking. Thus, it is difficult to separate the effects of the new program from the effects
of this preexisting motivation.
One easy way to avoid this problem is through either random or matched random assignment. In the stop-smoking
example, a researcher could still ask for volunteers, but then randomly assign these volunteers either to the new
program or to a control group. Because both groups would consist of people motivated to quit smoking, the effects of
motivation would cancel out. Another way to minimize selection bias is to use the same people in both conditions so that
they serve as their own control. In the stop-smoking example, the experimenter could assign volunteers first to one
program and then to the other. However, this approach might present a problem: Participants who successfully quit
smoking in the �irst program would not bene�it from the second program. This technique is known as a within-
subject design, and we will discuss its advantages and disadvantages in the section “Within-Subject Designs.”
Differential Attrition
Despite researchers’ best efforts at random assignment, they could still have a biased sample at the end of a study as
a result of differential attrition. The problem of differential attrition occurs when subjects drop out of
experimental groups for different reasons. Say we are conducting a study of the effects of exercise on depression
levels. We manage to randomly assign people either to one week of regular exercise or to one week of regular
therapy. At first glance, it appears that the exercise group shows a dramatic drop in depression symptoms. But then
we notice that about one-third of the people in this group dropped out before completing the study. Chances are we
are left with the participants who are most motivated to exercise, to overcome their depression, or both. Thus, it is
difficult to isolate the effects of the independent variable on depression symptoms. Although we cannot prevent
people from dropping out of our study, we can look carefully at those who do. In many cases, researchers can spot a
pattern and use it to guide future research. For example, it may be possible to create a profile of people who dropped
out of the exercise study and use this knowledge to increase retention for the next attempt.
Outside Events
As much as experimenters strive to control the laboratory environment, participants are often in�luenced by events
in the outside world. These events—sometimes called history effects—are often large scale and include political
upheavals and natural disasters. History effects threaten research because they make it difficult to tell whether
participants’ responses are due to the independent variable or to the historical event(s). A paper published by social
psychologist Ryan Brown, now a professor at the University of Oklahoma, offers a remarkable example. Brown et
al.’s paper discussed the effects of receiving different types of affirmative action as people were selected for a
leadership position. The goal was to determine the best way to frame affirmative action to avoid undermining the
recipient’s confidence (Brown, Charnsangavej, Keough, Newman, & Rentfrow, 2000). For about a week during the
data-collection process, students at the University of Texas where the study was being conducted protested on the
school’s main lawn about a controversial lawsuit regarding affirmative-action policies. One side effect of these
protests was that participants arriving for Brown’s study had to pass through a swarm of people holding signs that
either denounced or supported affirmative action. These types of outside events are difficult, if not impossible, to
control. But, because these researchers were aware of the protests, they made a decision to exclude data gathered
from participants during the week of the protests from the study, thus minimizing the effects of outside events.
Expectancy Effects
One final set of threats to internal validity results from the influence of expectancies on people’s behavior. This
influence can cause trouble for experimental designs in three related ways. First, experimenter expectancies can
cause researchers to see what they expect to see, leading to subtle bias in favor of their hypotheses. In a clever
demonstration of this phenomenon, the psychologist Robert Rosenthal asked his graduate students at Harvard
University to train groups of rats to run a maze (Rosenthal & Fode, 1963). He also told them that based on a pretest,
the rats had been classified as either bright or dull. As might be surmised, these labels were pure fiction, but they
still influenced the way that the students treated the rats. Those labeled bright were given more encouragement and
learned the maze much more quickly than rats labeled dull. Rosenthal later extended this line of work to teachers’
expectations of their students (Rosenthal & Jacobson, 1992) and found support for the same conclusion: People
often bring about the results they expect by behaving in a particular way.
One common way to avoid experimenter expectancies is to have participants interact with a researcher who is
“blind” to (i.e., unaware of) the condition to which each participant has been assigned. Blind researchers may be fully
aware of the general research hypothesis, but their behavior is less likely to affect the results if they are unaware of
the specific conditions. In the Rosenthal and Fode (1963) study, the graduate students’ behavior influenced the rats’
learning speed only because the students were aware of the labels bright and dull. If these labels had not been
assigned, the rats would have been treated fairly equally across the conditions.
Second, participants in a research study often behave differently based on their own expectancies about the goals of
the study. These expectancies often develop in response to demand characteristics, or cues in the study that lead
participants to guess the hypothesis. In a well-known study conducted at the University of Wisconsin, psychologists
Leonard Berkowitz and Anthony LePage (1967) found that participants would behave more aggressively—by
delivering electric shocks to another participant—if a gun was in the room than if there were no gun present. This
finding has some clear implications for gun-control policies, suggesting that the mere presence of guns increases the
likelihood of gun violence. However, a common critique of this study contends that participants may have quickly
clued in to its purpose and figured out how they were “supposed” to behave. That is, the gun served as a demand
characteristic, possibly making participants act more aggressively because they thought the researchers expected
them to do so.
To minimize demand characteristics, researchers use a variety of techniques, all of which attempt to hide the true
purpose of the study from participants. One common strategy is to use a cover story, or a misleading statement
about what is being studied. Chapter 1 (1.3) discussed Milgram’s famous obedience studies, which discovered that
people were willing to obey orders to deliver dangerous levels of electric shocks to other people. To disguise the
purpose of the study, Milgram described it to participants as a study of punishment and learning. To give another
example, Ryan Brown and colleagues (2000) presented their af�irmative-action study as a study of leadership styles.
These cover stories aimed to give participants a compelling explanation for what they experienced during the study
and to direct their attention away from the research hypothesis.
Another strategy for avoiding demand characteristics is to use the unrelated-experiments technique, which leads
participants to believe that they are completing two different experiments during one laboratory session. The
experimenter can use this bit of deception to present the independent variable during the first experiment and then
measure the dependent variable during the second experiment. For example, a study by Harvard psychologist
Margaret Shih and colleagues (Shih, Pittinsky, & Ambady, 1999) recruited Asian-American females and asked them
to complete two supposedly unrelated studies. In the first, they were asked to read and form impressions of one of
two magazine articles; these articles were designed to make them focus on either their Asian-American identity or
their female identity. In the second experiment, they were asked to complete a math test as quickly as possible. The
goal of this study was to examine the effects of priming different aspects of identity on math performance. Based on
previous research, these authors predicted that priming an Asian-American identity would remind participants of
positive stereotypes regarding Asians and math performance, whereas priming a female identity would remind
participants of negative stereotypes regarding women and math performance. As researchers expected, priming an
Asian-American identity led this group of participants to do better on a math test than did priming a female identity.
The placebo effect can test whether
alcohol affects behavior, or whether
people just expect it to and exhibit
changed behavior based on their
expectations.
The unrelated-experiments technique was especially useful for this study because it kept participants from
connecting the independent variable (magazine article prime) with the dependent variable (math test).
A final way in which expectancies shape behavior is the placebo effect,
meaning that change can result from the mere expectation that change
will occur. Imagine we want to test the hypothesis that alcohol causes
people to become aggressive. One relatively easy way to do this would be
to give alcohol to a group of volunteers (aged 21 and older) and then
measure how aggressively they behave in response to being provoked.
The problem with this approach is that people also expect alcohol to
change their behavior, and so we might see changes in aggression simply
because of these expectations. Fortunately, the problem has an easy
solution: add a placebo control group to the study that mimics the
experimental condition in every way but one. In this case, we might tell
all participants that they will be drinking a mix of vodka and orange juice
but only add vodka to half of the participants’ drinks. The orange-juice-
only group serves as our placebo control. Any differences between this
group and the alcohol group can be attributed to the alcohol itself.
External Validity
To have a high degree of external validity in experiments, researchers
strive for maximum realism in the laboratory environment. External
validity means that the results extend beyond the particular set of
circumstances created in a single study. Recall that science is a
cumulative discipline and that knowledge grows one study at a time.
Thus, each study is more meaningful: 1) to the extent that it sheds light
on a real phenomenon; and 2) to the extent that the results generalize to
other studies. This section examines each of these criteria separately.
Mundane Realism
The first component of external validity is the extent to which an experiment captures the real-world phenomenon
under study. Inspired by a string of school shootings in the 1990s, one popular question in the area of aggression
research asks whether rejection by a peer group leads to aggression. That is, when people are rejected from a group,
do they lash out and behave aggressively toward the members of that group? Researchers must find realistic ways to
manipulate rejection and measure aggression without infringing on participants’ welfare. Given the need to strike
this balance, how real can conditions be in the laboratory? How do we study real-world phenomena without
sacrificing internal validity?
The answer is to strive for mundane realism, meaning that the research replicates the psychological conditions of
the real-world phenomenon (sometimes referred to as ecological validity). In other words, we need not recreate the
phenomenon down to the last detail; instead, we aim to make the laboratory setting feel like the real-world
phenomenon. Researchers studying aggressive behavior and rejection have developed some rather clever ways of
doing this, including allowing participants to administer loud noise blasts or serve large quantities of hot sauce to
those who reject them. Psychologically, these acts feel like aggressive revenge because participants are able to lash
out against those who rejected them—with the intent of causing harm—even though the behaviors themselves may
differ from the ways people exact revenge in the real world.
In a 1996 study, Tara MacDonald and her colleagues at Queen’s University in Ontario, Canada, examined the
relationship between alcohol and condom use (MacDonald, Zanna, & Fong, 1996). The authors were intrigued by a
puzzling set of real-world data: Most people self-reported that they would use condoms when engaging in casual
sex, but actual rates of unprotected sex (i.e., having sexual intercourse without a condom) were also remarkably
high. In this study, the authors found that alcohol was a key factor in causing “common sense to go out the window”
(p. 763), resulting in a decreased likelihood of condom use. But how on earth might they study this phenomenon in
the laboratory? In the authors’ words, “even the most ambitious of scientists would have to conclude that it is
impossible to observe the effects of intoxication on actual condom use in a controlled laboratory setting” (p. 765).
To solve this dilemma, MacDonald and colleagues developed a clever technique for studying people’s intentions to
use condoms. Participants were randomly assigned to either an alcohol or placebo condition, and then they viewed a
video depicting a young couple faced with the dilemma of whether to have unprotected sex. At the key decision
point in the video, the tape was stopped and participants were asked what they would do in the situation. As
predicted, participants who were randomly assigned to consume alcohol said they would be more willing to proceed
with unprotected sex. While this laboratory study does not capture the full experience of making decisions about
casual sex, it does a pretty nice job of capturing the psychological conditions involved.
Generalizing Results
The second component of external validity, generalizability, refers to the extent to which the results extend to
other studies by using a wide variety of populations and a wide variety of operational de�initions (sometimes
referred to as population validity). If we conclude that rejection causes people to become more aggressive, for
example, this conclusion should ideally carry over to other studies of the same phenomenon, studies that use
different ways of manipulating rejection and different ways of measuring aggression. If we want to conclude that
alcohol reduces the intention to use condoms, we would need to test this relationship in a variety of settings—from
laboratories to nightclubs—using different measures of intentions.
Thus, each individual study researchers conduct is limited in its conclusions. For a particular idea to take hold in the
scienti�ic literature, it must be replicated, or repeated in different contexts. Replication can take one of four forms.
First, exact replication involves trying to recreate the original experiment as closely as possible to verify the
findings. This type of replication is often the first step following a surprising result, and it helps researchers to gain
more confidence in the patterns.
The second and much more common method, conceptual replication, involves testing the relationship between
conceptual variables using new operational de�initions. Conceptual replications would include testing aggression
hypotheses using new measures or examining the link between alcohol and condom use in different settings. For
example, rejection might be operationalized in one study by having participants be chosen last for a group project. A
conceptual replication might take a different approach, operationalizing rejection by having participants be ignored
during a group conversation or voted out of the group. Likewise, a conceptual replication might change the
operationalization of aggression, with one study measuring the delivery of loud blasts of noise and another
measuring the amount of hot sauce that people give to their rejecters. Each variation studies the same concept
(aggression or rejection) but uses slightly different operationalizations. If all of these variations yield similar results,
this further supports the underlying ideas—in this case, that rejection causes people to be more aggressive.
The third method, participant replication, involves repeating the study with a new population of participants.
These types of replication are usually driven by a compelling theory as to why the two populations differ. For
example, we might reasonably hypothesize that the decision to use condoms is guided by a different set of
considerations among college students than among older, single adults. Or, we might hypothesize that different
cultures around the world might have different responses to being rejected from a group.
Finally, constructive replication re-creates the original experiment but adds elements to the design. These
additions are typically designed to either rule out alternative explanations or extend knowledge about the variables
under study. Our rejection and aggression example might compare the impact of being rejected by a group versus by
an individual.
Internal Versus External Validity
This chapter has focused on two ways to assess validity in the context of experimental designs. Internal validity
assesses the degree to which results can be attributed to independent variables; external validity assesses how well
results generalize beyond the speci�ic conditions of the experiment. In an ideal world, studies would have a high
degree of both of these. That is, we would feel completely confident that our independent variable was the only
cause of differences in our dependent variable, and our experimental paradigm would perfectly capture the real-
world phenomenon under study.
Reality, though, often demands a trade-off between internal and external validity. In MacDonald et al.’s (1996) study
on condom use, the researchers sacrificed some realism in order to conduct a tightly controlled study of
participants’ intentions. In Berkowitz and LePage’s (1967) study on the effect of weapons, the researchers risked the
presence of a demand characteristic in order to study reactions to actual weapons. These types of trade-offs are
always made based on the goals of the experiment.
Research: Applying Concepts
Balancing Internal Versus External Validity
To give you a better sense of how researchers make the compromises involving internal and external validity,
consider the following fictional scenarios.
Scenario 1—Time Pressure and Stereotyping
Dr. Bob is interested in whether people are more likely to rely on stereotypes when they are in a hurry. In a
well-controlled laboratory experiment, he asks participants to categorize ambiguous shapes as either
squares or circles, and half of these participants are given a short time limit to accomplish the task. The
independent variable is the presence or absence of time pressure, and the dependent variable is the extent to
which people use stereotypes in their classification of ambiguous shapes. Dr. Bob hypothesizes that people
will be more likely to use stereotypes when they are in a hurry because they will have fewer cognitive
resources to consider carefully all aspects of the situation. Dr. Bob takes great care to have all participants
meet in the same room. He uses the same research assistant every time, and the study is always conducted in
the morning. Consistent with his hypothesis, Dr. Bob finds that people seem to use shape stereotypes more
under time pressure.
The internal validity of this study appears high—Dr. Bob has controlled for other influences on participants’
attention span by collecting all of his data in the morning. He has also minimized error variance by using the
same room and the same research assistant. In addition, Dr. Bob has created a tightly controlled study of
stereotyping through the use of circles and squares. Had he used photographs of people (rather than shapes),
the attractiveness of these people might have influenced participants’ judgments. The study, however, has a
trade-off: By studying the social phenomenon of stereotyping using geometric shapes, Dr. Bob has removed
the social element of the study, thereby posing a threat to mundane realism. The psychological meaning of
stereotyping shapes is rather different from the meaning of stereotyping people, which makes this study
relatively low in external validity.
Scenario 2—Hunger and Mood
Dr. Jen is interested in the effects of hunger on mood; not surprisingly, she predicts that people will be
happier when they are well fed. She tests this hypothesis with a lengthy laboratory experiment, requiring
participants to be confined to a laboratory room for 12 hours with very few distractions. Participants have
access to a small pile of magazines to help pass the time. Half of the participants are allowed to eat during
this time, and the other half are deprived of food for the full 12 hours. Dr. Jen—a naturally friendly person—
collects data from the food-deprivation group on a Saturday afternoon, while her grumpy research assistant,
Mike, collects data from the well-fed group on a Monday morning. Her independent variable is food
1/12/2018 Imprimir
https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-32,navpoint-33,navpoint-34,navpoint-35,navpoint-36,navpoint-37,navpoint-3… 15/40
deprivation, with participants either not deprived of food or deprived for 12 hours. Her dependent variable
consists of participants’ self-reported mood ratings. When Dr. Jen analyzes the data, she is shocked to
discover that participants in the food-deprivation group are much happier than those in the well-fed group.
Compared to our first scenario, this study seems high on external validity. To test her predictions about food
deprivation, Dr. Jen actually deprives her participants of food. One possible problem with external validity is
that participants are confined to a laboratory setting during the deprivation period with only a small pile of
magazines to read. That is, participants may be more affected by hunger when they do not have other things
to distract them. In the real world, people are often hungry but distracted by paying attention to work, family,
or leisure activities. Dr. Jen, though, has sacrificed some external validity for the sake of controlling how
participants spend their time during the deprivation period. The larger problem with her study has to do
with internal validity. Dr. Jen has accidentally confounded two additional variables with her independent
variable: Participants in the deprivation group have a different experimenter and data are collected at a
different time of day. Thus, Dr. Jen’s surprising results most likely reflect that everyone is in a better mood on
Saturday than on Monday and that Dr. Jen is more pleasant to spend 12 hours with than Mike is.
Scenario 3—Math Tutoring and Graduation Rates
Dr. Liz is interested in whether specialized math tutoring can help increase graduation rates among female
math majors. To test her hypothesis, she solicits female volunteers for a math-skills workshop by placing
fliers around campus, as well as by sending email announcements to all math majors. The independent
variable is whether participants are in the math skills workshop, and the dependent variable is whether
participants graduate with a math degree. Those who volunteer for the workshop are given weekly skills
tutoring, along with informal discussion groups designed to provide encouragement and increase motivation.
At the end of the study, Dr. Liz is pleased to see that participants in the workshops are twice as likely as
nonparticipants to stick with the major and graduate.
The obvious strength of this study is its external validity. Dr. Liz has provided math tutoring to math majors,
and she has observed a difference in graduation rates. Thus, this study is very much embedded in the real
world. However, this external validity comes at a cost to internal validity. The study’s biggest flaw is that Dr.
Liz has recruited volunteers for her workshops, resulting in selection bias for her sample. People who
volunteer for extra math tutoring are likely to be more invested in completing their degree and might also
have more time available to dedicate to their education. Dr. Liz would also need to be mindful of how many
people drop out of her study. If significant numbers of participants withdraw, she could have a problem with
differential attrition, such that only the most motivated people stayed with the workshops. Dr. Liz can fix this study
with relative ease by asking for volunteers more generally and then randomly assigning these volunteers to
take part in either the math tutoring workshops or a different type of workshop. While the sample might still
be less than random, Dr. Liz would at least have the power to assign participants to different groups.
5.4 Experimental Design
The process of designing experiments boils down to deciding what to manipulate and how to do it. This section
covers two broad issues related to experimental design: deciding how to structure the levels, or different versions of
an independent variable, and deciding on the number of independent variables necessary to test the hypotheses.
While these decisions may seem tedious, they are at the crux of designing successful experiments, and are, therefore,
the key to performing successful tests of hypotheses.
Levels of the Independent Variable
The primary goal in designing experiments is to ensure that the levels of independent variables are equivalent in
every way but one. This is what allows researchers to make causal statements about the effects of that single change.
These levels can be formed in one of two broad ways: representing two distinct groups of people or representing the
same group of people over time.
Between-Subject Designs
In most of the examples discussed so far, the levels of independent variables have represented two distinct groups—
participants are in either the control group or the experimental group. This type of design is referred to as a
between-subject design because the levels differ between one subject and the next. Each participant who enrolls in
the experiment is exposed to only one level of the independent variable—for example, either the experimental or the
control group. Most of the examples so far have been illustrations of between-subject designs: participants receive
either alcohol or a placebo; students read an article designed to prime either their Asian or their female identity;
and graduate students train rats that are falsely labeled either bright or dull. The “either-or” between-subject
approach is common and has the advantage of using distinct groups to represent each level of the independent
variable. In other words, participants who are asked to consume alcohol are completely distinct from those asked to
consume the placebo drink. However, the between-subject approach is only one option for structuring the levels of
the independent variable. This section examines two additional ways to structure these levels.
Within-Subject Designs
In some cases, the levels of the independent variable can represent the same participants at different time periods.
This type of design is referred to as a within-subject design because the levels differ within individual participants.
Each participant who enrolls in the experiment would be exposed to all levels of the independent variable. That is,
every participant would be in both the experimental and the control group. Within-subject designs are often used to
compare changes over time in response to various stimuli. For example, a researcher might measure anxiety
symptoms before and after people are locked in a room with a spider, or measure depression symptoms before and
after people undergo drug treatment.
Within-subject designs have two main advantages over between-subject designs. First, because the same people
constitute both levels of the IV, these designs require fewer participants. Suppose we decide to collect data from 20
participants at each level of an IV. In a between-subject design with three levels, we would need 60 people. However,
if we run the same experiment as a within-subject design—exposing the same group of people to three different sets
of circumstances—we would need only 20 people. Thus, within-subject designs are often a good way to conserve
resources.
Second, participants also serve as their own control group, allowing the researcher to minimize a major source of
error variance. Remember that one key feature of experimental design is the researcher’s power to assign people to
groups to distribute subject differences randomly across the levels of the IV. Using a within-subject design solves the
problem of subject differences in another way, by examining changes within people. For instance, in the study of
spiders and anxiety, some participants are likely to have higher baseline anxiety than others. By measuring changes
in anxiety in the same group of people before and after spider exposure, we are able to minimize the effects of
individual differences.
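The logic of using participants as their own control can be sketched numerically. Here is a minimal Python illustration with hypothetical anxiety ratings (all values are invented for this example):

```python
# Illustrative sketch (hypothetical data): within-subject change scores
# remove stable individual differences such as baseline anxiety.

# Anxiety ratings (0-10) for five participants before and after spider exposure.
before = [2.0, 6.0, 3.5, 7.0, 4.0]   # baselines differ widely across people
after  = [4.5, 8.5, 6.0, 9.5, 6.5]   # but each person rises by a similar amount

# Analyzing each person's change ignores where they started.
changes = [a - b for b, a in zip(before, after)]
mean_change = sum(changes) / len(changes)
print(changes)       # [2.5, 2.5, 2.5, 2.5, 2.5]
print(mean_change)   # 2.5
```

Even though the raw ratings vary widely from person to person, the change scores are tightly clustered, which is exactly the error variance a within-subject design is meant to remove.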
[Image caption: Carryover effects can be understood through the example of monitoring people’s reactions to different film clips. How they feel about one image may influence how they react to the next image.]
Disadvantages of Within-Subject Designs
Within-subject designs also have two clear disadvantages compared to between-subject designs. First, they pose the
risk of carryover effects, in which the effects of one level are still present when another level is introduced. Because
the same people are exposed to all levels of the IV, it can be difficult to separate the effects of one level from the
effects of the others. One common paradigm in emotion research is to show participants several film clips that elicit
different types of emotion. People might view one clip showing a puppy playing with a blanket, another showing a
child crying, and another showing a surgical amputation. Even without seeing these clips in full color, we can
imagine that it would be hard to shake off the disgust triggered by the amputation to experience the joy triggered by
the puppy.
When researchers use a within-subject design, they take
steps to minimize carryover effects. In studies of emotion, for
example, researchers typically show a brief neutral clip—like
waves rolling onto a beach—after each emotional clip, so that
participants experience each emotion after viewing a benign
image. Another simple technique is to collect data from the
baseline control condition first whenever possible. In the
study of spiders and anxiety, it would be important to
measure baseline anxiety at the start of the experiment
before exposing people to spiders. Once people have been
surprised by a spider, it will be hard to get them to relax
enough to collect control ratings of anxiety.
Second, within-subject designs risk order effects, meaning
that the order in which levels are presented can moderate
their effects. Order effects fall into two categories. The
practice effect happens when participants’ performance
improves over time simply due to repeated attempts. This is a
particular problem in studies that examine learning. Say we
use a within-subject design to compare two techniques for
teaching people to solve logic problems. Participants would learn technique A, then take a logic test, then learn
technique B, and then take a second logic test. The possible problem is that participants will have had more
opportunities to practice logic problems by the time they take the second test. This makes it difficult to separate the
effects of practicing the logic problems from the effects of using different teaching techniques.
The flip side of practice effects is the phenomenon of the fatigue effect, which happens when participants’
performance decreases over time due to repeated testing. Imagine running a variation of the above experiment,
teaching people different ways to improve their reaction time. Participants might learn each technique and have
their reaction time tested several times after each one. The problem is that people gradually start to tire, and their
reaction times slow down due to fatigue. Thus, it would be difficult to separate the effects of fatigue from the effects
of the different teaching techniques.
The result of both types of order effects is to confound the order of presentation with the level of the
independent variable. Fortunately, researchers have a relatively easy way to address carryover and order effects alike:
a process called counterbalancing. Counterbalancing involves varying the order of presentation to groups of
participants. The simplest approach is to divide participants into as many groups as there are possible orders of the
conditions. That is, we create a group for each possible order, allowing us to identify the effects of encountering the
conditions in different orders. In the examples above, the learning experiments involved two techniques, A and B. To
counterbalance these techniques across the study, we divide the participants into two groups. We expose one group
to A and then B; we expose the other group to B and then A. When it is time to analyze the data, we will be able to
examine the effects of both presentation order and teaching technique. If the order of presentation made a
difference, then the A/B group would differ from the B/A group in some way.
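A quick sketch of this counterbalancing logic, assuming a simple two-condition study (participant labels are hypothetical):

```python
# Illustrative sketch: assigning participants to counterbalanced orders.
# With techniques A and B there are two possible orders; itertools.permutations
# generalizes this to any number of conditions.
from itertools import permutations

conditions = ["A", "B"]
orders = list(permutations(conditions))   # [('A', 'B'), ('B', 'A')]

# Rotate participants through the orders so each order gets an equal share.
participants = ["P1", "P2", "P3", "P4"]
assignment = {p: orders[i % len(orders)] for i, p in enumerate(participants)}
print(assignment)
# {'P1': ('A', 'B'), 'P2': ('B', 'A'), 'P3': ('A', 'B'), 'P4': ('B', 'A')}
```

Note that the number of orders grows quickly: with two conditions there are 2 orders, but with four conditions there are 24, which is one reason researchers sometimes counterbalance only a subset of possible orders.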
Mixed Designs
The third common way to structure the levels of an IV is using a mixed design, which contains at least one between-
subject variable and at least one within-subject variable. So, in the previous example, participants would be exposed
to both teaching techniques (A and B) but in only one order of presentation. In this case, teaching technique is a
within-subject variable because participants experience both levels, and presentation order is a between-subject
variable because participants experience only one level. Because we have one of each in the overall experiment, it is
a mixed design.
Studies that compare the effects of different drugs commonly use mixed designs. Imagine we want to compare three
new drugs—Drug X, Drug Y, and a placebo control—to determine which has the strongest effects on reducing
depression symptoms. To perform this study, we would want to measure depression symptoms on at least three
occasions: before starting drug treatment, after a few months of taking the drug, and then again after a few months
of stopping the drug (to assess relapse rates). So, our participants would be given one of three possible drugs and
then measured at each of three time periods. In this mixed design, measurement time is a within-subject variable
because participants are measured at all possible times, while the drug is a between-subject variable because
participants experience only one of three possible drugs.
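The structure of this mixed design can be sketched in a few lines of Python (labels are illustrative, not drawn from an actual dataset):

```python
# Illustrative sketch (hypothetical labels): the structure of the mixed design.
# Drug is between-subject (each person gets exactly one), while measurement
# time is within-subject (each person is measured at every time point).
drugs = ["Drug X", "Drug Y", "Placebo"]                   # between-subject factor
times = ["pre-treatment", "post-treatment", "follow-up"]  # within-subject factor

# Each participant contributes one row per time point, all under a single drug.
def observations(participant, drug):
    return [(participant, drug, t) for t in times]

rows = observations("P1", "Drug X")
print(rows)
# [('P1', 'Drug X', 'pre-treatment'), ('P1', 'Drug X', 'post-treatment'),
#  ('P1', 'Drug X', 'follow-up')]
```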
Figure 5.4 shows the hypothetical results of this study. Observe that the placebo pill has no effect on depression
symptoms; depression scores in this group are the same at all three measurements. Drug X appears to cause
significant improvement in depression symptoms; depression scores drop steadily across measurements in this
group. Strangely, Drug Y seems to make depression worse; depression scores increase steadily across measurements
in this group. The mixed design allows us both to track people over time and to compare different drugs in one
study.
Figure 5.4: Example of a mixed-subjects design
Research: Thinking Critically
Outwalking Depression
Follow the link below to an article from Psychology Today, describing a 2011 research study from the Journal
of Psychiatric Research. The study provides new evidence of the benefits of exercise for people with
depression. As you read the article, consider what you have learned so far about the research process, and
then respond to the questions below.
https://www.psychologytoday.com/blog/exercise-and-mood/201107/outwalking-depression
Think About It
1. Identify the following essential aspects of this experimental design:
a) What are the IV and DV in this study?
b) How many levels does the IV have?
c) Is this a between-subjects, within-subjects, or mixed design?
d) Draw a simple table labeling each condition.
2. a) What preexisting differences between groups should the researchers be sure to take into account?
Name as many as you can.
b) How should the researchers assign participants to the conditions in order to ensure that
preexisting differences cannot account for the results?
3. How might expectancy effects in�luence the results of this study? Can you think of any ways to
control for this?
4. Briefly state how you would replicate this study in each of the following ways:
a) exact replication
b) conceptual replication
c) participant replication
d) constructive replication
One-Way Versus Factorial Designs
The second big issue in creating experimental designs is to decide how many independent variables to manipulate.
In some cases, we can test our hypotheses by manipulating a single IV and measuring the outcome—such as giving
people either alcohol or a placebo drink and measuring the intention to use condoms. In other cases, hypotheses
involve more complex combinations of variables. Earlier, the chapter discussed research findings that people tend to
act more aggressively after a peer group has rejected them—a single independent variable. Researchers could,
however, extend this study and ask what happens when people are rejected by members of the same sex versus
members of the opposite sex. We could go one step further and test whether the attractiveness of the rejecters
matters, for a total of three independent variables. These examples illustrate two broad categories of experimental
design, known as one-way and factorial designs.
One-Way Designs
If a study involves assigning people to either an experimental or control group and measuring outcomes, it has a
one-way design, or a design that has only one independent variable with two or more levels. These
tend to be the simplest experiments and have the advantage of testing manipulations in isolation. The majority of
drug studies use one-way designs. These studies compare medical outcomes for people randomly assigned, for
instance, to take the antidepressant drug Prozac or a placebo. Note that a one-way design can
still have multiple levels—in many cases it is preferable to test several different doses of a drug. So, for example, we
might test the effects of Prozac by assigning people to take doses of 5 mg, 10 mg, 20 mg, or a placebo control. The
independent variable would be the drug dose, and the dependent variable would be a change in depression
symptoms. This one-way design would allow us to compare all three of the drug doses to a placebo control, as well
as to test the effects of varying doses of the drug. Figure 5.5 shows hypothetical results from this study. We can see
that even those receiving the placebo showed a drop in depression symptoms, with the 10-mg dose of Prozac
producing the maximum benefit.
Figure 5.5: Comparing drug doses in a one-way design
Factorial Designs
Despite the appealing simplicity of one-way designs, experiments conducted in the field of psychology with only one
IV are relatively rare. The real world is much more complicated, so studies that focus on people’s thoughts, feelings,
and behaviors must somehow capture this complexity. Thus, the rejection-and-aggression example above is not that
farfetched. If a researcher wanted to manipulate the occurrence of rejection, the gender of the rejecters, and the
attractiveness of the rejecters in a single study, the experiment would have a factorial design. Factorial designs are
those that have two or more independent variables, each of which has two or more levels. When experimenters use
a factorial design, their purpose is to observe both the effects of individual variables and the combined effects of
multiple variables.
Factorial designs have their own terminology to re�lect the fact that they include both individual variables and
combinations of variables. The beginning of this chapter explained that the versions of an independent variable are
referred to as both levels and conditions, with a subtle difference between the two. This difference becomes relevant
to the discussion of factorial designs. Specifically, levels refer to the versions of each IV, while conditions refer to the
groups formed by combinations of IVs. Consider one variation of the rejection-and-aggression example from this
perspective: The �irst IV has two levels because participants are either rejected or not rejected. The second IV also
has two levels because members of the same sex or the opposite sex do the rejecting. To determine the number of
conditions in this study, we calculate the number of different experiences that participants can have in the study.
This is a simple matter of multiplying the levels of separate variables, so two multiplied by two, for a total of four
conditions.
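This levels-to-conditions arithmetic can be sketched in a few lines, using the level labels from the example above:

```python
# Illustrative sketch: the number of conditions in a factorial design is the
# product of the number of levels of each IV; itertools.product enumerates them.
from itertools import product

rejection = ["rejected", "not rejected"]    # IV 1: two levels
rejecter  = ["same sex", "opposite sex"]    # IV 2: two levels

conditions = list(product(rejection, rejecter))
print(len(conditions))   # 4, i.e., 2 x 2
print(conditions)
# [('rejected', 'same sex'), ('rejected', 'opposite sex'),
#  ('not rejected', 'same sex'), ('not rejected', 'opposite sex')]
```

Adding the third IV from the example (attractiveness of the rejecters, with two levels) would simply double the count to eight conditions, a 2 × 2 × 2 design.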
Researchers also have a way to quickly describe the number of variables in their design: A two-way design has two
independent variables; a three-way design has three independent variables; an eight-way design has eight
independent variables, and so on. Even more useful, the system of factorial notation offers a simple way to describe
both the number of variables and the number of levels in experimental designs. For instance, we might describe our
design as a 2 × 2 (pronounced “two by two”), which instantly communicates two things: (1) the study uses two
independent variables, indicated by the presence of two separate numbers and (2) each IV has two levels, indicated
by the number 2 listed for each one.
The 2 × 2 Design
One of the most common factorial designs also happens to be the simplest one—the 2 × 2 design. As noted above,
these designs have two independent variables, with two levels each, for a total of four experimental conditions. The
simplicity of these designs makes them a useful way to become more comfortable with some of the basic concepts of
experiments. This section will explore an example of a 2 × 2 and analyze it in detail.
Beginning in the late 1960s, social psychologists developed a keen interest in understanding the predictors of
helping behavior. This interest was inspired, in large part, by the tragedy of Kitty Genovese, who was killed outside
her apartment building while none of her neighbors called the police (Gansberg, 1964). As Chapter 2 (2.1)
discussed, in one representative study, Princeton psychologists John Darley and Bibb Latané examined people’s
likelihood of responding to a staged emergency. Participants were led to believe that they were taking part in a
group discussion over an intercom system, but in reality, all of the other participants were prerecorded. The key
independent variable was the number of other people supposedly present, ranging from two to six. A few minutes
into the conversation, one participant appeared to have a seizure. The recording went like this (actual transcript;
Darley & Latané, 1968):
I could really-er-use some help so if someone would-er-give me a little h-hel-puh-er-er-er c-could
somebody er-er-hel-er-uh-uh-uh [choking sounds] . . . I’m gonna die-er-er-I’m . . . gonna die-er-hel-er-er-
seizure-er [chokes, then quiet].
What do people do in this situation? Do they help? How long does it take? Darley and Latané discovered that two
things happened as the group became larger: People were less likely to help at all, and those who did help took
considerably longer to do so. Researchers concluded from this and other studies that people are less likely to help
when other people are present because the responsibility for helping is “diffused” among the members of the crowd
(Darley & Latané, 1968).
Building on this earlier conclusion, the sociologist Jane Piliavin and her colleagues (Piliavin, Piliavin, & Rodin, 1975)
explored the in�luence of two additional variables on helping behavior. The experimenters staged an emergency on a
New York City subway train in which a person who was in on the study appeared to collapse in pain. Piliavin and her
team manipulated two variables in their staged emergency. The �irst independent variable was the presence or
absence of a nearby medical intern, who could be easily identified in blue scrubs. The second independent variable
was the presence or absence of a large disfiguring scar on the victim’s face. The combination of these variables
resulted in four conditions, as Table 5.2 shows. The dependent variable in this study was the percentage of people
taking action to help the confederate.
Table 5.2: 2 × 2 Design of the Piliavin et al. study

                 No intern    Intern
No scar              1           2
Scar                 3           4
The authors predicted that bystanders would be less likely to help if a perceived medical professional was nearby
since he or she was considered more qualified to help the victim. They also predicted that people would be less
likely to help when the confederate had a large scar because previous research had demonstrated convincingly that
people avoid contact with those who are disfigured or have other stigmatizing conditions (e.g., Goffman, 1963). As
Figure 5.6 (Sample 2 × 2 design: Results from Piliavin et al., 1975) reveals, the results supported these hypotheses.
Both the presence of a scar and the presence of a
perceived medical professional reduced the percentage of people who came to help. Nevertheless, something else is
apparent in these results: When the confederate was not scarred, having an intern nearby led to a small decrease in
helping (from 88% to 84%). However, when the confederate had a large facial scar, having an intern nearby
decreased helping from 72% to 48%. In other words, it seems these variables are having a combined effect on
helping behavior. The next section examines these combined effects more closely.
Main Effects and Interactions
When experiments involve only one independent variable, the
analyses can be as simple as comparing two group means—as did
the example in Chapter 1, which compared the happiness levels of
couples with and without children. But what about cases where the
design has more than one independent variable?
A factorial design has two types of effects: A main effect refers to
the effect of each independent variable on the dependent variable,
averaging values across the levels of other variables. A 2 × 2 design
has two main effects; a 2 × 2 × 2 design has three main effects
because there are three IVs. An interaction occurs when the
variables have a combined effect; that is, the effects of one IV are
different depending on the levels of the other IV. So, applying this
new terminology to the Piliavin et al. (1975) “subway emergency”
study produces three possible results (“possible,” because we
would need to use statistical analyses to verify them):
1. The main effect of scar: Does the presence of a scar affect
helping behavior?
Yes. More people help in the absence of a facial scar. Figure 5.6
indicates that the bars on the left (no scar) are, on average,
higher than those on the right (scar).
2. The main effect of intern: Does the presence of an intern
affect helping behavior?
Yes. More people help when no medical intern is on hand.
Note that in Figure 5.6, the red bars (no intern) are, on
average, higher than the tan bars (intern).
3. The interaction between scar and intern: Does the effect of
one variable depend on the effect of another variable?
Yes. Refer to Figure 5.6 and observe that the presence of a
medical intern matters more when the victim has a facial
scar. In visual terms, the gap between red and tan bars is
much larger in the bars on the right. This indicates an
interaction between scar and intern.
Consider a �ictional example. Imagine we are interested in people’s perceptions of actors in different types of
movies. We might predict that some actors are better suited to comedy and others are better suited to action movies.
A simple experiment to test this hypothesis would show four movies in a 2 × 2 design, using the same two actors in
two movies (for a total of four conditions). The �irst IV would be the movie type, with two levels: action and comedy.
The second IV would be the actor, with two levels: Will Smith and Arnold Schwarzenegger. The dependent variable
would be the ratings of each movie on a 10-point scale. This design produces three possible results:
1. The main effect of actor: Do people generally prefer Will Smith or Arnold Schwarzenegger, regardless of the
movie?
2. The main effect of movie type: Do people generally prefer action or comedy movies, regardless of the actor?
3. The interaction between actor and movie type: Do people prefer each actor in a different kind of movie? (i.e.,
are ratings affected by the combination of actor and movie type?)
After collecting data from a sample of participants, we end up with the following average ratings for each movie,
which Table 5.3 shows.
Table 5.3: Main effects and marginal means: the actor study
Remember that main effects represent the effects of one IV, averaging across the levels of the other IV. To average
across levels, we calculate the marginal means, or the combined mean across levels of another factor. In other
words, the marginal mean for action movies is calculated by averaging together the ratings of both Arnold
Schwarzenegger and Will Smith in action movies. The marginal mean for Arnold Schwarzenegger is calculated by
averaging together ratings of Arnold Schwarzenegger in both action and comedy movies. Performing these
calculations for our 2 × 2 design results in four marginal means, which are presented alongside the participant
ratings in Table 5.3. To verify these patterns would require statistical analyses, but it appears that people have a
slight preference for comedy over action movies, as well as a slight preference for Arnold Schwarzenegger’s acting
over Will Smith’s acting.
What about the interaction? The main hypothesis here posits that some actors perform better in some genres of
movies (e.g., action or comedy) than in others, which suggests that the actor and the movie type have
a combined effect on people’s ratings of the movies. Examining the means in Table 5.3 conveys a sense of this finding,
but it is much easier to appreciate in a graph. Figure 5.7 shows the mean of participants’ ratings across the four
conditions. If we focus �irst on the ratings of Arnold Schwarzenegger, we can see that participants did have a slight
preference for him in action (6) versus comedy (5) roles. Then, examining ratings of Will Smith, we can see that
participants had a strong preference for him in comedy (8) versus action (1.5) roles. Together, this set of means
indicates an interaction between actor and movie type because the effects of one variable depend on another. In
plain English: People’s perceptions of an actor depend on the type of movie in which he or she performs. This
pattern of results nicely �its for the hypothesis that certain actors are better suited to certain types of movie: Arnold
should probably stick to action movies, and Will should de�initely stick to comedies.
Figure 5.7: Interaction in the actor study
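Another way to see the interaction, without a graph, is to compare the simple effect of movie type for each actor: if those two differences are not equal, the variables interact. A minimal sketch using the same four cell means:

```python
# Simple effect of movie type (action minus comedy) for each actor.
arnold_action, arnold_comedy = 6.0, 5.0
will_action, will_comedy = 1.5, 8.0

arnold_effect = arnold_action - arnold_comedy   # +1.0 (slightly better in action)
will_effect = will_action - will_comedy         # -6.5 (much better in comedy)

# With no interaction, these simple effects would be equal;
# a nonzero difference-of-differences signals an interaction.
interaction = arnold_effect - will_effect
print(interaction)  # 7.5
```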
Before moving on to the logic of analyzing experiments, consider one more example from a published experiment. A
large body of research in social psychology suggests that stereotypes can negatively affect performance on cognitive
tasks (e.g., tests of math and verbal skills). According to Stanford social psychologist Claude Steele and his
colleagues, individuals’ fear of con�irming negative stereotypes about their group acts as a distraction. This
distraction—which the researchers term stereotype threat—makes it hard to concentrate and perform well, and
thus leads to lower scores on a cognitive test (Steele, 1997). One of the primary implications of this research is that
ethnic differences in standardized-test scores can be viewed as a situational phenomenon—change the situation,
and the differences go away. In the �irst published study of stereotype threat, Claude Steele and Josh Aronson (1995)
found that when African-American students at Stanford were asked to indicate their race before taking a
standardized test, this was enough to remind them of negative stereotypes, and they performed poorly. When the
testing situation was changed, however, and participants were no longer asked their race, the students performed at
the same level as Caucasian students. Worth emphasizing is that these were Stanford students and had therefore
met admissions standards for one of the best universities in the nation. Even this group of elite students was
susceptible to situational pressure but performed at their best when the pressure was eliminated.
In a great application of stereotype threat, social psychologist Jeff Stone
at the University of Arizona asked both African-American and Caucasian
college students to try their hands at putting on a golf course (Stone,
Lynch, Sjomeling, & Darley, 1999). Putting was described as a test of
natural athletic ability to half of the participants and as a test of sports
intelligence to the other half. Thus, the experiment had two independent
variables: the race of the participants (African-American or Caucasian)
and the description of the task (“athletic ability” or “sports intelligence”).
Note that “race” in this study is technically a quasi-independent variable
because it is not manipulated. This design resulted in a total of four
conditions, and the dependent variable was the number of putts that
participants managed to make. Stone and colleagues hypothesized that
describing the task as a test of athletic ability would lead Caucasian
participants to worry about the stereotypes regarding their poor athletic
ability. In contrast, describing the task as a test of intelligence would lead
African-American participants to worry about the stereotypes regarding
their lower intelligence.
Consistent with their hypotheses, Stone and colleagues found an
interaction between race and task description but no main effects.

[Photo: Comstock Images/Stockbyte/Thinkstock. Skill on the golf course was used to study stereotypes in an experiment conducted by Jeff Stone at the University of Arizona.]

That is, neither race was better at the putting task overall, and neither task
description had an overall effect on putting performance. The
combination of these variables, though, proved fascinating. When
researchers described the task as measuring sports intelligence, the
African-American participants did poorly due to fear of confirming negative stereotypes about their overall intelligence. Conversely, when researchers described the task as measuring natural athletic ability, the Caucasian participants did poorly due to fear of confirming negative stereotypes about their athleticism. This study beautifully illustrates an interaction: the effect of one variable (task description) depends on the level of another (race of participants). The results further confirm the power of the situation: Neither group did better or worse overall, but both were responsive to a situationally induced fear of confirming negative stereotypes.
Figure 5.8: Comparing sources of variance
5.5 Analyzing Data From Experiments
So far, we have been drawing conclusions about experimental findings using conceptual terms. But naturally, before
we actually make a decision about the status of our hypotheses, we have to conduct statistical analyses. This section
provides a conceptual overview of the most common statistical techniques for analyzing experimental data.
Dealing With Multiple Groups
Why do researchers need a special technique for experimental designs? After all, we learned in Chapter 2 (2.4) that
we can compare two pairs of means using a t test; why not use several t tests to analyze our experimental designs?
For the movie ratings study, we could analyze the data using a total of six t tests to capture every possible pair of
means:
Arnold Schwarzenegger in a comedy versus Will Smith in a comedy;
Arnold Schwarzenegger in an action movie versus Will Smith in an action movie;
Arnold Schwarzenegger in a comedy versus an action movie;
Will Smith in a comedy versus an action movie;
Will Smith in a comedy versus Arnold Schwarzenegger in an action movie;
and finally Will Smith in an action movie versus Arnold Schwarzenegger in a comedy.
This approach, however, presents a problem. The odds of making a Type I error (getting excited about a false positive) increase with every statistical test. Researchers typically set their alpha level at 0.05 for a t test, meaning that they are comfortable with a 5% chance of a Type I error. Unfortunately, if we conduct six t tests, each one carries its own 5% risk, so the chance of at least one false positive somewhere in the study climbs to roughly 26% (that is, 1 − 0.95^6).
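The inflation is easy to quantify: with k independent tests, the probability of at least one Type I error is 1 − (1 − α)^k. A quick sketch:

```python
# Familywise Type I error rate for k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 6, 10):
    familywise = 1 - (1 - alpha) ** k
    print(k, round(familywise, 3))
# 1 test  -> 0.05
# 6 tests -> ~0.265
# 10 tests -> ~0.401
```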
In short, we need a statistical approach that reduces the number of comparisons we perform. Fortunately, a
statistical technique called the analysis of variance (ANOVA) tests for differences by comparing the amount of
variance explained by the independent variables to the variance explained by error.
The Logic of ANOVA
The logic behind an analysis of variance is rather straightforward.
As the course has discussed throughout, variability in a dataset can
be divided into systematic and error variance. That is, we can
attribute some of the variability to the factors being studied, but a
degree of random error will always be present. In our movie
ratings study, some of the variability in these ratings can be
attributed to the independent variables (differences in actors and
movie types), while some of the variability is due to other factors—
perhaps some people simply like movies more than other people.
The ANOVA works by comparing the influence of these different
sources of variance. We always want to explain as much of the
variance as possible through the independent variables. If the
independent variables have more influence than random error
does, this is good news. If, on the other hand, error variance has
more influence than the independent variables, this is bad news for
the hypotheses. Comparing the three pie charts in Figure 5.8
conveys a sense of this problem. The proportion of variance
explained by our independent variables is shaded in tan, while the
proportion explained by error is shaded in red. In the top graph,
the independent variables explain approximately 80% of the
variance, which we can view as a good result. In the middle graph,
however, variance is explained equally by the independent
variables and by error, and in the bottom graph, the independent
variables explain only 20% of the variance. Thus, in the latter two
graphs, the independent variables do no better than random error
at explaining the results.
One more analogy may be helpful. In the field of engineering, the
term signal-to-noise ratio is used to describe the amount of light,
sound, energy, etc., that is detectable above and beyond
background noise. This ratio is high when the signal comes through
clearly and low when it is mixed with static or other interference.
Likewise, when someone tries to tune in a favorite radio station,
the goal is to find a clear signal that is not covered up by static.
Believe it or not, the ANOVA statistic (symbolized F) is doing the
same thing. That is, the analysis tells us whether differences in
experimental conditions (signal) are detectable above and beyond
error variance (noise).
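As a rough illustration of this signal-versus-noise test, here is a sketch using SciPy's one-way ANOVA on made-up ratings for the four actor-by-movie conditions (the numbers are invented for demonstration, not the textbook's data):

```python
from scipy import stats

# Hypothetical ratings for the four conditions (invented numbers).
arnold_action = [6, 7, 5, 6, 6]
arnold_comedy = [5, 4, 6, 5, 5]
will_action   = [2, 1, 2, 1, 2]
will_comedy   = [8, 9, 7, 8, 8]

# F is the ratio of between-groups (signal) to within-groups (noise) variance.
f_stat, p_value = stats.f_oneway(arnold_action, arnold_comedy,
                                 will_action, will_comedy)
print(round(f_stat, 1), p_value < 0.05)
```

Because the group means differ far more than the scores within each group, F comes out large and the p value is well below .05.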
Research: Thinking Critically
Love Ballad Leaves Women More Open to a Date
Follow the link below to a press release describing a 2010 study from the journal Psychology of Music. The
study suggests that listening to love ballads may make women more likely to give their phone number to
someone they have just met. As you read the article, consider what you have learned so far about the
research process, and then respond to the questions below.
http://www.sciencedaily.com/releases/2010/06/100618112139.htm
Think About It
1. In this experiment, the type of song (love song or neutral song) is confounded with at least one other
variable. Try to identify one. Do you think that this confounded variable would make a difference?
How would you design a study that overcomes this?
2. Describe how demand characteristics might compromise the internal validity of this study. Can you
think of any ways around this?
3. Toward the end of the article, the authors suggest that one explanation for these results is that the
romantic music put the women into a more positive mood, and that this in turn made them more
receptive to the men. How could you design a study that tests this hypothesis?
4. Given the nature of the DV in this study, would an ANOVA test be appropriate? What would be the
more appropriate statistical test, and why?
Exploring the Data
Statistics courses cover ANOVA in more detail, but, despite its elegant simplicity, the test has a notable limitation.
After conducting an ANOVA, we have a yes-or-no answer to the following question: Do our experimental groups have
a systematic effect on the dependent variable? The answer lets us decide whether to reject the null hypothesis, but it
does not tell us everything we want to know about the data. In essence, a significant F value tells us that the groups have a significant difference, but it does not tell us what the difference is. Conducting an ANOVA on our movie-ratings study would reveal a significant interaction between actor and movie, but we would need to take additional steps to determine the meaning of this interaction.
This section will describe the process of exploring and interpreting ANOVA results to make sense of the data. The
example is drawn from a published study by Newman, Sellers, and Josephs (2005), which was designed to explore
the effects of testosterone on cognitive performance. Previous research had suggested that testosterone was
involved in two types of complex human behavior. On one hand, people with higher testosterone tend to perform
better on tests of spatial skills, such as having to rotate objects mentally, and perform worse on tests of verbal skills,
such as listing all the synonyms for a particular word. These patterns are thought to reflect the influence of
testosterone on developing brain structures. On the other hand, people with higher testosterone are also more
concerned with gaining and maintaining high status relative to other people. Testosterone correlates with a person’s
position in the hierarchy and tends to rise and fall when people win and lose competitions, respectively. Sociologist
Alan Mazur and his colleagues measured testosterone levels before, during, and after a series of professional chess
matches. They found that testosterone rose in both players in anticipation of the competition, then rose even further
in the winners, but plummeted in the losers (Mazur, Booth, & Dabbs, 1992).
Newman and colleagues (2005) set out to test the combination of these variables. Based on previous research, they
hypothesized that people with higher testosterone would be uncomfortable when they were placed in a low-status
position, leading them to perform worse on cognitive tasks. The researchers tested this hypothesis by randomly
assigning people to a high status, low status, or control condition, and then administering a spatial and a verbal test.
The resulting between-subjects design was a 2 (testosterone: high or low) × 3 (condition: high status, low status,
control), for a total of six groups. Note that “testosterone” in this study is a quasi-independent variable, because it is
measured rather than manipulated by the experimenters.
Once the results were in, the ANOVA revealed an interaction between testosterone and status but no main effects.
Figure 5.9 shows the results of the study. These bars represent z scores that combine the spatial and verbal tests
into one number. So, what do these numbers mean? How do we make sense out of the patterns? Doing so involves a
combination of comparing means and calculating effect sizes, as we discuss next.
Figure 5.9: Exploring the data: Results from
Newman et al. (2005)
Mean Comparisons
The first step in interpreting results is to compare the various pairs of means within the design. This might seem counterintuitive, since the whole point of the ANOVA was to test for effects without comparing individual means. Our goal, therefore, is to somehow explore differences between conditions without inflating Type I error rates. Achieving this balance involves two strategies.
Planned comparisons (also called a priori comparisons) involve comparing only the means for which differences
were predicted by the hypothesis. In the experiment by Newman et al. (2005), the hypothesis explicitly stated that
high-testosterone people should perform better in a high-status position than a low-status position. So, a planned
comparison for this prediction would involve comparing two means with a t test: high T, high status (the highest red
bar); and high T, low status (the lowest tan bar). Consistent with the researchers’ hypothesis, high-testosterone people performed significantly better (across both tests) in the high-status position than in the low-status position, t(27) = 2.35, p = 0.01. Type I errors are of less concern with planned comparisons because only a small number of theoretically driven comparisons are being conducted.
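A planned comparison of this kind boils down to a single two-sample t test. A sketch with invented z-scores for the two predicted cells (illustrative numbers, not Newman et al.'s raw data):

```python
from scipy import stats

# Hypothetical combined z-scores for the two cells in the planned comparison.
high_t_high_status = [1.2, 0.8, 1.0, 1.4, 0.6]
high_t_low_status  = [0.2, -0.2, 0.0, 0.4, -0.4]

# A single planned t test: only this one comparison was predicted,
# so Type I error inflation is minimal.
t_stat, p_value = stats.ttest_ind(high_t_high_status, high_t_low_status)
print(round(t_stat, 2))  # 5.0
```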
Referring to the graph of these results in Figure 5.9 and comparing high- with low-testosterone people reveals another interesting pattern: In a high-status position, high-testosterone people do better than low-testosterone people, but in a low-status position, this pattern is reversed, and high-testosterone people do worse. However, the researchers did not predict these mean comparisons, so treating them as planned contrasts would be cheating. Instead, they would use a second strategy called a post hoc comparison, which controls the overall alpha by taking into account the fact that multiple comparisons are being performed. In most cases, researchers only conduct post hoc tests if the overall F test is significant.
One popular way to conduct post hoc tests while minimizing the error rate is to use a technique called a Bonferroni
correction. This technique, named after the Italian mathematician who developed it, involves simply adjusting the
alpha level by the number of comparisons that are performed. For example, imagine we want to conduct 10 follow-
up post hoc tests to explore the data. The Bonferroni correction would involve dividing the alpha level (0.05) by the
number of comparisons (10), for a corrected alpha level of 0.005. Then, rather than using a cutoff of 0.05 for each
test, we use this more conservative Bonferroni-corrected value of 0.005. Translation: Rather than accepting a Type I
error rate of 5%, we are moving to a more conservative 0.5% cutoff to correct for the number of comparisons that
we are performing.
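The correction itself is one line of arithmetic. A sketch of the 10-comparison example, applied to some invented p values:

```python
# Bonferroni correction: divide alpha by the number of comparisons.
alpha = 0.05
n_comparisons = 10
bonferroni_alpha = alpha / n_comparisons   # 0.005

# A comparison counts as significant only if its p value beats the
# corrected cutoff (hypothetical p values for illustration).
p_values = [0.001, 0.004, 0.020, 0.300]
significant = [p < bonferroni_alpha for p in p_values]
print(significant)  # [True, True, False, False]
```

Note that 0.02 would have passed the uncorrected .05 cutoff but fails the corrected one.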
Another popular alternative to the Bonferroni correction is called Tukey’s HSD (for Honestly Significant Difference). This test works by calculating a critical value for mean comparisons (the HSD), and then using this critical value to evaluate whether mean comparisons are significantly different. The test manages to avoid inflating Type I error because the HSD is calculated based on the sample size, the number of experimental conditions, and the MSWG (the mean square within groups), which essentially tests all the comparisons at once. In the study by Newman et al. (2005), both of these post hoc tests were significant: Compared to those low in testosterone, high-testosterone people did better in a high-status position but worse in a low-status position, suggesting that high testosterone magnifies the effect of testing situations on cognitive performance.
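Tukey's HSD can be sketched from the quantities the paragraph names. Here the group count, per-group n, and MSWG are invented for illustration, and the critical value comes from SciPy's studentized-range distribution:

```python
import math
from scipy.stats import studentized_range

# Hypothetical design: k = 4 groups, n = 8 per group, and a
# within-groups mean square (MSWG) of 2.0 from the ANOVA table.
k, n, msw = 4, 8, 2.0
df_within = k * (n - 1)   # 28

# Critical value of the studentized range at alpha = .05.
q_crit = studentized_range.ppf(0.95, k, df_within)

# Honestly Significant Difference: any pair of means farther apart
# than this is declared significantly different.
hsd = q_crit * math.sqrt(msw / n)
print(round(hsd, 2))
```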
Effect Size
Statistical significance is only part of the story; researchers also want to know how big the effects of their independent variables are. Researchers can calculate effect size in several ways, but in general, bigger values mean a stronger effect. One of these statistics, Cohen’s d, is calculated as the difference between two means divided by their pooled standard deviation. The resulting values can therefore be expressed in terms of standard deviations; a d of 1 means that the means are one standard deviation apart. How big should we expect our effects to be? Based on Cohen’s analyses of typical effect sizes in the social sciences, he suggests the following benchmarks: d = 0.20 is a small effect; d = 0.50 is a moderate effect; and d = 0.80 is a large effect. In other words, a large effect in the social and behavioral sciences corresponds to means that are a little less than one standard deviation apart.
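Cohen's d, as defined above, is straightforward to compute. A sketch with two invented groups whose means sit exactly one pooled standard deviation apart:

```python
import math

def cohens_d(group1, group2):
    """Difference between two means divided by their pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical groups: means 4 and 3, pooled SD of 1, so d = 1.0.
a = [3, 4, 5]
b = [2, 3, 4]
print(cohens_d(a, b))  # 1.0
```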
In interpreting the results of their testosterone experiment, Newman and colleagues (2005) computed effect-size
measurements for two of the key mean comparisons. First, they compared high-testosterone people in the high- and
low-status conditions; the size of this effect was a d = 0.78. Second, they compared the high- and low-testosterone
people in the low-status condition; the size of this effect was a d = 0.61. Both of these are sizable effects by Cohen’s benchmarks. More important, taken together with the mean comparisons, they help us to
understand the way testosterone affects behavior. The authors conclude that cognitive performance stems from an
interaction between biology (testosterone) and environment (assigned status) such that high-testosterone people
are more responsive to their status in a given situation. When they are placed in a high-status position, they relax
and perform well. Conversely, when placed in a low-status position, they become distracted and perform poorly.
Researchers reach this nuanced conclusion only through an exploration of the data, using mean comparisons and
effect-size measures.
5.6 Wrap-Up: Avoiding Error
As this final chapter concludes, it is worth thinking back to one of the key concepts in Chapter 2 (2.4): Type I and
Type II errors. Regardless of the research question, the hypothesis, or the particulars of the research design, all
studies have the goal of making accurate decisions about the hypotheses. That is, we need to be able to correctly
reject the null hypothesis when it is false, and fail to reject the null when it is true. Still, from time to time and
despite our best efforts, we make mistakes when we draw conclusions about our hypotheses, as Table 5.4
summarizes. A Type I error, or “false positive,” involves falsely rejecting a null hypothesis and becoming excited
about an effect that is due to chance. A Type II error, or “false negative,” involves failing to reject the null hypothesis
and missing an effect that is real and interesting. (For a refresher on these terms, refer back to Chapter 2.)
Table 5.4: Review of Type I and Type II errors
                      Researcher’s Decision
                Reject Null         Fail to Reject Null
Null is FALSE   Correct Decision    Type II Error
Null is TRUE    Type I Error        Correct Decision
This section takes a problem-solving approach to minimizing both of these errors in an experimental context. It
turns out that each error is primarily under the researcher’s control at different stages in the research process,
which means reducing each error calls for different strategies.
Avoiding Type I Error
Type I errors occur when results are due to chance but are mistakenly interpreted as significant. We can generally reduce the odds of this happening by setting our alpha level at 0.05, meaning that we will only be excited about results that have less than a 5% probability of occurring by chance if the null hypothesis were true. However, Type I errors can still occur as a result of either extremely large samples or large numbers of statistical comparisons. Large samples can make small effects seem highly significant, so it is important to set a more conservative alpha level in large-scale studies. And, as this chapter has discussed, the odds of Type I error are compounded with each statistical test we conduct.
What this means is that Type I error is primarily under researchers’ control during statistical analysis—the smarter
the statistics, the lower the odds of Type I error. This chapter has discussed several examples of “smart” statistics:
Instead of conducting lots of t tests, we use an ANOVA to test for differences across the entire design simultaneously.
Instead of conducting t tests to compare means after an ANOVA, we use a mix of planned contrasts (for comparisons
that we predicted) and post hoc tests (for other comparisons we want to explore). More advanced statistical
techniques take this a step further. For example, the multivariate analysis of variance (MANOVA) statistic
analyzes sets of dependent variables to reduce further the number of individual tests. Researchers use this approach
when dependent variables represent different measures of a related concept, such as using heart rate, blood
pressure, and muscle tension to capture the stress response. The MANOVA works, broadly speaking, by computing a
weighted sum of these separate DVs (called a canonical variable) and using this new variable as the dependent
variable. To learn more about this and other advanced statistical techniques, see the excellent volume by James
Stevens (2002), Applied Multivariate Statistics.
Avoiding Type II Error
Type II errors occur when a real underlying relationship exists between the variables, but the statistical tests are nonsignificant. The primary sources of this error are small samples and bad design. Small samples may fail to capture enough variability and may therefore lead to nonsignificant p values in testing an otherwise significant effect. Both large and small mistakes in experimental designs can add noise to the dataset, making it difficult to detect the real effects of independent variables.
This means that Type II error is primarily under the researcher’s control during the design process—the smarter
the research designs, the lower the odds of Type II error. First, as Chapter 2 discussed, it is relatively simple to
estimate the sample size needed for our research using a power calculator. These tools take basic information about
the number of conditions in the research design and the estimated size of the effect and then estimate the number of
people needed to detect this effect. (See Chapter 2, Figure 2.5, for an annotated example using one of these online
calculators.)
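A power calculator of the kind described here can be approximated in a few lines. This sketch uses the common normal-approximation formula for a two-group comparison; the alpha = .05 and power = .80 defaults are conventional assumptions, not values from the chapter:

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sample comparison,
    using the normal approximation to the power calculation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_beta = norm.ppf(power)            # quantile for desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Bigger expected effects need far fewer participants to detect.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
# 0.2 -> 393 per group, 0.5 -> 63, 0.8 -> 25 (approximately)
```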
Second, as every chapter has discussed, it is the experimenter’s responsibility to take steps to minimize extraneous
variables that might interfere with the hypothesis test. Whether researchers are conducting an observation, a survey
study, or an experiment, the overall goal is to ensure that the variables of interest are the main cause of changes in
the dependent variable. This is perhaps easiest in an experimental context because these designs are usually
conducted in a controlled setting where the experimenter has control over the independent variables. Nonetheless,
as the chapter discussed earlier, many factors can threaten the internal validity of an experiment—from confounds
to sample bias to expectancy effects. In essence, the more we can control the influence of these extraneous variables, the more confidence we can have in the results of the hypothesis test.
Table 5.5 presents a summary of the information in this section, listing the primary sources of Type I and Type II
errors, as well as the time period when these are under experimenter control.
Table 5.5: Summary—avoiding error
Error     Definition        Main Source                        When You Can Control
Type I    False positive    Lots of tests; lots of people      Conducting stats
Type II   False negative    Bad measures; not enough people    Designing experiments
Summary and Resources
Chapter Summary
This chapter focused on experimental designs, in which the primary goal is to explain behavior in causal terms. The
chapter began with an overview of experimental terminology and the key features of experiments. Three key
features distinguish experiments from other research designs. First, researchers manipulate a variable, giving them
a fair amount of confidence that the independent variable (IV) causes changes in the dependent variable (DV).
Second, researchers control the environment, ensuring that everything about the experimental context is the same
for different groups of participants—except for the level of the independent variable. Finally, the researchers have
the power to assign participants to conditions using random assignment. This process helps to ensure that
preexisting differences among participants (e.g., in mood, motivation, intelligence, etc.) are balanced across the
experimental conditions.
Next, the chapter explained the concept of experimental validity. When evaluating experiments, researchers must
take into account both internal validity—or the extent to which the IV is the cause of changes in the DV—and
external validity—or the extent to which the results generalize beyond the specific laboratory setting. Several
factors can threaten internal validity, including experimental confounds, selection bias, and expectancy effects. The
common thread among these threats is that they add noise to the hypothesis test and cast doubt on the direct
connection between IV and DV. External validity involves two components, the realism of the study and the generalizability of the findings. Psychology experiments are designed to study real-world phenomena, but sometimes compromises have to be made to study these phenomena in the laboratory. Research often achieves this balance via psychological realism, or replicating the psychological conditions of the real phenomenon. Last, researchers have more confidence in the findings of a study when they can be replicated, or repeated in different settings with different measures.
In designing the nuts and bolts of experiments, researchers have to make decisions about both the nature and
number of independent variables. First, designs can be described as between-subject, within-subject, or mixed. In a
between-subject design, participants are in only one experimental condition and receive only one combination of
the independent variables. In a within-subject design, participants are in all experimental conditions and receive all
combinations of the independent variables. Finally, a mixed design contains a combination of between- and within-
subject variables. In addition, research designs can be described as either one-way or factorial. One-way designs
consist of only one IV with at least two levels; factorial designs consist of at least two IVs, each having at least two
levels. A factorial design produces several results to examine: the main effect of each IV plus the interaction, or
combination, of the IVs.
The chapter also discussed the logic of analyzing experimental data, using the analysis of variance (ANOVA) statistic.
This test works by simultaneously comparing sources of variance and therefore avoids the risk of inflated Type I
error. The ANOVA (or F) is calculated as a ratio of systematic variance to error variance, or, more specifically, of
between-groups variance to within-groups variance. The bigger this ratio, the more experimental manipulations
contribute to overall variability in scores. However, the F statistic suggests only that differences exist in the design;
further analyses are necessary to explore these differences. The chapter described an example from a published
study, discussing the process of comparing means and calculating effect sizes. In comparing means, researchers use
a mix of planned contrasts (for comparisons that they predicted) and post hoc tests (for other comparisons they
want to explore).
Finally, the chapter concluded by referring to two recurring concepts, Type I error (false positive) and Type II error
(false negative). These errors interfere with the broad goal of making correct decisions about the status of a
hypothesis. Thus, the purpose of this �inal section was to review ways to minimize errors. Type I errors are
primarily in�lated by large samples and lots of statistical analyses. Consequently, this error is under the
experimenter’s control at the data-analysis stage. Type II errors are primarily in�lated by small samples and �laws in
the experimental design. Consequently, this error is under the experimenter’s control at the design and planning
stage.
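The claim that running many tests inflates Type I error is easy to verify by simulation. In this sketch (sample sizes and counts are arbitrary), every comparison is truly null, yet with 20 independent tests per "study" at alpha = .05, the chance of at least one false positive per study approaches 1 − .95^20 ≈ .64:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_tests, n_sims = 0.05, 20, 1000

false_positive_runs = 0
for _ in range(n_sims):
    # All 20 comparisons are null: both samples come from the same population.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    if min(p_values) < alpha:  # at least one Type I error in this "study"
        false_positive_runs += 1

familywise_rate = false_positive_runs / n_sims
# Expected familywise rate for 20 independent null tests: 1 - 0.95**20, about 0.64
print(round(familywise_rate, 2))
```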
1/12/2018 Imprimir
https://content.ashford.edu/print/Newman.2681.16.1?sections=navpoint-32,navpoint-33,navpoint-34,navpoint-35,navpoint-36,navpoint-37,navpoint-3… 34/40
Key Terms
analysis of variance (ANOVA)
A statistical procedure that tests for differences by comparing the variance explained by systematic factors to the
variance explained by error.
between-subject design
Experimental design in which each group of participants is exposed to only one level of the independent variable.
Bonferroni correction
A post hoc test that involves adjusting the alpha level by the number of comparisons to set a more conservative
cutoff.
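The correction itself is one line of arithmetic; a sketch with hypothetical p-values:

```python
# Bonferroni correction: divide the overall alpha by the number of comparisons.
alpha = 0.05
n_comparisons = 6  # e.g., all pairwise comparisons among four groups
adjusted_alpha = alpha / n_comparisons  # 0.00833...

# A comparison counts as significant only if it beats the stricter cutoff.
p_values = [0.001, 0.02, 0.004, 0.30, 0.049, 0.012]
significant = [p < adjusted_alpha for p in p_values]
print(significant)  # [True, False, True, False, False, False]
```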
carryover effect
Effects of one level are present when another level is introduced, making it difficult to separate the effects of
different levels.
conceptual replication
Testing the relationship between conceptual variables using new operational definitions.
condition
One of the versions of an independent variable, forming different groups in the experiment; in a factorial design,
refers to the groups formed by combinations of IVs.
confounding variable (or confound)
A variable that changes systematically with the independent variable.
constructive replication
Recreation of the original experiment that adds elements to the design; usually designed to rule out alternative
explanations or extend knowledge about the variables under study.
control condition
Group within the experiment that does not receive the experimental treatment.
counterbalancing
Variation of the order of presentation among participants to reduce order effects.
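For a small number of levels, full counterbalancing can be generated directly; a sketch with hypothetical levels and participant labels:

```python
from itertools import cycle, permutations

# Hypothetical within-subject study with three task levels to counterbalance.
levels = ["A", "B", "C"]
orders = list(permutations(levels))  # all 6 possible presentation orders

# Rotate participants through the orders so each order is used equally often.
participants = [f"P{i}" for i in range(1, 13)]
assignment = {p: order for p, order in zip(participants, cycle(orders))}
print(assignment["P1"], assignment["P7"])  # both get ('A', 'B', 'C')
```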
cover story
A misleading statement to participants about what is being studied to prevent effects of demand characteristics.
demand characteristic
Cue in the study that leads participants to guess the hypothesis.
differential attrition
Loss of participants who drop out of the experimental groups at different rates or for different reasons.
environmental manipulation
Changing some aspect of the experimental setting.
exact replication
Recreation of the original experiment as closely as possible to verify the findings.
experimental condition
Group within the experiment that receives a treatment designed to test a hypothesis.
experimental design
Design whose primary goal is to explain causes of behavior.
experimenter expectancy
Researchers see what they expect to see, leading to subtle bias in favor of their hypotheses; threat to internal
validity.
external validity
A metric that assesses generalizability of results beyond the specific conditions of the experiment.
extraneous variable
Variable that adds noise to a hypothesis test.
factorial design
A design that has two or more independent variables, each with two or more levels.
factorial notation
A system for describing the number of variables and the number of levels in experimental designs.
fatigue effect
Decline of participants’ performance as a result of repeated testing.
generalizability
The extent to which results extend to other studies, using a wide variety of populations and of operational
definitions.
instructional manipulation
Changing the way a task is described to change participants’ mind-sets.
interaction
The combined effect of variables in a factorial design; the effects of one IV are different depending on the levels of
the other IV.
internal validity
A metric that assesses the degree to which results can be attributed to independent variables.
invasive manipulation
Taking measures to change internal, physiological processes; usually conducted in medical settings.
level
Another way to describe the versions of an independent variable; describes the specific circumstances created by
manipulating a variable.
main effect
The effect of each independent variable on the dependent variable, collapsing across the levels of other variables.
marginal mean
The combined mean of one factor across levels of another factor.
matched random assignment
A variation on random assignments; ensures that an important variable is equally distributed between or among
the groups; the experimenter obtains scores on an important matching variable, ranks participants on this
variable, and then randomly assigns participants to conditions.
mixed design
Experimental design that contains at least one between-subject variable and at least one within-subject variable.
multivariate analysis of variance (MANOVA)
A statistic that analyzes sets of dependent variables to reduce the number of individual tests.
mundane realism
Research that replicates the psychological conditions of the real-world phenomenon; criterion for judging
external validity.
one-way design
A design that has only one independent variable, with two or more levels to the variable.
order effect
Moderation of the effects because of the order in which levels occur.
participant replication
Repetition of the study with a new population of participants; usually driven by a compelling theory as to why the
two populations differ.
placebo control
Group added to a study to reduce placebo effects; mimics the experimental condition in every way but one.
placebo effect
Change resulting from the mere expectation that change will occur.
planned comparison (or a priori comparison)
Comparisons that involve comparing only the means for which differences were predicted by the hypothesis.
post hoc comparison
Comparison that controls the overall alpha by taking into account that multiple comparisons are being performed;
usually allowed only if the overall F test is significant.
practice effect
Improvement of participants’ performance as a result of repeated testing.
quasi-independent variable
Preexisting difference used to divide participants in an experimental context; referred to as “quasi” because
variables are being measured, not manipulated, by the experimenter.
random assignment
A technique for assigning participants to conditions; before participants arrive, the experimenter makes a random
decision for each participant’s placement in a group.
replication
Repetition of research results in different contexts and/or different laboratories.
selection bias
Occurs when groups are different before the manipulation; problematic because preexisting differences might be
the driving factor behind the results.
Tukey’s HSD (Honestly Significant Difference)
A post hoc test that calculates a critical value for mean comparisons (the HSD) and then uses this critical value to
evaluate whether mean comparisons are significantly different.
unrelated-experiments technique
A strategy for preventing the effects of demand characteristics, leading participants to believe that they are
completing two experiments during one session; experimenter can use this to present the independent variable
during the first experiment and measure the dependent variable during the second experiment.
within-subject design
Experimental design in which each group of participants is exposed to all levels of the independent variable.
Chapter 5 Flashcards
Apply Your Knowledge
1. List and briefly describe the three distinguishing features of an experiment.
a.
b.
c.
2. List the three types of expectancy effect that can affect experimental results, and name one way to avoid
each type.
a.
b.
c.
3. The following designs are described using factorial notation. For each one, state (a) the number of variables
in the design, (b) the number of levels each variable has, and (c) the total number of experimental
conditions.
3 × 3 × 3
a.
b.
c.
2 × 3 × 4
a.
b.
c.
4 × 4
a.
b.
c.
2 × 2 × 2 × 2
a.
b.
c.
4. Forty students were asked to rate two authors according to their knowledge of certain topic areas. Each
student was given two passages to read. In one passage (“Brain”), the author discussed the roles of various
brain structures in perceptual-motor coordination. In the second passage (“Motivation”), the author
described ways to enhance motivation in preschool children. For half the students, both passages were
written by a male author; for the other half, both passages were written by a female author.
After reading the passages, students rated the authors’ knowledge of their topic areas on a scale ranging
from 1 (displays very little knowledge) to 10 (displays a thorough knowledge).
             Male Author   Female Author
Brain             9               4
Motivation        6               7
(1) Identify the following information about the design:
(2) Describe the design using factorial notation (e.g., 4 × 3).
(3) Identify the total number of conditions.
(4) Identify the design (circle one): between-subject within-subject mixed
5. For each of the following scenarios, identify what a Type I error and a Type II error would look like. Then
determine which type would be a bigger problem for that scenario.
a. A large international airport has received a bomb threat. In response, the airport police have
tightened security and now check every piece of luggage manually.
(1) Type I:
(2) Type II:
(3) Bigger problem:
b. Your friend purchases a pregnancy test.
(1) Type I:
(2) Type II:
(3) Bigger problem:
Critical Thinking Questions
1. Explain the advantages and disadvantages of a within-subject design.
2. Compare and contrast the following terms. Your answers should demonstrate that you understand each
term. Be sure to give some kind of context (e.g., “both are types of . . .”) or provide an example, and state how
they are different.
a. internal versus external validity
b. between-subject versus within-subject design
c. level versus condition
3. Explain the difference between Type I and Type II errors. How can each type of error be minimized?