5D1-9 – Summative and Formative Evaluations – see details below. Please follow instructions given and answer all questions.

Discussion Instructions: 

Based upon the Program Evaluation and Performance Measurement text, Chapters 11 & 12 readings.

1. Explain why summative evaluations are more challenging to do than formative evaluations.

    4  The Relationship Between Formative and Summative Assessment—In the Classroom and Beyond

    Suggested Citation: "4 The Relationship Between Formative and Summative Assessment—In the Classroom and Beyond." National Research Council. 2001. Classroom Assessment and the National Science Education Standards. Washington, DC: The National Academies Press. doi: 10.17226/9847.

    This chapter discusses the relationships between formative and summative
    assessments—both in the classroom and externally. In addition to teachers, site- and
    district-level administrators and decision makers are target audiences. External test
    developers also may be interested.

    Teachers inevitably are responsible for assessment that requires them to report on
    student progress to people outside their own classrooms. In addition to informing and
    supporting instruction, assessments communicate information to people at multiple
    levels within the school system, serve numerous accountability purposes, and provide
    data for placement decisions. As they juggle these varied purposes, teachers take on
    different roles. As coach and facilitator, the teacher uses formative assessment to help
    support and enhance student learning. As judge and jury, the teacher makes summative
    judgments about a student’s achievement at a specific point in time for purposes of
    placement, grading, accountability, and informing parents and future teachers about
    student performance. Often in our current system, all of the purposes and elements of
    assessment are not mutually supportive, and can even be in conflict. What seems
    effective for one purpose may not serve, or even be compatible with, another. Review
    Table 2-1 in Chapter 2.

    The previous chapters have focused primarily on the ongoing formative assessment
    teachers and students engage in on a daily basis to enhance student learning. This
    chapter briefly examines summative assessment that is usually prescribed by a local,
    district, or state agency, as it occurs regularly in the classroom and as it occurs in large-scale
    testing. The chapter specifically looks at the relationship between formative and summative
    assessment and considers how inherent tensions between the different purposes of assessment
    may be mitigated.

    HOW CAN SUMMATIVE ASSESSMENT SERVE THE
    STANDARDS?

    The range of understanding and skill called for in the Standards acknowledges the
    complexity of what it means to know, to understand, and to be able to do in science.
    Science is not solely a collection of facts, nor is it primarily a package of procedural skills.
    Content understanding includes making connections among various concepts with
    which scientists work, then using that information in specific context. Scientific
    problem-solving skills and procedural knowledge require working with ideas, data, and
    equipment in an environment conducive to investigation and experimentation. Inquiry,
    a central component of the Standards, involves asking questions, planning, designing
    and conducting experiments, analyzing and interpreting data, and drawing conclusions.

    If the Standards are to be realized, summative as well as formative assessment must
    change to encompass these goals. Assessment for a summative purpose (for example,
    grading, placement, and accountability) should provide students with the opportunity to
    demonstrate conceptual understanding of the important ideas of science, to use
    scientific tools and processes, to apply their understanding of these important ideas to
    solve new problems, and to draw on what they have learned to explain new
    phenomena, think critically, and make informed decisions (NRC, 1996). The various
    dimensions of knowing in science will require equally varied assessment strategies, as
    different types of assessments capture different aspects of learning and achievement
    (Baxter & Glaser, 1998; Baxter & Shavelson, 1994; Herman, Gearhart, & Baker, 1993;
    Ruiz-Primo & Shavelson, 1996; Shavelson, Baxter, & Pine, 1991; Shavelson & Ruiz-Primo,
    1999).

    FORMS OF SUMMATIVE ASSESSMENT IN THE
    CLASSROOM

    As teachers fulfill their different roles as assessors, tensions between formative and
    summative purposes of assessment can be significant (Bol and Strange, 1996). However,
    teachers often are in the position of being able to tailor assessments for both
    summative and formative purposes.

    Performance Assessments

    Any activity undertaken by a student provides an opportunity for an assessment of the
    student’s performance. Performance assessment often implies a more formal
    assessment of a student as he or she engages in a performance-based activity or task.
    Students are often provided with apparatus and are expected to
    design and conduct an investigation and communicate findings during a specified period
    of time. For example, students may be given the appropriate material and asked to
    investigate the preferences of sow bugs for light and dark, and dry or damp
    environments (Shavelson, Baxter, & Pine, 1991). Or, a teacher could observe while
    students design and conduct water-quality tests on a given sample of water to

    https://www.nap.edu/read/9847/chapter/6

    https://www.nap.edu/read/9847/chapter/6

    determine what variables the students measure, and what those variables indicate to
    them, and how they explain variable interaction. Observations can be complemented by
    assessing the resultant products, including data sheets, graphs, and analysis. In some
    cases, computer simulations can replace actual materials and journals in which students
    include results, interpretations, and conclusions can serve as proxies for observers
    (Shavelson, Baxter, & Pine, 1991).

    By their nature, these types of assessments differ in a variety of ways from the
    conventional types of assessments. For one, they provide students with opportunities to
    demonstrate different aspects of scientific knowledge (Baxter & Shavelson, 1994;
    Baxter, Elder, & Glaser, 1996; Ruiz-Primo & Shavelson, 1996). In the sow bug
    investigation, for example, students have the opportunity to demonstrate their ability to
    design and conduct an experiment (Baxter & Shavelson, 1994). The investigation of
    water quality highlights procedural knowledge as well as the content knowledge
    necessary to interpret tests, recognize and explain relationships, and provide analysis.
    Because of the numerous opportunities to observe students at work and examine their
    products, performance assessments can be closely aligned with curriculum and
    pedagogy.

    Portfolios

    Duschl and Gitomer (1997) have conducted classroom-based research on portfolios as
    an assessment tool to document progress and achievement and to contribute to a
    supportive learning environment. They found that many aspects of the portfolio and the
    portfolio process provided assessment opportunities that contributed to improved work
    through feedback, conversations about content and quality, and other assessment-
    relevant discussions. The collection also can serve to demonstrate progress and inform
    and support summative evaluations. The researchers document the challenges as well
    as the successes of building a learning environment around portfolio assessment. They
    suggest that the relationship between assessment and instruction requires
    reexamination so that information gathered from student discussions can be used for
    instructional purposes. For this purpose, a teacher's conception and depth of
    subject-matter knowledge need to be
    developed and cultivated so that assessment criteria derive from what is considered
    important in the scientific field that is being studied, rather than from poorly connected
    pieces of discrete information.

    Researchers at Harvard’s Graduate School of Education (Seidel, Walters, Kirby, Olff,
    Powell, Scripp, & Veenema, 1997) suggest that the following elements be included in
    any portfolio system:

    • collection of student work that demonstrates what students have learned and understand;
    • an extended time frame to allow progress and effort to be captured;
    • structure or organizing principles to help organize as well as interpret and analyze; and
    • student involvement in not only the selection of the materials but also in the reflection and assessment.

    An example of the contents of a portfolio for a science project could be as follows:

    • the brainstorming notes that lead to the project concept;
    • the work plan that the student followed as a result of a time line;
    • the student log that records successes and difficulties;
    • review of actual research results;
    • photograph of finished project; and
    • student reflection on the overall project (p. 32).

    Using Traditional Tests Differently

    Certain kinds of traditional assessments that are used for summative purposes contain
    useful information for teachers and students, but these assessments are usually too
    infrequent, come too late for action, and are too coarse-grained. Some of the activities
    in these summative assessments provide questions and procedures that might, in a
    different context, be useful for formative purposes. For example, rescheduling
    summative assessments can contribute to their usefulness to teachers and students for
    formative purposes. Tests that are given before the end of a unit can provide both
    teacher and student with useful information on which to act while there is still
    opportunity to revisit areas where students were not able to perform well.
    Opportunities for revisions on tests or any other type of assessment give students
    another chance to work through, think about, and come to understand an area they did
    not fully understand or clearly articulate the previous time. In reviewing for a test, or
    preparing for essay questions, students can begin to make connections between aspects
    of subject matter that they may not have related previously to one another. Sharing
    designs before an experiment gets under way during a peer-assessment session gives
    each student a chance to comment on and to improve his or her own investigation as
    well as those of their classmates. When performed as a whole class, reviewing helps make
    explicit to all students the key concepts to be covered.


    Selected response and written assessments, homework, and classwork all serve as
    valuable assessment activities as part of a teacher's repertoire if used appropriately.
    The form that the assessment takes should coincide with careful consideration of the
    intended purpose. Again, the use of the data generated by and through the
    assessment is important so that it feeds back into the teaching and learning.

    As shown in Table 4-1, McTighe and Ferrara (1998) provide a useful framework for
    selecting assessment approaches and methods. The table accents the range of
    common assessments available to teachers. Although their framework serves all
    subject-matter areas, the wide variety of assessments and assessment-rich activities
    could be applicable for assessments in a science classroom.

    TABLE 4-1 Framework of Assessment Approaches and Methods

    HOW MIGHT WE ASSESS STUDENT LEARNING IN THE CLASSROOM?

    Selected-Response Format
    • Multiple-choice
    • True-false
    • Matching
    • Enhanced multiple choice

    Constructed-Response Format

    Brief Constructed Response
    • Fill in the blank (word(s), phrase(s))
    • Short answer (sentence(s), paragraphs)
    • Label a diagram
    • "Show your work"
    • Visual representation

    Performance-Based Assessment

    Product
    • Essay
    • Research paper
    • Story/play
    • Poem
    • Portfolio
    • Art exhibit
    • Science project
    • Model
    • Video/audiotape
    • Spreadsheet
    • Lab report

    Performance
    • Oral presentation
    • Dance/movement
    • Science lab demonstration
    • Athletic skill performance
    • Debate
    • Musical recital
    • Keyboarding
    • Teach-a-lesson
    • Dramatic reading
    • Enactment

    Process-Focused Assessment
    • Oral questioning
    • Observation ("kid watching")
    • Interview
    • Conference
    • Process description
    • "Think aloud"
    • Learning log

    SOURCE: McTighe and Ferrara (1998).

    GRADING AND COMMUNICATING ACHIEVEMENT

    One common summative purpose of assessment facing most teachers is the need to
    communicate information on student progress and achievement to parents, school
    board officials, members of the community, and college admissions officers. In addition to
    scores from externally mandated tests, teacher-assigned grades traditionally serve this
    purpose.

    A discussion in Chapter 2 defends the use of descriptive, criterion-based feedback as
    opposed to numerical scoring (8/10) or grades (B). A study cited (Butler, 1987) showed
    that the students who demonstrated the greatest improvement were the ones who
    received detailed comments (only) on their returned pieces of work. However, grading
    and similar practices are the reality for the majority of teachers. How might grading be
    used to best support student learning?


    Though they are the primary currency of our current summative-assessment system,
    grades typically carry little meaning because they reduce a great deal of information to a
    single letter. Furthermore, there is often little agreement about the difference between an A
    and a B, a B and a C, or a D and an F, or about what is required for a particular letter
    grade (Loyd & Loyd, 1997).

    Grades may symbolize achievement, yet they often incorporate other factors as well,
    such as work habits, which may or may not be related to level of achievement. They are
    often used to reward or motivate students to display certain behaviors (Loyd & Loyd,
    1997). Without a clear understanding of the basis for the grade, a single letter often will
    provide little information on how work can be improved. As noted previously, grades
    will only be as meaningful as the underlying criteria and the quality of assessment that
    produced them.

    A single-letter grade or the score on an end-of-unit test does not make student progress
    explicit, nor does either provide students and teachers with information that might
    further their understandings or inform their learning. A “C” on a project or on a report
    card indicates that a student did not do exemplary work, but beyond that, there is
    plenty of room for interpretation and ambiguity. Did the student show thorough
    content understanding but fall short in presentation? Did the student not convey clear
    ideas? Or did the student not provide adequate explanation of why a particular
    phenomenon occurred? Without any information about these other dimensions, a
    single-letter grade does not provide specific guidance about how work can be improved.


    Surrounded by ambiguity, a letter grade without discussion and an understanding of
    what it constitutes does little to provide useful information to the student, or even give
    an indication of the level of performance. Thus, when teachers establish criteria for
    individual assessments and make them explicit to students, they also need to do so for
    grading criteria. The criteria also should be clear to those who must interpret them,
    such as parents and future teachers, and incorporate priorities and goals important to
    science as a school subject area.

    Careful documentation can allow formative assessments to be used for summative
    purposes. The manner in which summative assessments are reported helps determine
    whether they can be easily translated for formative purposes—especially by the
    student, teacher, and parents. In the vignette in Chapter 3, a middle school science
    teacher confers with students as they engage in an ongoing investigation. She keeps
    written notes of these exchanges as well as from the observations she makes of the
    students at work. When it is time for this teacher to assign student grades for the
    project, she can refer to these notes to provide concrete examples as evidence. Using
    ongoing assessments to inform summative evaluations is particularly important for
    inquiry-based work, which cannot be captured in most one-time tests. Many teachers
    give students the opportunity to make test corrections or provide other means for
    students to demonstrate that they understand material previously not mastered.
    Documenting these types of changes over time will show progress and can be used as
    evidence of understanding for summative purposes.

    Teachers face the challenge of overcoming the common obstacle of assigning classroom
    grades and points in such a way that they drive classroom activity to the detriment of
    other, often more informative and useful, types of assessment that foster standards-
    based goals. Grading practices can be modified, however, so that they adhere to
    acceptable standards for summative assessments and at the same time convey
    important information that can be used to improve work in a way that is relatively easy
    to read and understand. Mark Wilson and colleagues at the University of California,
    Berkeley, have devised one such plan for the assessment system designed for the SEPUP
    (Science Education for Public Understanding Program) middle school science curriculum
    (Wilson & Sloane, 1999; Roberts, Wilson, & Draney, 1997; Wilson & Draney, 1997).

    The SEPUP assessment system serves as an example of possible alternatives to the
    traditional, current single-letter grade scheme. As shown in Table 4-2, the SEPUP
    assessment blueprint indicates that a single assessment will not capture all of the skills
    and content desired in any particular curricular unit. However, teachers do not need to
    be concerned about getting all the assessment information they need at a single time
    with any single assessment.


    TABLE 4-2 SEPUP Assessment Blueprint

    Teacher's Guide, Part 1: Water Usage and Safety (Sections A and B)

    The blueprint maps the unit's twelve activities (1 Drinking-Water Quality; 2 Exploring
    Sensory Thresholds; 3 Concentration; 4 Mapping Death; 5 John Snow; 6 Contaminated Water;
    7 Chlorination; 8 Chicken Little, Chicken Big; 9 Lethal Toxicity; 10 Risk Comparison;
    11 Injection Problem; 12 Peru Story) against five assessment variables and their elements:

    • Designing and Conducting Investigations: Designing Investigation; Selecting and Recording Procedures; Organizing Data; Analyzing and Interpreting Data
    • Evidence and Tradeoffs: Using Evidence; Using Evidence to Make Tradeoffs
    • Understanding Concepts: Recognizing Relevant Content; Applying Relevant Content
    • Communicating Scientific Information: Organization; Technical Aspects
    • Group Interaction: Time Management; Role Performance/Participation; Shared Opportunity

    For each activity, the blueprint flags the elements assessed at that point, along with any
    content concepts assessed (★), such as Measurement and Scale. For example, Activity 5
    (John Snow) assesses Using Evidence; Activity 6 (Contaminated Water), Designing
    Investigation; Activity 9 (Lethal Toxicity), Organizing Data; Activity 10 (Risk Comparison),
    Analyzing and Interpreting Data; and Activity 12 (Peru Story), Organizing Data and
    Analyzing and Interpreting Data.

    SOURCE: Science Education for Public Understanding Program (1995).

    By using the same scale for the entire unit, the SEPUP assessment system allows
    teachers to obtain evidence about the students’ progress. Without the context or
    criteria that the SEPUP scoring guide (Table 4-3) provides, a score of “2” on an
    assessment could be interpreted as inadequate, even though the scale is 0-4. However, as the
    scoring guide indicates, a "2" in this example represents a worthwhile step on the road
    to earning a score of "4". In practice, the specific areas that need additional attention
    are conveyed in the scoring guide, so a student who receives a "2" as feedback
    knows what he or she needs to do to improve the piece of work. The scoring guide also can
    provide summative assessments at any given point.

    TABLE 4-3 SEPUP Scoring Guide

    Scoring Guide: Evidence and Tradeoffs (ET) Variable

    Using Evidence: Response uses objective reason(s) based on relevant evidence to argue for or against a choice.
    Using Evidence to Make Tradeoffs: Response recognizes multiple perspectives of issue and explains each perspective using objective reasons, supported by evidence, in order to make a choice.

    Score 4
    • Using Evidence: Response accomplishes Level 3 AND goes beyond in some significant way, e.g., questioning or justifying the source, validity, and/or quantity of the evidence.
    • Using Evidence to Make Tradeoffs: Accomplishes Level 3 AND goes beyond in some significant way, e.g., suggesting additional evidence beyond the activity that would influence choices in specific ways, OR questioning the source, validity, and/or quantity of the evidence and explaining how it influences choice.

    Score 3
    • Using Evidence: Provides major objective reasons AND supports each with relevant and accurate evidence.
    • Using Evidence to Make Tradeoffs: Uses relevant and accurate evidence to weigh the advantages and disadvantages of multiple options, and makes a choice supported by the evidence.

    Score 2
    • Using Evidence: Provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
    • Using Evidence to Make Tradeoffs: States at least two options AND provides some objective reasons using some relevant evidence, BUT reasons or choices are incomplete and/or part of the evidence is missing; OR only one complete and accurate perspective has been provided.

    Score 1
    • Using Evidence: Provides only subjective reasons (opinions) for choice; uses unsupported statements; OR uses inaccurate or irrelevant evidence from the activity.
    • Using Evidence to Make Tradeoffs: States at least one perspective BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

    Score 0
    • Using Evidence: Missing, illegible, or offers no reasons AND no evidence to support choice made.
    • Using Evidence to Make Tradeoffs: Missing, illegible, or completely lacks reasons and evidence.

    Score X
    • Student had no opportunity to respond.

    SOURCE: Science Education for Public Understanding Program (1995).

    The SEPUP assessment system provides one such example, but teachers can employ
    other forms of assessment that capture progress as well as achievement at a specific
    point in time. Keyed to standards and goals, such systems can be strong on meaning for
    teachers and students and still convey information to different levels of the system in a
    relatively straightforward and plausible manner that is readily understood. Teachers can
    use the standards or goals to help guide their own classroom assessments and
    observations and also to help them support work or learning in a particular area where
    sufficient achievement has not been met.

    Devising a criterion-based scale to record progress and make summative judgments
    poses difficulties of its own. Settling on the level of specificity involved in subdividing a
    domain, so that the separate elements together represent the whole, is a crucial and
    demanding task (Wiliam, 1996). This becomes an issue whether considering
    performance assessments or ongoing assessment data and needs to be articulated in
    advance of when students engage in activities (Quellmalz, 1991; Gipps, 1994).


    Specific guidelines for the construction and selection of test items are not offered in this
    document. Test design and selection are certainly important aspects of a teacher’s
    assessment responsibility and can be informed by the guidelines and discussions
    presented in this document (see also Chapter 3). Item-writing recommendations and
    other test specifications are topics of a substantial body of existing literature (for
    practitioner-relevant discussions, see Airasian, 1991; Cangelosi, 1990; Cunningham,
    1997; Doran, Chan, and Tamir, 1998; Gallagher, 1998; Gronlund, 1998; Stiggins, 2001).
    Appropriate design, selection, interpretation and use of tests and assessment data were
    emphasized in the joint effort of the American Federation of Teachers (AFT), the
    National Council on Measurement in Education (NCME), and the National Education
    Association (NEA) to specify pedagogical skills necessary for effective assessment (AFT,
    NCME, & NEA, 1990).

    VALIDITY AND RELIABILITY IN SUMMATIVE ASSESSMENTS

    Regardless of what form a summative assessment takes or when it occurs, teachers
    need to keep in mind validity and reliability, two important technical elements of both
    classroom-level assessments and external or large-scale assessments (AERA, APA, &
    NCME, 1999). These concepts also are discussed in Chapter 3.

    Validity and reliability are judged using different criteria, although the two are related.
    Validity has different dimensions, including content (does the assessment measure the intended content
    area?), construct (does the assessment measure the intended construct or ability?) and
    instructional (was the material on the assessment taught?). It is important to consider
    the uses of assessment and the appropriateness of resulting inferences and actions as
    well (Messick, 1989). Reliability has to do with generalizing across tasks (is this a
    generalizable measure of student performance?) and can involve variability in
    performance across tasks, between settings, as well as in the consistency of scoring or
    grading.

    What these terms mean operationally varies slightly for the kinds of assessments that
    occur each day in the classroom and in the form of externally designed exams. For
    example, the ongoing classroom assessment that relies on immediate feedback provides
    different types of opportunities for follow-up when compared to a typical testing
    situation where follow-up questioning for clarification or to ensure proper
    interpretation on the part of the respondent usually is not possible (Wiliam & Black,
    1996). The dynamic nature of day-to-day teaching affords teachers opportunities to
    make numerous assessments, take relevant action, and amend decisions and
    evaluations if necessary and with time. Wiliam and Black (1996) write that "[in] the fluid action
    of the classroom, where rapid feedback is important, optimum validity depends upon
    the self-correcting nature of the consequent action" (pp. 539-540).

    With a single-test score, especially from a test administered at the end of the school
    year, a teacher does not have the opportunity to follow a response with another
    question, either to determine if the previous question had been misinterpreted or to
    probe misunderstandings for diagnostic reasons. With a standardized test, where on-
    the-spot interpretation of the student’s response by the teacher and follow-up action is
    impossible, the context in which responses are developed is ignored. Measures of
    validity are decontextualized, depending almost entirely on the collection and nature of
    the actual test items. More important, all users of assessment data (teachers,
    administrators and policy makers) need to be aware of what claims they make about a
    student’s understanding and the consequential action based on any one assessment.

    Relying on a variety of assessments, in both form and what is being assessed, will go a
    long way toward ensuring validity. Much of what is called for in the standards, such as
    inquiry, cannot be assessed in many of the multiple-choice, short-answer, or even two-
    hour performance assessments that are currently employed. Reliability, though more
    straightforward, may be more difficult to ensure than validity. On external tests, even
    when scorers are carefully calibrated (or scoring is done by a machine), variations in a
    student's performance from day to day, or from question to question, pose threats to reliability.

    Viable systems that command the same confidence as the current summative system
    but are free of many of the inherent conflicts and contradictions are necessary to make
    decisions psychometrically sound. The confidence that any assessment can demand will
    depend, in large part, on both reliability and validity (Baron, 1991; Black, 1997). As Box
    4-1 indicates, there are some basic questions to be asked of both teacher-made and
    published assessments. Teachers need to consider the technical aspect of the
    summative assessments they use in the classroom. They also should look for evidence
    that disproves earlier judgments and make necessary accommodations. Likewise, they
    should be looking for further assessment data that could help them to support their
    students ‘ learning.

    LARGE-SCALE, EXTERNAL ASSESSMENT—THE CURRENT
    SYSTEM AND NEED FOR REFORM

    Large-scale assessments at the district, state and national levels are conducted for
    different purposes: to formulate policy, monitor the effects of policies and enforce
    them, make comparisons, monitor progress towards goals, evaluate programs, and for
    accountability purposes (NRC, 1996). As a key element in the success of education-improvement
    systems, accountability has become one of the most important issues in educational
    policy today (NRC, 1999b). Accountability is a means by which policy makers at the state
    and district levels—and parents and taxpayers—monitor the performance of students
    and schools.

    BOX 4-1 Applying Validity and Reliability Concerns to Classroom Teaching

    • What am I interested in measuring? Does this assessment capture that?
    • Have the students experienced this material as part of their curriculum?
    • What can I say about a student's understandings based on the information generated from the assessment? Are those claims legitimate?
    • Are the consequences and actions that result from this performance justifiable?
    • Am I making assumptions or inferences about other knowledge, skills or abilities that this assessment did not directly assess?
    • Are there aspects of this assessment not relevant to what I am interested in assessing that may be influencing performance?
    • Have I graded consistently?
    • What could be unintended consequences associated with this assessment?

    Most states use external assessments for accountability purposes (Bernauer & Cress,
    1997). These standardized, externally designed tests are either norm-referenced tests (NRTs),
    criterion-referenced tests (CRTs), or some combination of the two. A “standardized” test
    is one that is to be carried out in the same way for all individuals tested, scored in the
    same way, and scores interpreted in the same way (Gipps, 1994). NRTs are developed
    by test publishers to measure student performance against the norm. Results from
    these tests describe what students can do relative to other students and are used for
    comparing groups of students. The norm is a rank, the 50th percentile. For national
    tests, the norm is constructed by testing students all over the country. (It also is the
    score that test-makers call “at grade level” [Bracey, 1998]). On a norm-referenced test,
    half of all students in the norm sample will score at or above the 50th percentile, or
    above grade level, and half will score below the 50th percentile, or below grade level.
    These tests compare students to other students, rather than measuring student mastery
    of content standards or curricular objectives (Burger, 1998).


    Increasingly, states and districts are moving towards criterion-referenced tests (CRTs),
    usually developed by state departments of education and districts, which compare
    student performance to a set of established criteria (for example, district, state or
    national standards) rather than comparing them to the performance of other students.
    CRTs allow all students who have acquired skills and knowledge to receive high scores
    (Burger, 1998).

    A well-designed and appropriately used standardized test can generate data that can be
    used to inform different parts of the system and to assess a range of understandings
    and skills. Currently, however, such tests generally concentrate on the knowledge most amenable to
    scoring in multiple-choice and short-answer formats. These formats most easily capture
    factual knowledge (Shavelson & Ruiz-Primo, 1999) and are the most inexpensive in
    terms of resources necessary for test development, administration, and scoring (Hardy,
    1995). Although many of the current standardized tests are intended to assess student
    achievement, too often they are used only to stimulate competition among students,
    teachers or schools, or to make other judgments that are not justified by student scores
    on such tests.

    The lack of coherence among the different levels of assessment within the system often
    leaves teachers, schools and districts torn between mandated external testing policies
    and practices, and the responsibilities of teachers to use assessment in the service of
    learning. These large-scale tests, which often command greater esteem than classroom
    assessments, create a tension for formative and summative assessment and a challenge
    for exemplary classroom practice (Black, 1997; Frederiksen, 1984; Smith & Rottenberg, 1991). Teachers are left
    facing serious dilemmas.

    BUILDING AN EXTERNAL STANDARDS-BASED
    SUMMATIVE ASSESSMENT SYSTEM

    The foundations for a standards-based summative assessment system are assessments
    that are systemically valid: aligned to the recommendations of the national standards,
    grounded in the educational system, and congruent with the educational goals for
    students. Alignment of assessment to curriculum and standards ensures that the
    assessments match the learning goals embodied in the standards and enables the
    students, parents, teachers and the public to determine student progress toward the
    standards (NRC, 1999b).

    Assessment and accountability systems cannot be isolated from their purpose: to
    improve the quality of instruction and ultimately the learning of students (NRC, 1999b).
    They also must be well understood by the interested parties and based on standards
    acceptable to all (Stecher & Herman, 1997).

    An effective system will provide students with the opportunity to demonstrate their
    understanding and skills in a variety of ways and formats. The form the assessment
    takes must follow its purpose. Multiple-choice tests are easy to grade and can quickly
    assess some forms of science-content knowledge. Other areas may be better tapped
    through open-ended questions or performance-based assessments, where students
    demonstrate their abilities and understandings such as with an actual hands-on
    investigation (Shavelson & Ruiz-Primo, 1999). Assessing inquiry skills may require
    extended investigations and can be documented through portfolios of work as it
    unfolds.

    Educators need to be cautious, deliberate, and aware of the strong influence of high-
    stakes, external tests on classroom practice, specifically the emphasis of instruction and its
    assessment (Frederiksen, 1984; Gifford & O’Connor, 1992; Goodlad, 1984; Popham,
    1992; Resnick & Resnick, 1991; Rothman, 1995; Shepard, 1995; Smith et al., 1992; Wolf
    et al., 1991) when considering, implementing, and evaluating large-scale assessment
    systems. No assessment form is immune from negative influences. Messick (1994)
    concludes

    It is not just that some aspects of multiple-choice testing may have adverse
    consequences for teaching and learning, but that some aspects of all testing, even
    performance testing, may have adverse as well as beneficial educational consequences.
    And if both positive and negative aspects, whether intended or unintended, are not
    meaningfully addressed in the validation process, then the concept of validity loses its
    force as a social value. (p. 22)


    Even well-designed assessments will need to be augmented by other assessments. Most
    criterion-referenced tests are multiple-choice or short-answer tests. Although they may
    align closely to a standards-based system, other assessment components, such as
    performance measures, where students demonstrate their understanding by doing
    something educationally desirable, also are necessary to measure standards-based
    outcomes. A long-term inquiry that constitutes a genuine scientific investigation, for
    example, cannot be captured in a single test or even in a performance assessment
    allotted for a single class period.

    LEARNING FROM CURRENT REFORM

    Beyond a Single Test


    Several states and districts are making strides in expanding external testing beyond
    traditional notions of testing to include more teacher involvement and to better align
    classroom and external summative assessments, so as to better support teaching and
    learning. The state of Vermont (VT) was one pioneer. The state sought to develop an
    assessment system that served accountability purposes as well as generated data that
    would inform instruction and improve individual achievement (Mills, 1996). The system
    had three components: Students and teachers gathered work for portfolios, teachers
    submitted a “best piece” sample for each student, and students took a standardized
    test. Scoring rubrics and exemplars were used by groups of teachers around the state to
    score the portfolios and student work samples. Despite the different pieces in place
    (which also included professional development), the VT experiment produced mixed results
    and is still evolving. The scoring of the portfolios and student work samples lacked
    adequate reliability (in the technical sense) to be used for accountability purposes
    (Koretz, Stecher, Klein, & McCaffrey, 1994). Many teachers saw a positive impact on
    student learning, due in part to the focus and feedback on specific pieces of student
    work that teachers provided to students during the collection and preparation process
    (Asp, 1998), but they also acknowledged the additional time needed for portfolio preparation
    (Koretz, Stecher, Klein, McCaffrey, & Deibert, 1993).

    Kentucky (KY) is another state that made changes to its system and faced similar
    challenges. The portfolio and performance-based assessment system in that state also
    did not achieve consistently reliable scores (Hambleton et al., 1995). Both states
    demonstrate that consistency across scores for samples of work requires training and
    time. Research on performance assessments in large-scale systems shows that
    variability in student performance across tasks also can be significant (Baron, 1991).


    Involving Teachers

    Teachers who are privy to student discussions and able to make ongoing observations
    are in the best position to assess many of the educational goals, including areas such as
    inquiry. Therefore, teachers need to become more involved in summative assessments
    for purposes beyond reporting on student progress and achievement to others in the
    system. Practices within the United States and in other countries provide us with
    possibilities of how to better tap into teachers' summative assessments to augment or
    complement external exams.

    In Queensland, Australia, for example, the state moved away from its state-wide
    examination and placed the certification of students in the hands of teachers (Butler,
    1995). Teachers meet in regional groups to exchange results and assessment methods
    with colleagues. They justify their assessments and deliberate with colleagues from
    other schools to help ensure that the different schools are holding their students to
    comparable standards and levels of achievement. Additional examples of the role of
    teacher judgment in external assessment in other countries are discussed in the next
    chapter.

    Accountability efforts that exclude teachers from assessing their students’ work are
    often justified on grounds that teachers could undermine the reliability by injecting
    undue subjectivity and personal bias. This argument has some support based on results
    of efforts in VT and KY. However, as the teachers in Queensland engage in deliberation
    and discussion (a procedure called moderation), steps are taken that mitigate the
    possible loss of reliability. To help ensure consistency among different teachers in
    moderation sessions, teachers exchange samples of student work and discuss their
    respective assessments of the work. These deliberations, in which the standards for
    judging quality work are discussed, have proved effective in developing consistency in
    scoring by the teachers. Moderation also serves as an effective form of professional
    development because teachers sharpen their perspectives about the quality of student
    work that might be expected, as is illustrated in the next chapter. In the United States,
    teacher-scoring committees for Advanced Placement exams follow this model.

    Moderation is expensive and not always practical. There are other ways to maintain
    reliability and involve teachers in summative assessments that serve accountability and
    reporting purposes. In Connecticut, the science portion of the state assessment system
    involves teachers selecting from a list of tasks and using them in conjunction with their
    own curriculum and contexts. The state provides the teachers with exemplars and
    criteria, and the teachers are responsible for scoring their own student work. Teachers can
    use the criteria in other areas of their curriculum
    throughout the year.

    Douglas County Schools in Colorado rely heavily on teacher judgments for accountability
    purposes (Asp, 1998). Teachers collect a variety of evidence of student progress towards
    district standards. Teacher-developed materials that include samples of work,
    evaluation criteria, and possible assessment tasks guide them. The county uses these
    judgments to communicate to parents and district-level monitors and decision makers.

    Examples and research can help inform large-scale assessment models so that systems
    produce useful data that inform the necessary purposes while not creating obstacles for
    quality teaching and learning. Policy and decision makers must look to and learn from
    reforms underway. After examining large scale testing practices, Asp (1998) offers keys
    to building compatibility between classroom and large-scale summative assessment
    systems. His recommendations include the following:


    • make large-scale assessment more accessible to classroom teachers;
    • embed large-scale assessment in the instructional program of the classroom in a meaningful way; and
    • use multiple measures at several levels within the system to assess individual student achievement (pp. 41-42).

    When data on individual achievement are not the aim (as is often the case when
    accountability concerns focus on an aggregate level, such as the school, district, or
    region), sampling procedures that test fewer students, and test them less frequently,
    can be options.

    The assessment systems and features discussed above are not flawless, yet there is
    much to learn from the experiences of these reforms. Current strategies and systems
    need to be modified without compromising the goal of a more aligned system. Changes
    of any kind will require support from the system and resources for designing and
    evaluating options, informing and training teachers and administrators, and educating
    the public.

    KEY POINTS

    • Tensions between formative and summative assessment do exist, but there are
    ways in which these tensions can be reduced. Some productive steps for reducing
    tensions include relying on a variety of assessment forms and measures and
    considering the purposes for the assessment and the subsequent form the
    assessment and its reporting takes.

    • Test results should be used appropriately, not to make other judgments that are
    not justified by student scores on such tests.


    • A testing program should include criterion-referenced exams and reflect the
    quality and depth of curriculum advocated by the standards.

    • For accountability purposes, external testing should not be designed in such a
    way as to be detrimental to learning, such as by limiting curricular and teaching
    activities.

    • A teacher’s position in the classroom provides opportunities to gain useful
    information for use in both formative and summative assessments. These teacher
    assessments need to be developed and tapped to best utilize the information
    that only teachers possess to augment even the best designed paper-and-pencil
    or performance-based test.

    • System-level changes are needed to reduce tensions between formative and
    summative assessments.


    failing to anticipate the myriad situations inevitable in practice (Bamberger, Rugh, & Mabry,
    2012)—hence the call for cultivating sound professional judgment (through reflective practice)
    in applying the principles and guidelines.

    Like other professional judgment decisions, appropriate ethical practice occurs throughout
    the evaluation process. It usually falls to the evaluator to lead by example, ensuring that ethical
    principles are adhered to and are balanced with the goals of the stakeholders. Brandon, Smith,
    and Hwalek (2011), in discussing a successful private evaluation firm, describe the process this
    way:

    Ethical matters are not easily or simply resolved but require working out viable solutions
    that balance professional independence with client service. These are not technical matters
    that can be handed over to well-trained staff or outside contractors, but require the
    constant, vigilant attention of seasoned evaluation leaders. (p. 306)

    In contractual engagements, the evaluator has to make a decision to move forward with a
    contract or, as Smith (1998) describes it, to determine if an evaluation contract may be “bad for
    business” (p. 178). Smith goes on to recommend declining a contract if the desired work is not
    possible at an “acceptable level of quality” (Smith, 1998, p. 178). For internal evaluators,
    turning down an evaluation contract may have career implications. The case study at the end of
    this chapter explores this dilemma. Smith (1998) cites Mabry (1997) in describing the
    challenge of adhering to ethical principles for the evaluator:

    Evaluation is the most ethically challenging of the approaches to research inquiry because
    it is the most likely to involve hidden agendas, vendettas, and serious professional and
    personal consequences to individuals. Because of this feature, evaluators need to exercise
    extraordinary circumspection before engaging in an evaluation study. (Mabry, 1997, p. 1,
    cited in Smith, 1998, p. 180)

    Cultural Competence in Evaluation Practice

    Although issues of cultural sensitivity are addressed in Chapter 5, cultural sensitivity is as
    important for quantitative evaluation as it is for qualitative evaluation. We include cultural
    competence in this section on ethics because cultural awareness is an important feature not
    only of development evaluation, where we explicitly work across cultures, but also of virtually
    any evaluation conducted in our increasingly multicultural society. Evaluations in the health,
    education, or social sectors, for example, commonly require that the evaluator have cultural
    awareness and sensitivity.

    There is growing recognition of the importance and relevance of cultural awareness in
    evaluation. Schwandt (2007) notes that “the Guiding
    Principles (as well as most of the ethical guidelines of academic and professional associations
    in North America) have been developed largely against the foreground of a Western
    framework of moral understandings” (p. 400) and are often framed in terms of individual
    behaviors, largely ignoring the normative influences of social practices and institutions. The
    AEA Guiding Principles for Evaluators include the following caveat to address the cross-
    cultural limitations of their principles:

    These principles were developed in the context of Western cultures, particularly the
    United States, and so may reflect the experiences of that context. The relevance of these
    principles may vary across other cultures, and across subcultures within the United States.
    (AEA, 2004)

    Schwandt (2007) notes that “in the Guiding Principles for evaluators, cultural competence
    is one dimension of a general principle (‘competence’) concerned with the idea of fitness or
    aptitude for the practice of evaluation” (p. 401); however, he challenges the adequacy of this
    dimension, asking “Can we reasonably argue for something like a cross cultural professional
    ethic for evaluators, and if so, what norms would it reflect?” (p. 401). Schwandt (2007) notes
    that the focus on cultural competence in evaluation has developed out of concern for “attending
    to the needs and interests of an increasingly diverse, multicultural society and the challenges of
    ensuring social equity in access to and quality of human service programs” (p. 401). In an
    imagined dialogue between two evaluators, Schwandt and Dahler-Larsen (2006) discuss
    resistance to evaluation and the practical implications for performing evaluation in
    communities. They conclude that “perhaps evaluators should listen more carefully and respond
    more prudently to voices in communities that are hesitant or skeptical about evaluation […]
    Evaluation is not only about goals and criteria, but about forms of life” (p. 504).

    THE PROSPECTS FOR AN EVALUATION PROFESSION

    In this chapter, we have emphasized the importance of acknowledging and cultivating sound
    professional judgment as part of what we believe is required to move evaluation in the
    direction of becoming a profession. In some professions, medicine being a good example, there
    is growing recognition that important parts of sound practice are tacitly learned, and that
    competent practitioners need to cultivate the capacity to reflect on their experience to develop
    an understanding of their own subjectivity and how their values, beliefs, expectations, and
    feelings affect the ways that they make decisions in their practice.

    Some evaluation associations, the Canadian Evaluation Society (CES) being the most
    prominent example, have embarked on a professionalization path that has included identifying
    core competencies for evaluators and offering members the option of applying for a
    professional designation. Knowledge (formal education), experience, and professional
    reputation are all included in the assessment process conducted by an independent panel, and
    successful applicants receive a Credentialed Evaluator designation (CES, 2012b).

    Other evaluation associations, with their emphasis on guidelines and standards for
    evaluation practice, are also embarking on a process that moves the field toward becoming
    more professional. Efforts are being made to identify core competencies (King et al., 2001),
    and discussions have outlined some of the core epistemological and methodological issues that
    would need to be addressed if evaluation is to move forward as a profession (Bickman, 1997;
    Patton, 2008). The evaluation field continues to evolve as academic and practice-based
    contributors offer new ideas, critique each other’s ideas, and develop new approaches.
    Becoming more like a profession will mean balancing the norms of professional practice (core
    body of knowledge, ethical standards, and perhaps even entry to practice requirements) with
    the ferment that continues to drive the whole field and makes it both challenging and exciting.

    Although many evaluators have made contributions that suggest we are moving toward
    making evaluation into a profession, we are not there yet. Picciotto (2011) concludes the
    following:

    Evaluation is not a profession today but could be in the process of becoming one. Much
    remains to be done to trigger the latent energies of evaluators, promote their expertise,
    protect the integrity of their practice and forge effective alliances with well wishers in
    government, the private sector and the civil society. It will take strong and shrewd
    leadership within the evaluation associations to strike the right balance between autonomy
    and responsiveness, quality and inclusion, influence and accountability. (p. 179)

    SUMMARY

    Program evaluation is partly about learning methods and how to apply them. But, because most
    evaluation settings offer only roughly appropriate opportunities to apply tools that are often
    designed for social science research settings, it is essential that evaluators learn the craft of
    working with square pegs for round holes. Evaluators and managers have in common the fact
    that they are often trained in settings that idealize the applications of the tools that they learn.
    When they enter the world of practice, they must adapt what they have learned. What works is
    determined by the context and their experiences. Experience becomes the foundation not only
    of when and how to apply tools but, more important, the essential basis for interpreting the
    information that is gathered in a given situation.

    Evaluators have the comparative luxury of time and resources to examine a program or
    policy that managers usually have to judge in situ, as it were. Even for evaluators, there are
    rarely sufficient resources to apply the tools that would yield the highest quality of data. That is
    a limitation that circumscribes what we do, but does not mean that we should stop asking
    whether and how programs work.

    This chapter emphasizes the central role played by professional judgment in the practice of
    professions, including evaluation, and the importance of cultivating sound professional
    judgment. Michael Patton, through his alter ego Halcolm, puts it this way (Patton, 2008, p.
    501):

    Forget “judge not and ye shall not be judged.”
    The evaluator’s mantra: Judge often and well so that you get better at it.

    —Halcolm

    It follows that professional programs, courses in universities, and textbooks should
    underscore for students the importance of developing and continuously improving their
    professional judgment skills, as opposed to focusing only on learning methods, facts, and
    exemplars. Practicing the craft of evaluation necessitates developing knowledge and skills that
    are tacit. These are learned through experience, refined through reflective practice, and applied
    along with the technical and rational knowledge that typically is conveyed in books and in
    classrooms. Practitioners in a profession

    begin to recognize that practice is much more messy than they were led to believe [in
    school], and worse, they see this as their own fault—they cannot have studied sufficiently
    well during their initial training.… This is not true. The fault, if there is one, lies in the
    lack of support they receive in understanding and coping with the inevitably messy world
    of practice. (Fish & Coles, 1998, p. 13)

    Fish and Coles continue thus:

    Learning to practice in a profession is an open capacity, cannot be mastered and goes on
    being refined forever. Arguably there is a major onus on those who teach courses of
    preparation for professional practice to demonstrate this and to reveal in their practice its
    implications. (p. 43)

    The ubiquity of different kinds of judgment in evaluation practice suggests that as a nascent
    profession we need to do at least three things. First, we need to fully acknowledge the
    importance of professional judgment and the role it plays in the diverse ways we practice
    evaluation. Second, we need to understand how our professional judgments are made—the
    factors that condition our own judgments. Reflective practice is critical to reaping the potential
    from experience. Third, we need to work toward self-consciously improving the ways we
    incorporate, into the education and training of evaluators, opportunities for current and future
    practitioners to improve their professional judgments. Embracing professional judgment is an
    important step toward more mature and self-reflective evaluation practice.

    Ethical evaluation practice is a part of cultivating sound judgment. Although national and
    international evaluation associations have developed principles and guidelines that include
    ethical practice, these guidelines are general and are not enforceable. Individual evaluators
    need to learn, through their reflective practice, how to navigate the ethical tradeoffs in
    situations, understanding that appropriate ethical practice will weigh the risks and benefits for
    those involved.

    DISCUSSION QUESTIONS

    1. Take a position for or against the following proposition and develop a strong one-page argument that supports your position. This is the proposition: “Be it resolved that experiments, where program and control groups are randomly assigned, are the Gold Standard in evaluating the effectiveness of programs.”

    2. What do evaluators and program managers have in common? What differences can you think of as well?

    3. What is tacit knowledge? How does it differ from public knowledge?

    4. In this chapter, we said that learning to ride a bicycle is partly tacit. For those who want to challenge this statement, try to describe learning how to ride a bicycle so that a person who has never before ridden a bicycle could get on one and ride it right away.

    5. What is mindfulness, and how can it be used to develop sound professional judgment?

    6. Why is teamwork an asset for persons who want to develop sound professional judgment?

    7. What do you think would be required to make evaluation more professional, that is, have the characteristics of a profession?

    APPENDIX

    Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator

    Fiona Barnes did not feel well as the deputy commissioner’s office door closed behind her.
    She walked back to her office wondering why bad news seems to come on Friday afternoons.

    Sitting at her desk, she went over the events of the past several days and the decision that lay
    ahead of her. This was clearly the most difficult situation that she had encountered since her
    promotion to the position of director of evaluation in the Department of Human Services.

    Fiona’s predicament had begun the day before, when the new commissioner, Fran Atkin,
    had called a meeting with Fiona and the deputy commissioner. The governor was in a difficult
    position: In his recent election campaign, he had made potentially conflicting campaign
    promises. He had promised to reduce taxes and had also promised to maintain existing health
    and social programs, while balancing the state budget.

    The week before, a loud and lengthy meeting of the commissioners in the state government
    had resulted in a course of action intended to resolve the issue of conflicting election promises.
    Fran Atkin had been persuaded by the governor that she should meet with the senior staff in
    her department, and after the meeting, a major evaluation of the department’s programs would
    be announced. The evaluation would provide the governor with some post-election breathing
    space. But the evaluation results were predetermined—they would be used to justify program
    cuts. In sum, a “compassionate” but substantial reduction in the department’s social programs
    would be made to ensure the department’s contribution to a balanced budget.

    As the new commissioner, Fran Atkin relied on her deputy commissioner, Elinor Ames.
    Elinor had been one of several deputies to continue on under the new administration and had
    been heavily committed to developing and implementing key programs in the department,
    under the previous administration. Her success in doing that had been a principal reason why
    she had been promoted to deputy commissioner.

    On Wednesday, the day before the meeting with Fiona, Fran Atkin had met with Elinor
    Ames to explain the decision reached by the governor, downplaying the contentiousness of the
    discussion. Fran had acknowledged some discomfort with her position, but she believed her
    department now had a mandate. Proceeding with it was in the public’s interest.

    Elinor was upset with the governor’s decision. She had fought hard over the years to build
    the programs in question. Now she was being told to dismantle her legacy—programs she
    believed in that made up a considerable part of her budget and person-year allocations.

    In her meeting with Fiona on Friday afternoon, Elinor had filled Fiona in on the political
    rationale for the decision to cut human service programs. She also made clear what Fiona had
    suspected when they had met with the commissioner earlier that week—the outcomes of the
    evaluation were predetermined: They would show that key programs where substantial
    resources were tied up were not effective and would be used to justify cuts to the department’s
    programs.

    Fiona was upset with the commissioner’s intended use of her branch. Elinor, watching
    Fiona’s reactions closely, had expressed some regret over the situation. After some hesitation,
    she suggested that she and Fiona could work on the evaluation together, “to ensure that it
    meets our needs and is done according to our standards.” After pausing once more, Elinor
    added, “Of course, Fiona, if you do not feel that the branch has the capabilities needed to
    undertake this project, we can contract it out. I know some good people in this area.”

    Fiona was shown to the door and asked to think about it over the weekend.

    Fiona Barnes took pride in her growing reputation as a competent and serious director of a
    good evaluation shop. Her people did good work that was viewed as being honest, and they
    prided themselves on being able to handle any work that came their way. Elinor Ames had
    appointed Fiona to the job, and now this.

    Your Task

    Analyze this case and offer a resolution to Fiona’s dilemma. Should Fiona undertake the
    evaluation project? Should she agree to have the work contracted out? Why?

    In responding to this case, consider the issues on two levels: (1) look at the issues taking
    into account Fiona’s personal situation and the “benefits and costs” of the options available to
    her and (2) look at the issues from an organizational standpoint, again weighing the “benefits
    and the costs.” Ultimately, you will have to decide how to weigh the benefits and costs from
    both Fiona’s and the department’s standpoints.

    REFERENCES

    Abercrombie, M. L. J. (1960). The anatomy of judgment: An investigation into the processes of
    perception and reasoning. New York: Basic Books.

    Altschuld, J. (1999). The certification of evaluators: Highlights from a report submitted to the
    Board of Directors of the American Evaluation Association. American Journal of
    Evaluation, 20(3), 481–493.

    American Evaluation Association. (1995). Guiding principles for evaluators. New Directions
    for Program Evaluation, 66, 19–26.

    American Evaluation Association. (2004). Guiding principles for evaluators. Retrieved from
    http://www.eval.org/Publications/GuidingPrinciples.asp

    Ayton, P. (1998). How bad is human judgment? In G. Wright & P. Goodwin (Eds.),
    Forecasting with judgement (pp. 237–267). Chichester, West Sussex, UK: John Wiley.

    Bamberger, M., Rugh, J., & Mabry, L. (2012). Real world evaluation: Working under budget,
    time, data, and political constraints (2nd ed.). Thousand Oaks, CA: Sage.

    Basilevsky, A., & Hum, D. (1984). Experimental social programs and analytic methods: An
    evaluation of the U.S. income maintenance projects. Orlando, FL: Academic Press.

    Berk, R. A., & Rossi, P. H. (1999). Thinking about program evaluation (2nd ed.). Thousand
    Oaks, CA: Sage.

    Bickman, L. (1997). Evaluating evaluation: Where do we go from here? Evaluation Practice,
    18(1), 1–16.

    Brandon, P., Smith, N., & Hwalek, M. (2011). Aspects of successful evaluation practice at an
    established private evaluation firm. American Journal of Evaluation, 32(2), 295–307.

    Campbell Collaboration. (2010). About us. Retrieved from
    http://www.campbellcollaboration.org/about_us/index.php

    Campbell, D. T. (1991). Methods for the experimenting society. Evaluation Practice, 12(3),
    223–260.

    Canadian Evaluation Society. (2012a). CES guidelines for ethical conduct. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=5&ss=4&_lang=en

    Canadian Evaluation Society. (2012b). Program evaluation standards. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=6&ss=10&_lang=EN

    Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of
    Canada, & Social Sciences and Humanities Research Council of Canada. (2010). Tri-council
    policy statement: Ethical conduct for research involving humans, December 2010.
    Retrieved from http://www.pre.ethics.gc.ca/pdf/eng/tcps2/TCPS_2_FINAL_Web

    Chen, H. T., Donaldson, S. I., & Mark, M. M. (2011). Validity frameworks for outcome
    evaluation. New Directions for Evaluation, 2011(130), 5–16.

    Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for
    field settings. Chicago, IL: Rand McNally.

    Cook, T. D., Scriven, M., Coryn, C. L., & Evergreen, S. D. (2010). Contemporary thinking
    about causation in evaluation: A dialogue with Tom Cook and Michael Scriven. American
    Journal of Evaluation, 31(1), 105–117.

    Cooksy, L. J. (2008). Challenges and opportunities in experiential learning. American Journal
    of Evaluation, 29(3), 340–342.

    Cronbach, L. J. (1980). Toward reform of program evaluation (1st ed.). San Francisco, CA:
    Jossey-Bass.

    Cronbach, L. J. (1982). Designing evaluations of educational and social programs (1st ed.).
    San Francisco, CA: Jossey-Bass.

    Epstein, R. M. (1999). Mindful practice. Journal of the American Medical Association, 282(9),
    833–839.

    Epstein, R. M. (2003). Mindful practice in action (I): Technical competence, evidence-based
    medicine, and relationship-centered care. Families, Systems & Health, 21(1), 1–9.

    Epstein, R. M., Siegel, D. J., & Silberman, J. (2008). Self-monitoring in clinical practice: A
    challenge for medical educators. Journal of Continuing Education in the Health
    Professions, 28(1), 5–13.

    Eraut, M. (1994). Developing professional knowledge and competence. Washington, DC:
    Falmer Press.

    Fish, D., & Coles, C. (1998). Developing professional judgement in health care: Learning
    through the critical appreciation of practice. Boston, MA: Butterworth-Heinemann.

    Ford, R., Gyarmati, D., Foley, K., Tattrie, D., & Jimenez, L. (2003). Can work incentives pay
    for themselves? Final report on the Self-Sufficiency Project for welfare applicants. Ottawa,
    Ontario, Canada: Social Research and Demonstration Corporation.

    Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

    Ghere, G., King, J. A., Stevahn, L., & Minnema, J. (2006). A professional development unit
    for reflecting on program evaluator competencies. American Journal of Evaluation, 27(1),
    108–123.

    Gibbins, M., & Mason, A. K. (1988). Professional judgment in financial reporting. Toronto,
    Ontario, Canada: Canadian Institute of Chartered Accountants.

    Gustafson, P. (2003). How random must random assignment be in random assignment
    experiments? Ottawa, Ontario, Canada: Social Research and Demonstration Corporation.

    Henry, G. T., & Mark, M. M. (2003). Toward an agenda for research on evaluation. New
    Directions for Evaluation, 97, 69–80.

    Higgins, J., & Green, S. (Eds.). (2011). Cochrane handbook for systematic reviews of
    interventions: Version 5.0.2 (updated March 2011). The Cochrane Collaboration 2011.
    Retrieved from www.cochrane-handbook.org

    House, E. R., & Howe, K. R. (1999). Values in evaluation and social research. Thousand
    Oaks, CA: Sage.

    Human Resources Development Canada. (1998). Quasi-experimental evaluation (Publication
    No. SP-AH053E-01–98). Ottawa, Ontario, Canada: Evaluation and Data Development
    Branch.

    Hurteau, M., Houle, S., & Mongiat, S. (2009). How legitimate and justified are judgments in
    program evaluation? Evaluation, 15(3), 307–319.

    Jewiss, J., & Clark-Keefe, K. (2007). On a personal note: Practical pedagogical activities to
    foster the development of “reflective practitioners.” American Journal of Evaluation, 28(3), 334–347.

    Katz, J. (1988). Why doctors don’t disclose uncertainty. In J. Dowie & A. S. Elstein (Eds.),
    Professional judgment: A reader in clinical decision making (pp. 544–565). Cambridge,
    MA: Cambridge University Press.

    Kelling, G. L. (1974a). The Kansas City preventive patrol experiment: A summary report.
    Washington, DC: Police Foundation.

    Kelling, G. L. (1974b). The Kansas City preventive patrol experiment: A technical report.
    Washington, DC: Police Foundation.

    King, J. A., Stevahn, L., Ghere, G., & Minnema, J. (2001). Toward a taxonomy of essential
    evaluator competencies. American Journal of Evaluation, 22(2), 229–247.

    Kitchener, K. S. (1984). Intuition, critical evaluation and ethical principles: The foundation for
    ethical decisions in counseling psychology. The Counseling Psychologist, 12(3), 43–55.

    Krasner, M. S., Epstein, R. M., Beckman, H., Suchman, A. L., Chapman, B., Mooney, C. J., &
    Quill, T. E. (2009). Association of an educational program in mindful communication with
    burnout, empathy, and attitudes among primary care physicians. Journal of the American
    Medical Association, 302(12), 1284–1293.

    Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago
    Press.

    Kundin, D. M. (2010). A conceptual framework for how evaluators make everyday practice
    decisions. American Journal of Evaluation, 31(3), 347–362.

    Larson, R. C. (1982). Critiquing critiques: Another word on the Kansas City preventive patrol
    experiment. Evaluation Review, 6(2), 285–293.

    Levin, H. M., & McEwan, P. J. (Eds.). (2001). Cost-effectiveness analysis: Methods and
    applications (2nd ed.). Thousand Oaks, CA: Sage.

    Mabry, L. (1997). Ethical landmines in program evaluation. In R. E. Stake (Chair), Grounds
    for turning down a handsome evaluation contract. Symposium conducted at the meeting of
    the AERA, Chicago, IL.

    Mark, M. M., Henry, G. T., & Julnes, G. (2000). Evaluation: An integrated framework for
    understanding, guiding, and improving policies and programs (1st ed.). San Francisco,
    CA: Jossey-Bass.

    Mason, J. (2002). Qualitative researching (2nd ed.). Thousand Oaks, CA: Sage.

    Mayne, J. (2008). Building an evaluative culture for effective evaluation and results
    management. Retrieved from http://www.cgiar-ilac.org/files/publications/briefs/ILAC_Brief20_Evaluative_Culture

    Modarresi, S., Newman, D. L., & Abolafia, M. Y. (2001). Academic evaluators versus
    practitioners: Alternative experiences of professionalism. Evaluation and Program
    Planning, 24(1), 1–11.

    Morris, M. (1998). Ethical challenges. American Journal of Evaluation, 19(3), 381–382.

    Morris, M. (Ed.). (2008). Evaluation ethics for best practice: Cases and commentaries. New
    York: Guilford Press.

    Morris, M. (2011). The good, the bad, and the evaluator: 25 years of AJE ethics. American
    Journal of Evaluation, 32(1), 134–151.

    Mowen, J. C. (1993). Judgment calls: High-stakes decisions in a risky world. New York:
    Simon & Schuster.

    Newman, D. L., & Brown, R. D. (1996). Applied ethics for program evaluation. Thousand
    Oaks, CA: Sage.

    No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425.

    Office of Management and Budget. (2004). What constitutes strong evidence of a program’s
    effectiveness? Retrieved from http://www.whitehouse.gov/omb/part/2004_program_eval

    Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand
    Oaks, CA: Sage.

    Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

    Pawson, R., & Tilley, N. (1997). Realistic evaluation. Thousand Oaks, CA: Sage.

    Picciotto, R. (2011). The logic of evaluation professionalism. Evaluation, 17(2), 165–180.

    Polanyi, M. (1958). Personal knowledge: Towards a post-critical philosophy. Chicago, IL:
    University of Chicago Press.

    Polanyi, M., & Grene, M. G. (1969). Knowing and being: Essays. Chicago, IL: University of
    Chicago Press.

    Rossi, P. H., Lipsey, M. W., & Freeman, H. E. (2004). Evaluation: A systematic approach.
    Thousand Oaks, CA: Sage.

    Sanders, J. R. (1994). Publisher description for the program evaluation standards: How to
    assess evaluations of educational programs. Retrieved from
    http://catdir.loc.gov/catdir/enhancements/fy0655/94001178-d.html

    Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching
    and learning in the professions (1st ed.). San Francisco, CA: Jossey-Bass.

    Schön, D. A. (1988). From technical rationality to reflection-in-action. In J. Dowie & A. S.
    Elstein (Eds.), Professional judgment: A reader in clinical decision making (pp. 60–77).
    New York: Cambridge University Press.

    Schwandt, T. A. (2000). Three epistemological stances for qualitative enquiry. In N. K. Denzin
    & Y. S. Lincoln (Eds.), Handbook of qualitative research (2nd ed., pp. 189–213).
    Thousand Oaks, CA: Sage.

    Schwandt, T. A. (2007). Expanding the conversation on evaluation ethics. Evaluation and
    Program Planning, 30(4), 400–403.

    Schwandt, T. A. (2008). The relevance of practical knowledge traditions to evaluation practice.
    In N. L. Smith & P. R. Brandon (Eds.), Fundamental issues in evaluation (pp. 29–40). New
    York: Guilford Press.

    Schwandt, T. A., & Dahler-Larsen, P. (2006). When evaluation meets the “rough ground” in
    communities. Evaluation, 12(4), 496–505.

    Schweigert, F. J. (2007). The priority of justice: A framework approach to ethics in program
    evaluation. Evaluation and Program Planning, 30(4), 394–399.

    Scriven, M. (1994). The final synthesis. Evaluation Practice, 15(3), 367–382.
    Scriven, M. (2004). Causation. Unpublished manuscript, University of Auckland, Auckland,
    New Zealand.

    Scriven, M. (2008). A summative evaluation of RCT methodology & an alternative approach
    to causal research. Journal of Multidisciplinary Evaluation, 5(9), 11–24.

    Seiber, J. (2009). Planning ethically responsible research. In L. Bickman & D. Rog (Eds.), The
    Sage handbook of applied social research methods (2nd ed., pp. 106–142). Thousand
    Oaks, CA: Sage.

    Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
    designs for generalized causal inference. Boston, MA: Houghton Mifflin.

    Simons, H. (2006). Ethics in evaluation. In I. Shaw, J. Greene, & M. M. Mark (Eds.), The Sage
    handbook of evaluation (pp. 243–265). Thousand Oaks, CA: Sage.

    Skolits, G. J., Morrow, J. A., & Burr, E. M. (2009). Reconceptualizing evaluator roles.
    American Journal of Evaluation, 30(3), 275–295.

    Smith, M. L. (1994). Qualitative plus/versus quantitative: The last word. New Directions for
    Program Evaluation, 61, 37–44.

    Smith, N. L. (1998). Professional reasons for declining an evaluation contract. American
    Journal of Evaluation, 19(2), 177–190.

    Smith, N. L. (2007). Empowerment evaluation as evaluation ideology. American Journal of
    Evaluation, 28(2), 169–178.

    CHAPTER 11

    PROGRAM EVALUATION AND PROGRAM
    MANAGEMENT

    Joining Theory and Practice

    Introduction
    Can Management and Evaluation Be Joined? An Overview of the Issues
    Evaluators and Managers as Partners in Evaluation

    Building an Evaluative Culture in Organizations: An Expanded Role for Evaluators

    Creating Ongoing Streams of Evaluative Knowledge

    Obstacles to Building and Sustaining an Evaluative Culture

    Manager Involvement in Evaluations: Limits and Opportunities

    Intended Evaluation Uses and Managerial Involvement

    Evaluating for Accountability

    Evaluating for Program Improvement

    Manager Bias in Evaluations: Limits to Manager Involvement

    Striving for Objectivity in Program Evaluations

    Can Program Evaluators Be Objective?

    Looking for a Defensible Definition of Objectivity

    A Natural Science Definition of Objectivity

    Implications for Evaluation Practice

    Criteria for High-Quality Evaluations: The Varying Views of Evaluation Associations
    Summary
    Discussion Questions
    References

    INTRODUCTION

    Chapter 11 explores the relationship between program managers and evaluators, and how that
    relationship is influenced by evaluation purposes and organizational contexts. We begin by
    reviewing Wildavsky’s (1979) seminal work on this relationship. Because Wildavsky was
    skeptical that organizations could be self-evaluating, we then look at organizational cultures
    that support evaluation. Given that many evaluators do their work as participants in the
    organizations in which they do evaluations, we describe the ways in which internal evaluations
    can occur in such organizations. An evaluative culture is a special case where evaluative
    thinking and practices have been suffused throughout the organization, and we discuss the
    prospects for realizing such cultures in contemporary public sector organizations. We then turn
    to the limitations and opportunities for how managers can be involved in evaluations and how
    the differences between formative and summative evaluations offer incentives that can bias
    manager involvement in evaluations of their own programs.

    The last part of Chapter 11 looks at the question of whether program evaluations can be
    objective. We discuss what it would take for evaluations to be objective and whether it is
    possible to claim that evaluations are objective. Finally, based on the guidelines and principles
    offered by evaluation associations, we offer some general guidance for evaluators in
    positioning themselves as practitioners able to make claims for doing high-quality evaluations.

    Program evaluation is intended to be a flexible and situation-specific means of answering
    program questions, testing hypotheses, and understanding program processes and outcomes.
    Evaluations can focus on a broad range of issues, from needs to program resources to
    program outcomes. They generally are intended to yield information that reduces the level of
    uncertainty about the issues that prompted the evaluation.

    As we learned in Chapter 1, program evaluations can be formative; that is, they can aim at
    producing findings, conclusions, and recommendations that are intended to improve the
    program. Formative evaluations are typically done with a view to offering program and
    organizational managers information that they can use to improve the efficiency and/or the
    effectiveness of an existing program. Generally, questions about the continuation of support for
    the program itself are not part of formative evaluation agendas.

    Program evaluations can also be summative—that is, intended to render judgments on the
    value of the program. Summative evaluations are more directly linked to accountability
    requirements that are often built into the program management cycle, which was introduced in
    Chapter 1. Summative evaluations can focus on issues that are similar to those included in
    formative evaluations (e.g., program effectiveness), but the intention is to produce information
    that can be used to make decisions about the program’s future, such as whether to reallocate
    resources elsewhere or whether to terminate the program. Typically, summative program
    evaluations entail some kind of external reporting that may include government central
    agencies as a key stakeholder. In Canada, for example, most program evaluations conducted by
    federal departments and agencies are made public, and Treasury Board, as the principal central
    agency responsible for expenditure management across the government, is a recipient of the
    evaluations.

    The purposes of an evaluation affect the relationships between evaluators, managers, and
    other stakeholders. Generally, managers are more likely to view formative evaluations as
    “friendly” evaluations and, hence, are more likely to be willing to cooperate with the
    evaluators. They have an incentive to do so because the evaluation is intended to assist them
    without raising questions that could result in major changes, including reductions to or even
    the elimination of a program.

    Summative evaluations are generally viewed quite differently. Program managers face
    different incentives in providing information or even participating in such an evaluation.
    Notwithstanding the efforts by some organizations to build evaluative cultures (Mayne, 2008;
    Mayne & Rist, 2006) wherein managers are encouraged to treat mistakes and perhaps even
    program-related failures as opportunities to learn, the future of their programs may be at stake.

    From an evaluator’s standpoint, then, the experience of conducting a formative evaluation
    can be quite different from conducting a summative evaluation. The type of evaluation can also
    affect the evaluator’s relationship with the program manager(s). Typically, program evaluators
    depend on program managers to provide key information and to arrange access to people, data
    sources, and other sources of evaluation information (Chelimsky, 2008). Securing and
    sustaining cooperation is affected by the purposes of the evaluation—managerial reluctance or
    strategies to “put the best foot forward” might well be expected where the stakes include the
    future of the program itself. As Norris (2005) says, “Faced with high-stakes targets and the
    paraphernalia of the testing and performance measurement that goes with them, practitioner
    and organizations sometimes choose to dissemble” (p. 585).

    CAN MANAGEMENT AND EVALUATION BE JOINED? AN
    OVERVIEW OF THE ISSUES

    How does program evaluation, as a part of the performance management cycle, relate to
    program management? Are program evaluation and program management compatible roles in
    public and nonprofit organizations?

    Wildavsky (1979), in his seminal book Speaking Truth to Power, introduced his discussion
    of management and evaluation this way:

    Why don’t organizations evaluate their own activities? Why don’t they seem to manifest
    rudimentary self-awareness? How long can people work in organizations without
    discovering their objectives or determining how well they are carried out? I started out
    thinking that it was bad for organizations not to evaluate, and I ended up wondering why
    they ever do it. Evaluation and organization, it turns out, are somewhat contradictory. (p.
    212)

    When he questioned joining together management and evaluation, Wildavsky chiefly had
    in mind summative evaluations where the future of programs, and possibly reallocation of
    funding, would be an issue. Historically, the federal government of Canada, for example,
    offered this definition of program evaluation in its first publication on the purposes and scope
    of the then new evaluation function in federal departments and agencies:

    Program evaluation in federal departments and agencies should involve the systematic
    gathering of verifiable information on a program and demonstrable evidence on its results
    and cost-effectiveness. Its purpose should be to periodically produce credible, timely,
    useful and objective findings on programs appropriate for resource allocation, program
    improvement and accountability. (Office of the Comptroller General [OCG] of Canada,
    1981, p. 3)

    Central agencies still maintain this chiefly summative focus on evaluations. In its statement
    of the purposes of program evaluation, the Treasury Board of Canada Secretariat (2009), the
    central agency responsible for the government-wide evaluation function, offers a view of
    evaluation that is substantially the same as that offered nearly three decades earlier. In its
    “Policy on Evaluation,” the principal rationale for evaluation is that “evaluation provides
    Canadians, Parliamentarians, Ministers, central agencies and deputy heads an evidence-based,
    neutral assessment of the value for money, i.e. relevance and performance, of federal
    government programs” (p. 3). The main thrust of the policy is clearly a summative view of
    evaluation that focuses on “resource allocation and reallocation” and “providing objective
    information to help Ministers understand how new spending proposals fit with existing
    programs, identifying synergies and avoid wasteful duplication” (p. 3).

    If evaluations are to be used to reallocate resources as well as to improve programs,
    organizations must have the capacity to participate in and respond to evaluations that have both
    formative and summative facets. This suggests an image of organizations that are amenable to
    rethinking existing commitments—managers would need to balance attachment to the stability
    of their programs with attachment to the evidence-based evaluation process. The
    rational/technical view of organizations (de Lancer Julnes & Holzer, 2001), which we
    discussed in Chapter 9, suggests that within such organizations, decision making would be
    based on evidence, managers and workers would behave in ways that do not undermine a
    results-focused culture, and summative evaluations would be welcomed as a part of regular
    management processes.

    Wildavsky’s (1979) view of organizations as settings where “speaking truth to power” is a
    challenge is similar to the political/cultural image of organizations offered by de Lancer Julnes
    and Holzer (2001). Wildavsky views the respective roles of evaluators and managers as painted
    in contrasting colors. Evaluators are described as people who question assumptions, who are
    skeptical, who are detached, who view organizations/programs as means and not ends in
    themselves, whose currency is evidence, and who ultimately focus on the social needs that the
    program serves rather than on organizational needs.

    By contrast, in Wildavsky’s view, organizational/program managers can be characterized
    as people who are committed to their programs, who are advocates for what they do and what
    their programs do, and who do not want to see their commitments curtailed or their resources
    diminished.

    How, then, even for formative evaluation capacity, do organizations resolve the question of
    who has the power and authority to make decisions, who constructs evaluation information,
    and who controls its interpretation and distribution? In one scenario, evaluators could be a
    central part of program and policy design, implementation, and assessment of results. They
    may suggest that new programs or policies should be implemented as experiments or quasi-
    experiments (perhaps as pilot programs), with clear objectives, well-constructed comparisons,
    baseline measurements, and sufficient control over the implementation process, to ensure the
    internal and construct validities of the evaluation process. This view of trying out new
    programs was the essence of Donald Campbell’s image of the experimenting society (Watson,
    1986).
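
    To make the experimental option concrete, the sketch below shows what random assignment for a hypothetical pilot program might look like in Python. The applicant list, the baseline scores, and the 50/50 split are illustrative assumptions; the snippet does not describe Campbell's work or any study cited in this chapter.

    import random
    import statistics

    def randomly_assign(units, seed=7):
        """Randomly split pilot-program applicants into program and control groups.
        The 50/50 split and fixed seed are illustrative choices."""
        rng = random.Random(seed)
        shuffled = list(units)
        rng.shuffle(shuffled)
        midpoint = len(shuffled) // 2
        return shuffled[:midpoint], shuffled[midpoint:]

    # Hypothetical applicants with hypothetical baseline scores (not from any cited study).
    baseline = {f"applicant_{i:02d}": random.Random(i).uniform(40, 60) for i in range(1, 21)}
    program_group, control_group = randomly_assign(list(baseline))

    # With random assignment, baseline means should be roughly similar across groups,
    # which is what makes later comparisons of outcomes interpretable.
    print("Program baseline mean:", round(statistics.mean(baseline[a] for a in program_group), 1))
    print("Control baseline mean:", round(statistics.mean(baseline[a] for a in control_group), 1))

    As the surrounding discussion notes, the difficulty is rarely the mechanics of assignment; it is the political and ethical cost of withholding the program from the control group.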

    Managers, however, may prefer to implement programs to more immediately meet
    organizational and client needs. Objectives may, in that case, be stated in ways that facilitate
    flexible interpretations of what was important to convey, depending on the audience. Managers
    would want program objectives to be able to withstand the scrutiny of stakeholders with
    different values and expectations. As we might anticipate, experimentation can create political
    problems: What does the organization tell prospective clients who want the program but cannot
    get access to it because they are members of a “control group”? What do executives tell the
    elected officials when client groups question either the lack of flexibility in the service (to
    maintain construct validity of the evaluation) or its lack of availability (to increase internal
    validity of the evaluation)?

    Where the evaluation function is internal, it may be much more challenging to experiment
    with a program before its launch. An example of the dilemmas and controversies involved in
    designing and implementing a randomized controlled trial in a setting where there is an acute
    social need is the New York City Department of Homeless Services’ 2-year experiment to
    evaluate the Homebase program. The Homebase program is intended to provide housing-
    related services to families that are at risk of homelessness or are already homeless. The evaluation began
    in the fall of 2010, and for the ensuing 2 years, those in the control group (200 families) were
    excluded from accessing the bundle of services that constitute the Homebase program (New
    York City Department of Homeless Services, 2010).

    The social dilemmas inherent in this kind of situation raise the question: Where should the
    evaluation function be located in organizations, or even governments? One possible solution is
    to make program evaluation an external function. Thus, evaluators would be a part of an
    agency that is not under the administrative control of the organization’s managers. This
    solution, however, does face challenges as well. In British Columbia, for example, the
    Secretary of Treasury Board at one point outlined a plan for creating a centralized evaluation
    capacity in the government (Wolff, 1979). This approach would have been similar to the way
    external auditors function in governments. Treasury Board analysts housed in that central
    agency would have conducted evaluations of line department programs with a view to
    preparing reports for Treasury Board managers. The plan was never implemented, however, in
    part because the line departments strongly objected to the creation of a central evaluation unit
    that would not be accountable to line department executives. In fact, at that point, some
    departments were developing in-house evaluation units, which were intended to perform
    functions that executives argued would be duplicated by any centralized evaluation unit.

    Centralized evaluation functions have certainly been developed for summative evaluation
    purposes. Under the Bush administration in the United States, the Office of Management and
    Budget (OMB), an executive agency responsible for budget preparation and expenditure
    management, was responsible for assessing all federal programs on a cyclical basis using the
    Program Assessment Rating Tool (PART) process (U.S. OMB, 2002, 2004). From 2002
    through 2009, OMB assessed about 20% of all programs every year. These PART reviews
    were, in effect, summative evaluations that relied in part on existing program evaluation and
    performance measurement information, but offered an independent assessment conducted by
    OMB analysts.

    EVALUATORS AND MANAGERS AS PARTNERS IN
    EVALUATION

    Wildavsky’s (1979) view of self-evaluating organizations was quite pessimistic and reflected a
    view that saw evaluation as a form of research best done by those who had some distance from
    the programs being evaluated. He saw evaluation and management as being quite separate,
    with distinct roles for managers and evaluators. But in the past several decades, there has been
    a broad movement in the field of evaluation to find ways of knitting evaluation and
    management together. Instead of seeing evaluation as an activity that challenges management,
    this contrasting view assumes that evaluators can work with managers to define and execute
    evaluations that combine the best of what both parties bring to that relationship. Utilization-
    focused evaluation (Patton, 2008), for example, is premised on producing evaluations that
    managers and other stakeholders will use—and ensuring use means developing a working
    relationship between evaluators and managers. Managers are expected to be participants in the
    evaluation process. Patton (1997) characterizes the role of the evaluator this way:

    The evaluator facilitates judgment and decision-making by intended users rather than
    acting as a distant, independent judge. Since no evaluation can be value-free, utilization-
    focused evaluation answers the question of whose values will frame the evaluation by
    working with clearly identified, primary intended users who have responsibility to apply
    evaluation findings and implement recommendations. In essence, I shall argue, evaluation
    use is too important to be left to evaluators. (p. 21)

    Utilization-focused evaluation (Patton, 2008) and participatory evaluation (Cousins &
    Whitmore, 1998) are among a growing number of approaches that emphasize the importance
    of evaluators engaging with, and in some respects becoming a part of, the organizations in
    which they do their work. The traditional view of evaluators as experts who conduct arms-
    length “evaluation studies” of programs, and offer their written reports to stakeholders at the
    end of the process, is giving way to the view that evaluators should not stand aside from
    organizations but instead should get involved (Mayne & Rist, 2006).

    Cousins and Whitmore (1998) suggest that the evaluation team and the practitioner team
    both need to be committed to improving the program. The evaluation process—identifying the
    key questions, design of the evaluation, collection of the data, and reporting of the results—can
    be shared between the evaluators and the practitioners (see also King, Cousins, & Whitmore,
    2007).

    Love (1991) elaborated an approach that is premised on the assumption that evaluators can
    be a part of organizations (i.e., paid employees who report to organizational executives) and
    can contribute to improving the efficiency and effectiveness of programs. For Love, “internal
    evaluation is the process of using staff members who have the responsibility for evaluating
    programs or problems of direct relevance to an organization’s managers” (p. 2).

    Internal evaluation units are common and are the norm in some governments. In the federal
    government of Canada, for example, each department or agency typically has its own
    evaluation unit, which reports to the administrative head of that organization. These units are
    expected to work with departmental executives and managers to identify evaluation priorities
    and undertake program evaluations. Although external consultants are often hired to conduct
    parts of such projects, they are managed by internal evaluators.

    Love (1991) outlines six stages in the development of internal evaluation capacity,
    beginning with ad hoc program evaluations and ending with strategically focused cost–benefit
    analyses:

    • Ad hoc evaluations focused on single programs
    • Regular evaluations that describe program processes and results
    • Program goal setting, measurement of program outcomes, program monitoring, adjustment
    • Evaluations of program effectiveness, improving organizational performance
    • Evaluations of technical efficiency and cost-effectiveness
    • Strategic evaluations including cost–benefit analyses

    These six stages can be seen as a gradual transformation of the intentions of evaluations
    from formative to summative purposes. Love (1991) highlights the importance of an internal
    working environment where organizational members are encouraged to participate in
    evaluations, and where trust of evaluators and their commitment to the organization is part of
    the culture. What Love is suggesting in his approach is that it is possible to transform an
    organizational culture so that it embraces evaluation as a strategic asset. We will consider the
    prospects for building evaluative cultures in the next section of this chapter.

    Building an Evaluative Culture in Organizations: An Expanded Role for Evaluators

    Mayne (2008) and Patton (2011) are among the advocates for a broader role for evaluation
    and evaluators in organizations. Like Love (1991), their view is that it is possible to build
    organizational capacity to perform evaluation that ultimately transforms the organization.
    Mayne (2008) has outlined the key features of an evaluative culture. We have summarized his
    main points in Table 11.1.

    For Mayne (2008) and Mayne and Rist (2006), the roles of evaluators are broader than
    doing evaluation studies/projects—they need to encompass knowledge management for the
    organization. Evaluators need to be prepared to engage with executives and program managers,
    offer them advice and assistance, take a lead role in training and other kinds of events that
    showcase and mainstream evaluation, and generally play a supportive role in building an
    organizational culture that values and relies on timely, reliable, valid, and relevant information
    on programs and policies. In Wildavsky’s (1979) words, an evaluative culture is one wherein
    both managers and evaluators feel supported in “speaking truth to power.”

    Table 11.1 Characteristics of an Evaluative Culture in Organizations

    An organization that has a strong evaluative culture:

    • Engages in self-reflection and self-examination by
    ◦ Seeing evidence on what it is achieving, using both monitoring and evaluation approaches
    ◦ Using evidence of results to challenge and support what it is doing
    ◦ Valuing candor, challenge, and genuine dialogue both horizontally and vertically within the organization
    • Engages in evidence-based learning by
    ◦ Allocating time and resources for learning events
    ◦ Acknowledging and learning from mistakes and poor performance
    ◦ Encouraging and modeling knowledge sharing and fostering the view that knowledge is a resource and not a political weapon
    • Encourages experimentation and change by
    ◦ Supporting program and policy implementation in ways that facilitate evaluation and learning
    ◦ Supporting deliberate risk taking
    ◦ Seeking out new ways of doing business

    Source: Adapted from Mayne (2008, p. 1).

    Organizations with evaluative cultures can also be seen as learning organizations. Morgan
    (2006), following on Senge (1990), suggests that learning organizations develop capacities to

    • Scan and anticipate change in the wider environment to detect significant variations …
    • Develop an ability to question, challenge, and change operating norms and assumptions
    • Allow an appropriate strategic direction and pattern of organization to emerge. (Morgan, 2006, p. 87)

    Key to establishing a learning organization is what Morgan (2006) calls double-loop
    learning—that is, learning that critically assesses existing organizational goals and priorities in
    light of evidence and includes options for adopting new goals and objectives. Organizations
    must get outside their established structures and procedures and instead focus on processes to
    create new information, which in turn can be used to challenge the status quo and make
    changes.

    Garvin (1993) has suggested five “building blocks” for creating learning organizations,
    which are similar to key characteristics of organizations that have evaluative cultures: (1)
    systematic problem solving using evidence, (2) experimentation and evaluation of outcomes
    before broader implementation, (3) learning from past performance, (4) learning from others,
    and (5) treating knowledge as a resource that should be widely communicated.

    Creating Ongoing Streams of Evaluative Knowledge

    Streams of evaluative knowledge comprise both program evaluations and performance
    measurement results (Rist & Stame, 2006). In Chapter 9, we outlined 12 steps that are
important in building and sustaining performance measurement systems in organizations. In that chapter, we also discussed the importance of real-time performance measurement and of making results available to managers. By itself, building a performance measurement system to meet
    periodic external accountability expectations will not ensure that performance information will
    be used internally by organizational managers. The same point can apply to program
    evaluation. Key to a working evaluative culture would be the usefulness of ongoing evaluative
    information to managers, and the responsiveness of evaluators to managerial priorities.

    Patton (1994, 2011) has introduced developmental evaluation as an alternative to
    formative and summative program evaluations. Developmental evaluations view organizations
    as co-evolving in complex environments. Organizational objectives (and hence program
    objectives) and/or the organizational environment may be in flux. Conventional evaluation
    approaches that assume a relatively static program structure in which it is possible to build
    logic models, for example, may have limited application in co-evolving settings. Patton
    suggests that evaluators should take on the role of organizational development specialists,
    working with managers and other stakeholders as team members to offer evaluative
    information in real time so that programs and policies can take advantage of a range of periodic
    and dynamic evaluative information.

    Obstacles to Building and Sustaining an Evaluative Culture

    What are the prospects for building evaluative cultures? Recall that in Chapter 10, we
    suggested that adversarial political cultures can inhibit developing and sustaining performance
    measurement and reporting systems—one effect of making performance results high stakes

    where there are significant internal consequences to reporting performance failures is to
    discourage managers from using externally reported performance results for internal
    management purposes. In effect, managers, when confronted by situations where public
    performance results need to be sanitized or at least carefully presented to reduce political risks,
    tend to decouple those measures from internal performance management uses, preferring
    instead to develop and use other measures that remain internal to the organization.

    Mayne (2008), Mayne and Rist (2006), Patton (2011), and other proponents of evaluative
    cultures are offering us a normative view of what “ought” to occur in organizations. But many
    public sector and nonprofit organizations have to navigate environments or governments that
are adversarial, engendering negative consequences for managers (and their political masters) if
    programs or policies are not “successful,” or if candid information about the weaknesses in
    performance becomes public. What we must keep in mind, much as we did in Chapter 10 when
    we were assessing the prospects for performance measurement and public reporting systems to
    be used for both accountability and performance improvement, is that the environments in
    which public and nonprofit organizations are embedded play an important role in the ways
    organizational cultures evolve and co-adapt.

    To build and sustain an evaluative culture, Mayne (2008) suggests, among other things,
    that

    managers need adequate autonomy to manage for results—Managers seeking to achieve
    outcomes need to be able to adjust their operations as they learn what is working and what
    is not. Managing only for planned outputs does not foster a culture of inquiry about what
    are the impacts of delivering those outputs. (p. 2)

    Refocusing organizational managers on outcomes instead of inputs and offering them
    incentives to perform to those (desired) outcomes has been linked to New Public Management
    ideals of loosening the process constraints on organizations so that managers would have more
    autonomy to improve efficiency and effectiveness (Hood, 1995). But as Moynihan (2008) and
    Gill (2011) point out, what has tended to happen in settings where political cultures are
    adversarial is that performance expectations (objectives, targets, and measures) have been
    layered on top of existing process controls instead of replacing them. In effect, from a
    managerial perspective, there are more controls in place now that performance measurement
    and reporting are part of the picture and less “freedom to manage.”

    What effect does this have on building evaluative cultures? The main issue is the impact on
    the willingness to take risks. Where organizational environments are substantially risk-averse,
    that will condition and limit the prospects for developing an organizational culture that
    encourages risk taking. In short, building and sustaining evaluative cultures requires not only
    supportive organizational leadership but also a political and organizational environment that
    permits reporting evaluative results that are able to acknowledge below-par performance, when
    it occurs.

    MANAGER INVOLVEMENT IN EVALUATIONS: LIMITS AND
    OPPORTUNITIES

    Increasingly, program managers are expected to play a role in evaluating their own programs.
    In many situations, particularly for managers in nonprofit organizations, resources to conduct
    evaluations are scarce. But expectations that programs will be evaluated (and that information

    will be provided that can be used by funders to make decisions about the program’s future) are
    growing. Designing and implementing performance measurement systems also presumes a key
    role for managers.

    In Chapter 10, we discussed the ways in which setting up performance measures to make
    summative judgments about programs can produce unintended consequences—managers will
    respond to the incentives that are implied by the consequences of reporting performance results
    and will shape their behavior accordingly. The “naming and shaming” system of England’s
    health care providers from 2000 to 2005 resulted in substantial problems with the validity of
    the performance data (Bevan & Hamblin, 2009).

    Involving managers, indeed giving them a central role in evaluations that are intended to
    meet external accountability requirements, is different from involving them or even giving
    them the lead in formative evaluations. Because the field of evaluation is so broad and diverse,
    we see a range of views on how much and in what ways managers should be involved in
    evaluations (including performance measurement systems).

    Intended Evaluation Uses and Managerial Involvement

    Most contemporary evaluation approaches emphasize the importance of the ultimate uses
    of evaluations. In fact, there is a growing literature that examines and categorizes different
    kinds of uses (Leviton, 2003; Mark & Henry, 2004). Patton (2008), in his book Utilization-
    Focused Evaluation, points out that the evaluation field has evolved toward making uses of
    evaluations a key criterion. The Program Evaluation Standards (Yarbrough, Shulha, Hopson,
    & Caruthers, 2011), developed by the Joint Committee on Standards for Educational
    Evaluation, make utility one of the five standards for evaluation quality. The other four are
    feasibility, propriety, accuracy, and accountability.

    Many evaluation approaches support involving program managers in the process of
    evaluating programs. Participatory evaluation approaches, for example, emphasize the
    importance of having practitioners involved in evaluations, principally to increase the
    likelihood that the evaluations will be used (Cousins & Whitmore, 1998; Smits & Champagne,
    2008).

    Some evaluation approaches (empowerment evaluation is an example) emphasize
    evaluation use but go beyond practitioner involvement to making social justice–related
    outcomes an important goal of the evaluation process. Empowerment evaluation is intended in
    part to make evaluation part of the normal planning and management of programs and to
    ultimately put managers and staff in charge of their own destinies. “Too often,” argue
    Fetterman, Kaftarian, and Wandersman (1996),

    external evaluation is an exercise in dependency rather than an empowering experience: in
    these instances the process ends when the evaluator departs, leaving participants without
    the knowledge or expertise to continue for themselves. In contrast, an evaluation
    conducted by program participants is designed to be ongoing and internalized in the
    system, creating the opportunity for capacity building. (p. 9)

Initially, Fetterman seemed to view empowerment evaluation as a formative process. He argued that the
    assessment of a program’s worth is not an end point in itself but part of an ongoing process of
    program improvement. Fetterman (2001) acknowledged, however, that

    the value or strength of empowerment evaluation is directly linked to the purpose of the
    evaluation.… Empowerment evaluation makes a significant contribution to internal

    accountability, but has serious limitations in the area of external accountability … An
    external audit or assessment would be more appropriate if the purpose of the evaluation
    was external accountability. (p. 145)

    In a more recent rebuttal of criticism of empowerment evaluation, Fetterman and
    Wandersman (2007) suggest that their approach is capable of producing unbiased evaluations
    and, by implication, evaluations that are defensible as summative products. In response to
    criticism by Cousins (2005), they suggest,

    contrary to Cousins’ (2005) position that “collaborative evaluation approaches … [have]
    … an inherent tendency toward self-serving bias” (p. 206), we have found many
    empowerment evaluations to be highly critical of their own operations, in part because
    they are tired of seeing the same problems and because they want their programs to work.
    Similarly, empowerment evaluators may be highly critical of programs that they favor
    because they want them to be effective and accomplish their intended goals. It may appear
    counterintuitive, but in practice we have found appropriately designed empowerment
    evaluations to be more critical and penetrating than many external evaluations. (Fetterman
    & Wandersman, 2007, p. 184)

    Below, we expand on managerial involvement in evaluation for accountability and
    evaluation for program improvement.

    Evaluating for Accountability

    Public accountability has become nearly a universal expectation in both the public and the
    nonprofit sectors internationally. There are many countries where some regime of public
    accountability exists at both the national and the subnational levels. Evaluating for
    accountability is typically summative, and often the key stakeholders are outside the
    organizations in which the programs being evaluated are located. Stakeholders can include
    central agencies, funders, elected officials, and others, including interest groups and citizens.

    Summative evaluations can be aimed at meeting accountability requirements, but they do
    not have to be. It is possible to have an evaluation that looks at the merit or worth of a program
    (Lincoln & Guba, 1980) but is intended for stakeholders within an organization. A volunteer
    nonprofit board, for example, may be the principal client for a summative evaluation of a
    program, and although the decisions flowing from such an evaluation could affect the future of
    the program, the evaluation could be seen as internal to the organization.

    A good example of an organization that conducts high-stakes accountability evaluations is
    the Government Accountability Office (GAO) in the United States. Although a part of the
    Congress, the GAO straddles the boundary between the executive and the legislative branches
of the U.S. federal government. Eleanor Chelimsky (2008), drawing on her experience at the GAO, candidly describes the “clash of cultures” between evaluation and politics and makes a strong case for the importance of evaluator independence in summative evaluations for accountability. She points to the American separation-of-powers structure as both prompting
    a demand for evaluation and, at the same time, threatening evaluator independence:

    Because our government’s need for evaluation arises from its checks-and-balances
    structure—which, as you know, features separation of powers, legislative oversight, and
    accountability to the people as protectors for individual liberty—evaluators working
    within that structure must deal, not exceptionally but routinely and regularly, with

    political infringements on their independence that result directly from that structure. (p.
    400)

    For Chelimsky (2008), evaluator independence is an essential asset for the GAO in its work
    with the Congress. At the same time, the GAO relies on government agencies to contribute to
    its work. It needs to secure the cooperation of the agencies in which the programs being
    evaluated are located. It needs the data that are housed in federal departments and agencies, to
    be able to construct key lines of evidence for evaluations. What Chelimsky has observed over
    time is a growing trend toward limiting access to agency data:

    Between 1980 and 1994—that is, across the Carter, Reagan, Bush, and Clinton
    presidencies—we found that secrecy and classification of information were becoming
    prevalent in an increasing number of agencies. Yet it would be hard to find a more critical
    issue for evaluation than this one. (p. 407)

    Chelimsky’s (2008) view is that this issue, if anything, became more critical under the
    Bush administrations (2001–2008). In effect, agency and managerial involvement in GAO
    evaluations has become a significant political issue in the American government.

    The GAO model of independent evaluations is exceptional—most governments do not
    have a substantial institutional capacity to conduct independent evaluations. Instead, a more
    typical model would be the one in the Canadian federal government, wherein each department
    and agency has at least some evaluation capacity built into the organizational structure but
    evaluation unit heads report to the administrative head of the agency. This model is similar to
    the one advocated by Love (1991) in his description of internal evaluation. Unlike audit, where
    there are typically both internal and external auditors to examine administrative processes and
    even performance, evaluation continues to be an internal function.

    In the Canadian example of the federal evaluation function, housing evaluation capacity in
    departments and agencies makes sense from a formative standpoint; evaluators report to the
    heads of the agencies, and their work would, in principle, be useful for making program-related
    changes. But the overall thrust of the 2009 Federal Evaluation Policy is summative; that is, the
    emphasis in the policy is on evaluations providing information to senior elected and appointed
    officials and being used to fulfill accountability expectations. Evaluators who work in the
    Canadian federal government are expected to wear two hats: They are members of the
    organizations in which they do their evaluation work, but at the same time, they are expected to
    meet the policy requirements set forth by Treasury Board. Like their counterparts in the GAO,
    they need to work with managers to be able to do their work, but unlike the GAO, they do not
    have an institutional base that is independent of the programs they are expected to evaluate.

    Evaluating for Program Improvement

    Most evaluation approaches emphasize the importance of evaluating to improve programs.
    In Chapter 10, we saw that when public sector performance measurement systems are intended
    to be used for both public accountability and performance improvement purposes, one use can
    crowd out the other use. Specifically, requiring performance results to be publicly reported (to
    fulfill accountability expectations) can affect the ways that information is viewed and used
within organizations. Combining evaluation for program improvement with evaluation for accountability can produce similar effects. If organizational managers are invited to be part of an evaluation whose results will become public and may have significant consequences for their programs or organizations, suggestions that the evaluation is also intended to improve the program will be viewed with some skepticism.

    The political culture in which the organization is embedded will affect perceptions of risk,
    willingness to be candid, and perhaps even willingness to provide information for the
    evaluation. Chelimsky (2008) points out that organizationally based information is critical to
    constructing credible program evaluations. Making program evaluation high stakes, that is,
    making evaluation results central to deciding the future of programs or even organizations, will
    weaken the connections between evaluators and evaluands (the programs and managers being
    evaluated), and affect the likelihood of successful future evaluation engagements.

    Manager Bias in Evaluations: Limits to Manager Involvement

    We began with Wildavsky’s (1979) view that managers and evaluators have quite different
and, in some respects, conflicting roles. The field of evaluation as a whole has moved toward a position that makes room for manager involvement in evaluations, which raises the question of what limits, if any, there should be on how managers participate.

    At one end of a continuum of manager involvement, Fetterman and Wandersman (2007)
    suggest that empowerment evaluation as a participatory approach facilitates managers and
    other organizational members taking the lead in conducting both formative and summative
    evaluations of their own programs. This view has been challenged by those who advocate for a
    central role for program evaluators as judges of the merit and worth of programs (Scriven,
    2005). Stufflebeam (1994) challenged advocates of empowerment evaluation around the issue
    of whether managers and other stakeholders (not the evaluator[s]) should make the decisions
    about the evaluation process and evaluation findings. His view is that ceding that amount of
    control invites “corrupt or incompetent evaluation activity” (p. 324):

    Many administrators caught in political conflicts over programs or needing to improve
    their public relations image likely would pay handsomely for such friendly, non-
    threatening, empowering evaluation service. Unfortunately, there are many persons who
    call themselves evaluators who would be glad to sell such services. Unhealthy alliances of
    this type can only delude those who engage in such pseudo evaluation practices, deceive
    those whom they are supposed to serve, and discredit the evaluation field as a legitimate
    field of professional practice. (p. 325)

Stufflebeam’s view is a strong critique of empowerment evaluation and, by implication, of other evaluative approaches that cede the central position that evaluation professionals hold in conducting both formative and summative evaluations. Even so, the roles that evaluators and managers play often differ. The views put forward by advocates for
    empowerment evaluation (Fetterman & Wandersman, 2007) suggest assumptions about what
    motivates program managers that are similar to Le Grand’s (2010) suggestion that historically,
    public servants in Britain were assumed to be interested in “doing the right thing” in their
    work. In other words, managers would not be self-serving but instead would be motivated by a
    desire to serve the public. Le Grand (2010) called such public servants “knights.” His own
    view is that this assumption is naïve and needs to be tempered by considering the incentives
    that shape behaviors.

    The nature of organizational politics and the interactions between organizations and their
environments usually mean that the managerial interest in preserving and enhancing programs is challenged by the role that evaluators play in judging the merit and worth of those programs.

    Expecting managers to evaluate their own programs can result in biased program
    evaluations. Indeed, a culture can be built up around the evaluation function such that
    evaluators are expected to be advocates for programs. Under such conditions, departments and
    agencies would use their evaluation capacity to defend their programs, structuring evaluations
    and presenting results so that programs are seen to be above criticism. In the language used in
    Chapter 10 to describe situations where performance measurement systems produced
unintended results, gaming of the program evaluation function can occur.

    Evaluations produced by organizations under such conditions will tend to be viewed
    outside the organization with skepticism. Funders, or analysts who are employed by the
    funders, will work hard to expose weaknesses in the methodologies used and cast doubt on the
    information in the evaluation reports. In effect, adversarial relationships can develop, which
    serve to “expose” weaknesses in evaluations, but are generally not conducive to building self-
    evaluating or learning organizations. As well, such controversies can undermine a sense that
    the organization is accountable.

    The reality is that expecting program managers to evaluate their own programs, particularly
    where evaluation results are likely to be used in funding decisions, is likely to produce
    evaluations that reflect the natural incentives and risk aversion inherent in such situations.
    They are not necessarily credible even to the managers themselves. Program evaluation, as an
    organizational function, becomes distorted and contributes to a view that evaluations are
    biased.

    Parenthetically, Nathan (2000), who has worked with several top American policy research
    centers, points out that internal evaluations are not the only ones that may reflect incentives
    that bias evaluation results:

    Even when outside organizations conduct evaluations, the politics of policy research can
    be hard going. To stay in business, a research organization (public or private) has to
    generate a steady flow of income. This requires a delicate balance in order to have a
    critical mass of support for the work one wants to do and at the same time maintain a high
    level of scientific integrity. (p. 203)

    Nevertheless, such incentives are likely to be more prevalent and stronger with internal
    evaluations.

    Should managers participate in evaluations of their own programs? Generally, scholars and
    practitioners who have addressed this question have favored managerial involvement. Love
    (1991) envisions (internal) evaluators working closely with program managers to produce
    evaluations on issues that are of direct relevance to the managers. Patton (2008) stresses that
    among the fundamental premises of utilization-focused evaluation, the first is commitment to
    working with the intended users to ensure that the evaluation actually gets used.

    STRIVING FOR OBJECTIVITY IN PROGRAM EVALUATIONS

    Chelimsky (2008), in her description of the challenges to independence that are endemic in the
    work that the GAO does, makes a case for the importance of evaluations being objective:

    The strongest defense for an evaluation that’s in political trouble is its technical
    credibility, which, for me, has three components. First, the evaluation must be technically
    competent, defensible, and transparent enough to be understood, at least for the most part.
    Second, it must be objective: That is, in Matthew Arnold’s terms (as cited in Evans,

    2006), it needs to have “a reverence for the truth.” And third, it must not only be but also
    seem objective and competent: That is, the reverence for truth and the methodological
    quality need to be evident to the reader of the evaluation report. So, by technical
    credibility, I mean methodological competence and objectivity in the evaluation, and the
    perception by others that both of these characteristics are present. (p. 411)

    Clearly, Chelimsky sees the value in claiming that high-stakes GAO evaluations are
    objective. “Objective” is also a desired attribute of the information produced in federal
    evaluations in Canada: “Evaluation … informs government decisions on resource allocation
    and reallocation by … providing objective information to help Ministers understand how new
    spending proposals fit” (Treasury Board of Canada Secretariat, 2009, sec. 3.2).

Evaluation is fundamentally about linking theory and practice. Notwithstanding the practitioner views cited above (that objectivity is desirable), academics in the field have not tended to emphasize “objectivity” as a criterion for good-quality evaluations (Conley-Tyler,
    2005; Patton, 2008). Stufflebeam (1994), one exception, emphasizes the importance of what he
    calls “objectivist evaluation” (p. 326) in professional evaluation practice. His definition of
    objectivist evaluation picks up some of the themes articulated by Chelimsky (2008) above. For
    Stufflebeam (1994),

    objectivist evaluations are based on the theory that moral good is objective and
    independent of personal or merely human feelings. They are firmly grounded in ethical
    principles, strictly control bias or prejudice in seeking determinations of merit and worth,
    … obtain and validate findings from multiple sources, set forth and justify conclusions
about the evaluand’s merit and/or worth, report findings honestly and fairly to all right-to-know audiences, and subject the evaluation process and findings to independent
    assessments against the standards of the evaluation field. Fundamentally, objectivist
    evaluations are intended to lead to conclusions that are correct—not correct or incorrect
    relative to a person’s position, standing or point of view. (p. 326)

    Scriven has also advocated for good evaluations to be objective. For Scriven (1997),
    objectivity is defined as “with basis and without bias” (p. 480), and an important part of being
    able to claim that an evaluation is objective is to maintain an appropriate distance between the
    evaluator and what is being evaluated (the evaluand). There is a crucial difference, for Scriven,
between being an evaluator and being an evaluation consultant. Evaluators rely on validity as their stock-in-trade, and objectivity is a central part of being able to claim that one’s work is valid. Evaluation consultants work with their clients and stakeholders, but according to Scriven, in the end they cannot offer analysis, conclusions, or recommendations that are untainted by those interactions and the biases they entail.

    In addition to Scriven’s view that objectivity is a key part of evaluation practice, other
    related professions have asserted, and continue to assert, that professional practice is, or at least
    ought to be, objective. In the 2003 edition of the Government Auditing Standards (GAO,
    2003), government auditors are enjoined to perform their work this way:

    Professional judgment requires auditors to exercise professional skepticism, which is an
    attitude that includes a questioning mind and a critical assessment of evidence. Auditors
    use the knowledge, skills, and experience called for by their profession to diligently
    perform, in good faith and with integrity, the gathering of evidence and the objective
    evaluation of the sufficiency, competency, and the relevancy of evidence. (p. 51)

    Should evaluators claim that their work is also objective? Objectivity has a certain cachet,
    and as a practitioner, it would be appealing to be able to assert to prospective clients that one’s
    work is objective. Indeed, in situations where evaluators are competing with auditors for
    clients, claiming objectivity could be an important factor in convincing clients to use the
    services of an evaluator.

    Can Program Evaluators Be Objective?

    If giving managers a (substantial) stake in evaluations compromises evaluator and
    evaluation objectivity, then it is important to unpack what is entailed by claims that evaluations
    or audits are objective. Is Scriven’s definition of objectivity defensible? Is objectivity a
    meaningful criterion for high-quality program evaluations? Could we defend a claim to a
    prospective client that our work would be objective?

    Scriven (1997) suggests a metaphor to understand the work of an evaluator: When we do
    program evaluations, we can think of ourselves as expert witnesses. We are, in effect, called to
    “testify” about a program, we offer our expert opinions, and the “court” (our client) can decide
    what to do with our contributions.

    Scriven (1997) takes the courtroom metaphor further when he asserts that in much the same
    way that witnesses are sworn to tell “the truth, the whole truth, and nothing but the truth” (p.
    496), evaluators can rely on a common-sense notion of the truth as they do their work. If such
    an oath “works” in courts (Scriven believes it does), then despite the philosophical questions
    that can be raised by a claim that something is true, we can and should continue to rely on a
    common-sense notion of what is true and what is not.

    Scriven’s main point is that program evaluators should be prepared to offer objective
    evaluations and that to do so, it is essential that we recognize the difference between
    conducting ourselves in ways that promote our objectivity and ways that do not. Even those
    who assert that there cannot be any truths in our work are, according to Scriven, uttering a self-
    contradictory assertion: They wish to claim the truth of a statement that there are no truths.

    Although Scriven’s argument has a common-sense appeal, it is important to examine it
more closely. There are two main issues with the approach he takes.

    First, Scriven’s metaphor of evaluators as expert witnesses does have some limitations. In
    courts of law, expert witnesses are routinely challenged by their counterparts and by opposing
    lawyers. Unlike Scriven’s evaluators, who do their work, offer their report, and then absent
    themselves to avoid possible compromises of their objectivity, expert witnesses in courts
    undergo a high level of scrutiny. Even where expert witnesses have offered their version of the
    truth, it is often not clear whether that is their view or the views of a party to a legal dispute.
    Expert witnesses can sometimes be “purchased.”

    Second, witnesses speaking in court can be severely penalized if it is discovered that they
    have lied under oath. For program evaluators, it is far less likely that sanctions will be brought
    to bear even if it could be demonstrated that an evaluator did not speak “the truth.”
    Undoubtedly, an evaluator’s place in the profession can be affected when the word gets around
    that he or she has been “bought” by a client, but the reality is that in the practice of program
    evaluation, clients can and do shop for evaluators who are likely to “do the job right.” “Doing
    the job right” can mean that evaluators are paid to not speak “the truth, the whole truth, and
    nothing but the truth.”

    Looking for a Defensible Definition of Objectivity

    Are there other definitions of objectivity that are useful in terms of assisting our practice of
    program evaluation? The Federal Government of Canada’s OCG (Office of the Comptroller
    General) was among the government jurisdictions that historically advocated the importance of
    objectivity in evaluations. In one statement, objectivity was defined this way:

    Objectivity is of paramount importance in evaluative work. Evaluations are often
    challenged by someone: a program manager, a client, senior management, a central
    agency or a minister. Objectivity means that the evidence and conclusions can be verified
    and confirmed by people other than the original authors. Simply stated, the conclusions
    must follow from the evidence. Evaluation information and data should be collected,
    analyzed and presented so that if others conducted the same evaluation and used the same
    basic assumptions, they would reach similar conclusions. (Treasury Board of Canada
    Secretariat, 1990, p. 28)

    This definition of objectivity emphasizes the reliability of evaluation findings and
    conclusions, and is similar to the way auditors define high-quality work in their profession.
    This implies, at least in principle, that the work of one evaluator or one evaluation team could
    be repeated, with the same results, by a second evaluation of the same program.

    A Natural Science Definition of Objectivity

    The OCG criterion of repeatability is similar in part to the way scientists do their work.
    Findings and conclusions, to be accepted by the discipline, must be replicable.

    There is, however, an important difference between program evaluation practice and the
    practice of scientific disciplines. In the sciences, the methodologies and procedures that are
    used to conduct research and report the results are intended to facilitate replication. Methods
    are scrutinized by one’s peers, and if the way the work has been conducted and reported passes
    this test, it is then “turned over” to the community of researchers, where it is subjected to
    independent efforts to replicate the results. In other words, meaningfully claiming objectivity
    requires both the use of replicable methodologies and actual replications of programs and
    policies. In practical terms, satisfying both of these criteria is rare.

    If a particular set of findings cannot be replicated by independent researchers, the
    community of research peers eventually discards the results as an artifact of the setting or the
    scientist’s biases. Transparent methodologies are necessary but not sufficient to establish
    objectivity of scientific results. The initial reports of cold fusion reactions (Fleischmann &
    Pons, 1989), for example, prompted additional attempts to replicate the reported findings, to no
avail. Fleischmann and Pons’s research methods proved to be faulty, and cold fusion did not
    pass the test of replicability.

    A more contemporary controversy that also hinges on being able to replicate experimental
    results is the question of whether high-energy neutrinos can travel faster than the speed of
    light. If such a finding were corroborated (reproduced by independent teams of researchers), it
    would undermine a fundamental assumption of Einstein’s relativity theory—that no particle
    can travel faster than the speed of light. The back-and-forth “dialogue” in the high-energy
    physics community is illustrated by a publication that claims that the one set of experimental
    results (apparently replicating the original experiment) were wrong and that Einstein’s theory
    is safe (Antonello et al., 2012). The dialogue between the experimentalists and the
    theoreticians in physics on whether neutrinos actually have been measured traveling faster than

    the speed of light has the potential to change physics as we know it. The stakes are high, and
    therefore, the canons of scientific research must be respected.

    For scientists, then, objectivity has two important elements, both of which are necessary.
Methods and procedures need to be constructed and applied so that both the work done and the findings are open to scrutiny by one’s peers. Although the process of doing a given
    science-based research project does not by itself make the research objective, it is essential that
    this process be transparent. Scrutability of methods facilitates repeating the research. If
    findings can be replicated independently, the community of scholars engaged in similar work
    confers objectivity on the research. Even then, scientific findings are not treated as absolutes.
    Future tests might raise questions, offer refinements, and generally increase knowledge.

    This working definition of objectivity does not imply that objectivity confers “truth” on
    scientific findings. Indeed, the idea that objectivity is about scrutability and replicability of
    methods and repeatability of findings is consistent with Kuhn’s (1962) notion of paradigms.
    Kuhn suggested that communities of scientists who share a “worldview” are able to conduct
    research and interpret the results. Within a paradigm, “normal science” is about solving
    puzzles that are implied by the theoretical structure that undergirds the paradigm. “Truth” is
    agreement, based on research evidence, among those who share a paradigm.

    In program evaluation practice, much of what we call methodology is tailored to particular
    settings. Increasingly, we are taking advantage of mixed qualitative–quantitative methods
    (Creswell, 2009; Hearn, Lawler, & Dowswell, 2003) when we design and conduct evaluations,
    and our own judgment as professionals plays an important role in how evaluations are designed
    and data are gathered, interpreted, and reported. Owen and Rogers (1999) make this point
    when they state,

    no evaluation is totally objective: it is subject to a series of linked decisions [made by the
    evaluator]. Evaluation can be thought of as a point of view rather than a statement of
    absolute truth about a program. Findings must be considered within the context of the
    decisions made by the evaluator in undertaking the translation of issues into data
    collection tools and the subsequent data analysis and interpretation. (p. 306)

    Although the OCG criterion of repeatability (Treasury Board of Canada Secretariat, 1990)
    in principle might be desirable, it is rarely applicable to program evaluation practice. Even in
    the audit community, it is rare to repeat the fieldwork that underlies an audit report. Instead,
    the fieldwork is conducted so that all findings are documented and corroborated by more than
    one line of evidence (or one source of information). In effect, there is an audit trail for the
    evidence and the findings.

    Implications for Evaluation Practice

    Where does this leave us? Scriven’s (1997) criteria for objectivity—with basis and without
bias—have some defensibility limitations inasmuch as they usually depend on the “objectivity”
    of individual evaluators in particular settings. Not even in the natural sciences, where the
    subject matter and methods are far more conducive to Scriven’s definition, do researchers rely
    on one scientist’s assertions about “facts” and “objectivity.” Instead, the scientific community
    demands that the methods and results be stated so that the research results can be corroborated
    or disconfirmed, and it is via that process that “objectivity” is conferred. Objectivity is not an
    attribute of one researcher but instead is predicated on the process in the scientific community
    in which that researcher practices.

    In some professional settings where teams of evaluators work on projects, it may be
    possible to construct internal challenge functions and even share draft reports externally to
    increase the likelihood that the final product will be viewed as defensible and robust. But
    repeating an evaluation to confirm the replicability of the findings is almost never done.

    The realities of the practice of program evaluation weaken claims that we evaluators can be
    objective in the work we do. Evaluation is not a science. Instead, it is a craft that mixes
together methods with professional judgment to produce products that are methodologically defensible and tailored to their contexts, and that almost always have unique characteristics.

    CRITERIA FOR HIGH-QUALITY EVALUATIONS: THE
    VARYING VIEWS OF EVALUATION ASSOCIATIONS

    Many professional associations that represent the interests and views of program evaluators
    have developed codes of ethics or best practice guidelines. A review of several of these
    guideline documents indicates that, with one exception (American Educational Research
    Association [AERA], 2011), there is little specific attention to “objectivity” among the criteria
    suggested for good evaluations (AERA, 2011; American Evaluation Association, 2004;
    Australasian Evaluation Society, 2010; Yarbrough, Shulha, Hopson, & Caruthers, 2011).

    Historically, Scriven (1997), Stufflebeam (1994), and, more recently, Chelimsky (2008)
    have emphasized objectivity as a key commodity of program evaluations, and there are
    government organizations that in their guidelines for assessing evaluation reports do discuss
    the issue of objectivity (see, e.g., Treasury Board of Canada Secretariat, 1990, 2009; U.S.
    OMB, 2004b). Markiewicz (2008) provides a provocative discussion about challenges of
    independence and objectivity in the political context of evaluation, noting,

    the challenges presented by the political and stakeholder context of evaluation do raise the
    longstanding paradigm wars between scientific realists and social constructionists. The
    former group of evaluators tend to uphold concepts of objectivity and independence in
    evaluation, while the latter group of evaluators view themselves as negotiators of different
    realities. (p. 35)

    There is one research and evaluation association that has explicitly included objectivity as a
    criterion for high-quality studies. The AERA (2008, p. 1) defines scientifically based research
    as “the use of rigorous, systematic, and objective methodologies to obtain valid and reliable
    knowledge.” The full set of criteria includes the following:

a. development of a logical, evidence-based chain of reasoning;
b. methods appropriate to the questions posed;
c. observational or experimental designs and instruments that provide reliable and generalizable findings;
d. data and analysis adequate to support the findings;
e. explication of procedures and results clearly and in detail, including specification of the population to which the findings can be generalized;
f. adherence to professional norms of peer review;
g. dissemination of the findings to contribute to scientific knowledge; and
h. access to data for reanalysis, replication, and the opportunity to build on findings.

    Evaluating program effectiveness (assessing cause-and-effect relationships) requires
    “experimental designs using random assignment or quasi-experimental or other designs that
    substantially reduce plausible competing explanations for the obtained results” (AERA, 2008,
    p. 1).

    The AERA has been part of the policy changes in the United States in the field of
    education evaluation that began with the No Child Left Behind Act of 2002 (Duffy, Giordano,
    Farrell, Paneque, & Crump, 2008). Duffy et al. (2008) point out that the phrase “scientifically-
    based research” appeared over 100 times in the legislation. The working definition of that
    phrase is very similar to the AERA definition above. Since the No Child Left Behind Act was
    passed, privileging quantitative, experimental, and quasi-experimental evaluation designs has
    had an impact on the whole evaluation community in the United States (Smith, 2007).

    The key question for us is whether the AERA definition of “scientifically based research”
    offers a credible alternative to other standards or guidelines. The AERA definition highlights
    the objectivity of research methodologies and mentions replication as one possible outcome
    from a study. But when we look at the field of education evaluation (and evaluation more
    broadly), we see that the efficacy of randomized controlled trials is substantially limited by
    contextual variables.

    Lykins (2009), in his assessment of the impacts of U.S. federal policy on education
    research, offers this example of the limits of “scientific research” in education:

    Take for instance the much-studied Tennessee STAR experiment in class-size reduction.
    The results of the randomized trial suggested that class-size reduction caused modest
    gains in the test scores of children in early grades. Boruch, De Moya, and Synder (2002)
    cite this study as evidence that “a single RFT can help to clarify the effect of a particular
    intervention against a backdrop of many nonrandomized trials” (p. 74). In fact, the
    experiment taught, at most, only that class-size reductions were responsible for increased
    test-scores for these particular students. It did not lend warrant to the claim that class-size
    reductions are an effective way for raising achievement as such. This became clear when
    California implemented a state-wide policy of class-size reduction. The California
    program not only failed to increase student achievement, but may have been responsible
    for a substantial increase in the number of poorly qualified teachers in high-poverty
    schools, thus actually harming student performance. (pp. 94–95, italics in original)

    The practical effect of privileging (experimental) methodologies that are aimed at
    examining cause-and-effect relationships is that program evaluations are limited in their
    generalizability. Cronbach (1982) pointed this out and effectively countered the then dominant
    view in evaluation that experimental designs, with their overriding emphasis on internal
    validity, were the gold standard.

    For evaluation associations and for evaluators, there are other quality-related criteria that
    are more relevant. With the exception of the AERA, the evaluation profession as a whole has
    generally not been prepared to emphasize objectivity as a criterion for high-quality evaluations.
    Instead, professional evaluation organizations tend to mention the accuracy and credibility of
    evaluation information (American Evaluation Association, 2004; Canadian Evaluation Society,
    2012; Organisation for Economic Cooperation and Development, 2010; Yarbrough et al.,
    2011), the honesty and integrity of evaluators and the evaluation process (American Evaluation
    Association, 2004; Australasian Evaluation Society, 2010; Canadian Evaluation Society, 2012;

    Yarbrough et al., 2011), the fairness of evaluation assessments (Australasian Evaluation
    Society, 2010; Canadian Evaluation Society, 2012; Yarbrough et al., 2011), and the validity
    and reliability of evaluation information (American Evaluation Association, 2004; Canadian
    Evaluation Society, 2012; Organisation for Economic Cooperation and Development, 2010;
    Yarbrough et al., 2011).

    In addition, professional guidelines emphasize the importance of declaring and avoiding
    conflicts of interest (American Evaluation Association, 2004; Australasian Evaluation Society,
    2010; Canadian Evaluation Society, 2012; Yarbrough et al., 2011) and the importance of
    impartiality in reporting findings and conclusions (Organisation for Economic Cooperation and
    Development, 2010). Evaluator independence is also mentioned as a criterion (Markiewicz,
    2008). Also, guidelines tend to emphasize the importance of competence in conducting
    evaluations, and the importance of upgrading evaluation skills (American Evaluation
    Association, 2004; Australasian Evaluation Society, 2010; Canadian Evaluation Society,
    2012). Collectively, these guidelines cover many of the characteristics of evaluators and
    evaluations that we might associate with objectivity: accuracy, credibility, validity, reliability,
    fairness, honesty, integrity, and competence. Transparency is also a criterion mentioned in
    some guidelines and standards (see, e.g., Organisation for Economic Cooperation and
    Development, 2010; Yarbrough et al., 2011). But—and this is a key point—objectivity is more
    than just having good evaluators or even good evaluations; it is a process that involves
    corroboration of one’s findings by one’s peers. Our profession is so diverse and includes so
    many different epistemological and methodological stances that asserting “objectivity” would
    not be supported by most evaluators.

    But, the evaluation profession does not exist alone in the current world of professionals
    who claim expertise in evaluating programs. The movement to connect evaluation to
    accountability expectations in public sector and nonprofit organizations has created situations
    where evaluation professionals, with their diverse backgrounds and standards, are compared
    with accounting professionals or with management consultants, who generally have a more
    uniform view of their respective professional standards. Because the public sector auditing
community in particular has claimed objectivity as a hallmark of its practice, it is arguable that it has a marketing advantage with prospective clients (see Everett, Green, & Neu, 2005;
    Radcliffe, 1998). Furthermore, with some key central agencies asserting that in assessing the
    quality of evaluations one of the key criteria should be objectivity of the findings (Treasury
    Board of Canada Secretariat, 2009), that criterion confers an advantage on practitioners who
    claim that their process and products are objective. Patton (2008) refers to the politics of
    objectivity, meaning that for some evaluators it is important to be able to declare that their
    work is objective.

    What should evaluators tell prospective clients who, having heard that the auditing
    profession or management consultants (Institute of Management Consultants, 2008) make
    claims about their work being objective, expect the same from a program evaluation? If we tell
    clients that we cannot produce an objective evaluation, there may be a risk of their going
elsewhere for assistance. On the other hand, claims that we can be objective are not supported by the realities of how evaluators actually work.

    Perhaps the best way to respond is to offer criteria that cover much of the same ground as is
    covered if one conducts evaluations with a view to their being “objective.” Criteria like
    accuracy, credibility, honesty, completeness, fairness, impartiality, avoiding conflicts of
    interest, competence in conducting evaluations, and a commitment to staying current in skills
    are all relevant. They would be among the desiderata that scientists and others who can make
    defensible claims about objectivity would include in their own practice. The criteria mentioned

    are also among the principal ones included by auditors and accountants in their own standards
    (GAO, 2003).

    Patton (2008) takes a pragmatic stance in his own assessment of whether to claim that
    evaluations are objective:

    Words such as fairness, neutrality, and impartiality carry less baggage than objectivity and
    subjectivity. To stay out of the argument about objectivity, I talk with intended users
    about balance, fairness, and being explicit about what perspectives, values, and priorities
    have shaped the evaluation, both the design and the findings. (p. 452)

    To sum up, current guidelines and standards that have been developed by professional
    evaluation associations generally do not claim that program evaluations should be objective.
    Correspondingly, as practicing professionals, we should not be making such claims in our
    work. That does not mean that we are without standards, and indeed, we should be striving to
    be honest, accurate, fair, impartial, competent, highly skilled, and credible in the work we do.
    If we are these things, we can justifiably claim that our work meets the same professional
    standards as work done by scholars and practitioners who might claim to be objective.

    SUMMARY

    The relationships between managers and evaluators are affected by the incentives that each
    party faces in particular contexts. If evaluators have been commissioned to conduct a
    summative evaluation, it is more likely that program managers will defend their programs,
    particularly where the stakes are perceived to be high. Expecting managers, under these
    conditions, to participate as neutral parties in an evaluation ignores the potential for conflicts of
    commitments, which can affect the accuracy and completeness of information that managers
    provide about their own programs. This problem parallels the problem that exists in
    performance measurement systems, where public, high-stakes, summative uses of performance
    results will tend to induce gaming of the system by those who are affected by the consequences
    of disseminating performance results.

    Formative evaluations, where it is generally possible to project a “win-win” scenario for
    managers and evaluators, offer incentives for managers to be forthcoming so that they benefit
    from an assessment based on an accurate and complete understanding of their programs.
    Historically, a majority of evaluations have been formative. Although advocates for program
    evaluation and performance measurement imply that evaluations can be used for resource
    allocation/reallocation decisions, it is comparatively rare to have an evaluation that does that.
    There has been a gap between the promise and the performance of evaluation functions in
    governments in that regard (Muller-Clemm & Barnes, 1997).

    Many evaluation approaches encourage or even mandate manager or organizational
    participation in evaluations. Where utilization of evaluation results is a central concern of
    evaluation processes, managerial involvement has been shown to increase uses of evaluation
    findings. Some evaluation approaches—empowerment evaluation is an example of an
    important and relatively new approach—suggest that control of the evaluation process should
    be devolved to those in the organizations and programs being evaluated. This view is contested
in the evaluation field and continues to be debated by evaluation scholars and practitioners.

    Promoting quality standards for evaluations continues to be an important indicator of the
    professionalization of evaluation practice. Although objectivity has been a desired feature of

    “good” evaluations in the past, professional associations have generally opted not to emphasize
    objectivity among the criteria that define high-quality evaluations.

    Evaluators, accountants, and management consultants will continue to be connected with
    efforts by government and nonprofit organizations to be more accountable. In some situations,
    evaluation professionals, accounting professionals, and management consultants will compete
    for work with clients. Because the accounting profession continues to assert that their work is
    objective, evaluators will have to address the issue of how to characterize their own practice,
    so that clients can be assured that the work of evaluators meets standards of rigor, defensibility,
    and ethical practice.

    DISCUSSION QUESTIONS

1. Why are summative evaluations more challenging to do than formative evaluations?
2. How should program managers be involved in evaluations of their own programs?
3. What is a learning organization, and how is the culture of a learning organization supportive of evaluation?
4. What are the advantages and disadvantages of relying on internal evaluators in public sector and nonprofit organizations?
5. What is an evaluative culture in an organization? What roles would evaluators play in building and sustaining such a culture?
6. What would it take for an evaluator to claim that her or his evaluation is objective? Given those requirements, is it possible for any evaluator to say that his or her evaluation is objective? Under what circumstances, if any?
7. Suppose that you are a practicing evaluator and you are discussing a possible contract to do an evaluation for an agency. The agency director is very interested in your proposal but, in the discussions, says that he wants an objective evaluation. If you are willing to tell him that your evaluation will be objective, you have the contract. How would you respond to this situation?
8. Other professions like medicine, law, accounting, and social work have guidelines for professional practice that can be enforced against individual practitioners, if need be. Evaluation has guidelines, but they are not enforceable. What would be the advantages and disadvantages of the evaluation profession having enforceable practice guidelines? Who would do the enforcing?

    REFERENCES

    American Educational Research Association. (2008). Definition of scientifically based
    research. Retrieved from
    http://www.aera.net/Portals/38/docs/About_AERA/KeyPrograms/
    DefinitionofScientificallyBasedResearch

    American Educational Research Association. (2011). Code of ethics: American Educational
    Research Association—approved by the AERA Council February 2011. Retrieved from
    http://www.aera.net/Portals/38/docs/
    About_AERA/CodeOfEthics(1)

    American Evaluation Association. (2004). Guiding principles for evaluators. Retrieved from
    http://www.eval.org/Publications/GuidingPrinciples.asp

    Antonello, M., Aprili, P., Baibussinov, B., Baldo Ceolin, M., Benetti, P., Calligarich, E., …
    Zmuda, J. (2012). A search for the analogue to Cherenkov radiation by high energy
    neutrinos at superluminal speeds in ICARUS. Physics Letters B, 711(3–4), 270–275.

    Australasian Evaluation Society. (2010). AES guidelines for the ethical conduct of evaluations.
    Retrieved from http://www.aes.asn.au/

    Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for
    emergency calls: Effects of different systems of performance measurement within the UK.
    Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 161–190.

    Canadian Evaluation Society. (2012). Program evaluation standards. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=6&
    ss=10&_lang=EN

    Chelimsky, E. (2008). A clash of cultures: Improving the “fit” between evaluative
    independence and the political requirements of a democratic society. American Journal of
    Evaluation, 29(4), 400–415.

    Conley-Tyler, M. (2005). A fundamental choice: Internal or external evaluation? Evaluation
    Journal of Australasia, 5(1&2), 3–11.

    Cousins, J. B. (2005). Will the real empowerment evaluation please stand up? A critical friend
    perspective. In D. Fetterman & A. Wandersman (Eds.), Empowerment evaluation
    principles in practice (pp. 183–208). New York, NY: Guilford Press.

    Cousins, J. B., & Whitmore, E. (1998). Framing participatory evaluation. New Directions for
    Evaluation, 80, 5–23.

    Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods
    approaches. Thousand Oaks, CA: Sage.

    Cronbach, L. J. (1982). Designing evaluations of educational and social programs (1st ed.).
    San Francisco, CA: Jossey-Bass.

    de Lancer Julnes, P., & Holzer, M. (2001). Promoting the utilization of performance measures
    in public organizations: An empirical study of factors affecting adoption and
    implementation. Public Administration Review, 61(6), 693–708.

    Duffy, M., Giordano, V. A., Farrell, J. B., Paneque, O. M., & Crump, G. B. (2008). No Child
    Left Behind: Values and research issues in high-stakes assessments. Counseling and
    Values, 53(1), 53–66.

    Everett, J., Green, D., & Neu, D. (2005). Independence, objectivity and the Canadian CA
    profession. Critical Perspectives on Accounting, 16(4), 415–440.

Fetterman, D. (2001). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.

    Fetterman, D., Kaftarian, S. J., & Wandersman, A. (1996). Empowerment evaluation: Knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.

    Fetterman, D., & Wandersman, A. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American Journal of Evaluation, 28(2), 179–198.

    Fleischmann, M., & Pons, S. (1989). Electrochemically induced nuclear fusion of deuterium. Journal of Electroanalytical Chemistry, 261(2A), 301–308.

    Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

    Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand. Wellington, NZ: Institute of Policy Studies.

    Government Accountability Office. (2003, August). Government auditing standards: 2003
    revision (GAO-03–673G). Washington, DC: Author.

    Hearn, J., Lawler, J., & Dowswell, G. (2003). Qualitative evaluations, combined methods and
    key challenges: General lessons from the qualitative evaluation of community intervention
    in stroke rehabilitation. Evaluation, 9(1), 30–54.

    Hood, C. (1995). The “new public management” in the 1980s: Variations on a theme.
    Accounting, Organizations and Society, 20(2–3), 93–109.

    Institute of Management Consultants. (2008). IMC code of ethics & member’s pledge.
    Retrieved from http://www.imc.org.au/Become-a-Member/Membership/
    IMC-CODE-OF-ETHICS-MEMBERS-PLEDGE.asp

    King, J. A., Cousins, J. B., & Whitmore, E. (2007). Making sense of participatory evaluation:
    Framing participatory evaluation. New Directions for Evaluation, 114, 83–105.

    Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago
    Press.

    Le Grand, J. (2010). Knights and knaves return: Public service motivation and the delivery of
    public services. International Public Management Journal, 13(1), 56–71.

    Leviton, L. C. (2003). Evaluation use: Advances, challenges and applications. American
    Journal of Evaluation, 24(4), 525–535.

    Lincoln, Y. S., & Guba, E. G. (1980). The distinction between merit and worth in evaluation.
    Educational Evaluation and Policy Analysis, 2(4), 61–71.

    Love, A. J. (1991). Internal evaluation: Building organizations from within. Newbury Park,
    CA: Sage.

    Lykins, C. (2009). Scientific research in education: An analysis of federal policy (Doctoral
    dissertation). Nashville, TN: Graduate School of Vanderbilt University. Retrieved from
    http://etd.library.vanderbilt.edu/available/etd-
    07242009-114615/unrestricted/lykins

    Mark, M. M., & Henry, G. T. (2004). The mechanisms and outcomes of evaluation influence.
    Evaluation, 10(1), 35–57.

    Markiewicz, A. (2008). The political context of evaluation: What does this mean for
    independence and objectivity? Evaluation Journal of Australasia, 8(2), 35–41.

    Mayne, J. (2008). Building an evaluative culture for effective evaluation and results
    management. Retrieved from http://www.cgiar-ilac.org/files/publications/briefs/
    ILAC_Brief20_Evaluative_Culture

    Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of
    evaluation. Canadian Journal of Program Evaluation, 21(3), 93–120.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

    Moynihan, D. P. (2008). The dynamics of performance management: Constructing information and reform. Washington, DC: Georgetown University Press.

    Muller-Clemm, W. J., & Barnes, M. P. (1997). A historical perspective on federal program evaluation in Canada. Canadian Journal of Program Evaluation, 12(1), 47–70.

    Nathan, R. P. (2000). Social science in government: The role of policy researchers (Updated ed.). Albany, NY: Rockefeller Institute Press.

    New York City Department of Homeless Services. (2010). City council hearing general welfare committee “Oversight: DHS’s Homebase Study.” Retrieved from http://nycppf.org/html/dhs/downloads/pdf/abt_testimony_120910

    Norris, N. (2005). The politics of evaluation and the methodological imagination. American
    Journal of Evaluation, 26(4), 584–586.

    Office of the Comptroller General of Canada. (1981). Guide on the program evaluation
    function. Ottawa, Ontario, Canada: Treasury Board of Canada Secretariat.

    Organisation for Economic Cooperation and Development. (2010). Evaluation in development
    agencies: Better aid. Paris, France: Author.

    Owen, J. M., & Rogers, P. J. (1999). Program evaluation: Forms and approaches
    (International ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

    Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks, CA: Sage.

    Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

    Patton, M. Q. (2011). Developmental evaluation: Applying complexity to enhance innovation and use. New York: Guilford Press.

    Radcliffe, V. S. (1998). Efficiency audit: An assembly of rationalities and programmes. Accounting, Organizations and Society, 23(4), 377–410.

    Rist, R. C., & Stame, N. (Eds.). (2006). From studies to streams: Managing evaluative systems (Vol. 12). New Brunswick, NJ: Transaction.

    Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W. R. Shadish (Eds.), Evaluation for the 21st century: A handbook (pp. 477–500). Thousand Oaks, CA: Sage.

    Scriven, M. (2005). Review of the book: Empowerment evaluation principles in practice.
    American Journal of Evaluation, 26(3), 415–417.

    Senge, P. M. (1990). The fifth discipline: The art and practice of the learning organization (1st
    ed.). New York: Doubleday/Currency.

    Smith, N. L. (2007). Empowerment evaluation as evaluation ideology. American Journal of
    Evaluation, 28(2), 169–178.

    Smits, P., & Champagne, F. (2008). An assessment of the theoretical underpinnings of
    practical participatory evaluation. American Journal of Evaluation, 29(4), 427–442.

    Stufflebeam, D. L. (1994). Empowerment evaluation, objectivist evaluation, and evaluation
    standards: Where the future of evaluation should not go and where it needs to go.
    Evaluation Practice, 15(3), 321–338.

    Treasury Board of Canada Secretariat. (1990). Program evaluation methods: Measurement and
    attribution of program results (3rd ed.). Ottawa, Ontario, Canada: Deputy Comptroller
    General Branch, Government Review and Quality Services.

    Treasury Board of Canada Secretariat. (2009). Policy on evaluation. Retrieved from
    http://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=15024

    U.S. Office of Management and Budget. (2002). Program performance assessments for the FY
    2004 budget: Memorandum for heads of executive departments and agencies from Mitchell
    E. Daniels Jr. Retrieved from http://www.whitehouse.gov/sites/default/files/omb/
    assets/omb/memoranda/m02-10

    U.S. Office of Management and Budget. (2004). What constitutes strong evidence of a
    program’s effectiveness? Retrieved from http://www.whitehouse.gov/omb/part/2004_
    program_eval

    Watson, K. F. (1986). Programs, experiments, and other evaluations: An interview with
    Donald Campbell. Canadian Journal of Program Evaluation, 1(1), 83–86.

    Wildavsky, A. B. (1979). Speaking truth to power: The art and craft of policy analysis.
    Boston, MA: Little, Brown.

    Wolff, E. (1979). Proposed approach to program evaluation in the Government of British
    Columbia. Victoria, British Columbia, Canada: Treasury Board.

    Yarbrough, D., Shulha, L., Hopson, R., & Caruthers, F. (2011). Joint committee on standards
    for educational evaluation: A guide for evaluators and evaluation users (3rd ed.).
    Thousand Oaks, CA: Sage.

    CHAPTER 12

    THE NATURE AND PRACTICE OF
    PROFESSIONAL JUDGMENT IN PROGRAM
    EVALUATION

    Introduction
    The Nature of the Evaluation Enterprise

    What Is Good Evaluation Practice?

    Methodological Considerations

    Problems With Experimentation as a Criterion for Good Methodologies

    The Importance of Causality: The Core of the Evaluation Enterprise

    Alternative Perspectives on the Evaluation Enterprise

    Reconciling Evaluation Theory With the Diversity of Practice

    Working in the Swamp: The Real World of Evaluation Practice

    Common Ground Between Program Evaluators and Program Managers

    Situating Professional Judgment in Program Evaluation Practice

    Acquiring Knowledge and Skills for Evaluation Practice

    Professional Knowledge as Applied Theory

    Professional Knowledge as Practical Know-How

    Balancing Theoretical and Practical Knowledge in Professional Practice

    Understanding Professional Judgment

    A Modeling of the Professional Judgment Process

    The Decision Environment

Values, Beliefs, and Expectations

    Acquiring Professional Knowledge

    Improving Professional Judgment in Evaluation Through Reflective Practice

    Guidelines for the Practitioner

    The Range of Professional Judgment Skills

Ways of Improving Sound Professional Judgment Through Education and Training-Related Activities

    Teamwork and Professional Judgment

    Evaluation as a Craft: Implications for Learning to Become an Evaluation Practitioner

    Ethics for Evaluation Practice

    The Development of Ethics for Evaluation Practice

    Ethical Evaluation Practice

    Cultural Competence in Evaluation Practice

    The Prospects for an Evaluation Profession

    Summary
    Discussion Questions
    Appendix

    Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator
    Your Task
    References

    Good judgment is based on experience. Unfortunately, experience is based on poor
    judgment.

    —Anonymous

    INTRODUCTION

Chapter 12 begins by reflecting on what constitutes good evaluation methodology, reminding the
    reader that in the evaluation field there continues to be considerable disagreement about how
    we should design evaluations to assess program effectiveness and, in so doing, examine causes
    and effects. We then look at the diversity of evaluation practice, and how evaluators actually
    do their work. Developing the capacity to exercise sound professional judgment is key to
    becoming a competent evaluator.

    Much of Chapter 12 is focused on what professional judgment is, how to cultivate sound
    professional judgment, and how evaluation education and training programs can build in
    opportunities to learn the practice of exercising professional judgment. Key to developing
    one’s own capacity to render sound professional judgments is learning how to be more
    reflective in one’s evaluation practice. We introduce evaluation ethics and connect ethics to
    professional judgment in evaluation practice.

    Throughout this book, we have referred to the importance of professional judgment in the
    practice of evaluation. Our view is that evaluators rely on their professional judgment in all
    evaluation settings. Although most textbooks in the field, as well as most academic programs
    that prepare evaluators for careers as practitioners, do not make the acquisition or practice of
    sound professional judgment an explicit part of evaluator training, this does not change the fact
    that professional judgment is an integral part of our practice.

    To ignore or minimize the importance of professional judgment suggests a scenario that
    has been described by Schön (1987) as follows:

    In the varied topography of professional practice, there is the high, hard ground
    overlooking a swamp. On the high ground, manageable problems lend themselves to
    solutions through the application of research-based theory and technique. In the swampy
    lowland, messy, confusing problems defy technical solutions.… The practitioner must
    choose. Shall he remain on the high ground where he can solve relatively unimportant
    problems according to prevailing standards of rigor, or shall he descend to the swamp of
    important problems and non-rigorous inquiry? (p. 3)

    THE NATURE OF THE EVALUATION ENTERPRISE

    Evaluation can be viewed as a structured process that creates and synthesizes information that
    is intended to reduce the level of uncertainty for stakeholders about a given program or policy.
    It is intended to answer questions (see the list of evaluation questions discussed in Chapter 1)
    or test hypotheses, the results of which are then incorporated into the information bases used
    by those who have a stake in the program or policy.

    What Is Good Evaluation Practice?
    Methodological Considerations

    Views of evaluation research and practice, and in particular about what they ought to be,
    vary widely. At one end of the spectrum, advocates of a highly structured (typically
quantitative) approach to evaluations tend to emphasize the use of research designs that ensure
    sufficient internal and statistical conclusion validity that the key causal relationships between
    the program and outcomes can be isolated and tested. According to this view, experimental
    designs are the benchmark of sound evaluation designs, and departures from this ideal are
    associated with problems that either require specifically designed (and usually complex)
    methodologies to resolve limitations, or are simply not resolvable—at least to a point where
    plausible threats to internal validity are controlled.
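
    To make the logic of that benchmark concrete, here is a minimal sketch, not drawn from the text, of how an analyst might estimate a program effect when participants have been randomly assigned to program and control groups. The group sizes, outcome distributions, and the use of a two-sample t-test are illustrative assumptions rather than a prescribed method.

```python
# Minimal illustrative sketch: random assignment plus a difference-in-means test.
# All numbers are hypothetical; random assignment is what supports treating the
# difference in group means as an estimate of the program effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

program = rng.normal(loc=72, scale=10, size=120)  # outcomes, randomly assigned program group
control = rng.normal(loc=68, scale=10, size=120)  # outcomes, randomly assigned control group

effect = program.mean() - control.mean()              # simple difference in means
t_stat, p_value = stats.ttest_ind(program, control)   # two-sample t-test (statistical conclusion validity)

print(f"Estimated program effect: {effect:.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```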

    In the United States, evaluation policy for several major federal departments under the
    Bush administration (2001–2009) emphasized the importance of experimental research designs
    as the “gold standard” for program evaluations. As well, the Office of Management and Budget
    (OMB) reflected this view as it promulgated the use of the Program Assessment Rating Tool
    (PART) process between 2002 and 2009. In its 2004 guidance, the OMB states the following
    under the heading “What Constitutes Strong Evidence of a Program’s Effectiveness?” (OMB,
    2004):

    The revised PART guidance this year underscores the need for agencies to think about the
    most appropriate type of evaluation to demonstrate the effectiveness of their programs. As
    such, the guidance points to the randomized controlled trial (RCT) as an example of the
    best type of evaluation to demonstrate actual program impact. Yet, RCTs are not suitable
    for every program and generally can be employed only under very specific circumstances.
    (p. 1)

    The No Child Left Behind Act (2002) had, as a central principle, the idea that a key
    criterion for the availability of federal funds for school projects should be that a reform

    has been found, through scientifically based research to significantly improve the
    academic achievement of students participating in such program as compared to students
    in schools who have not participated in such program; or … has been found to have strong
    evidence that such program will significantly improve the academic achievement of
    participating children. (Sec. 1606(a)11(A & B))

    In Canada, a major federal department (Human Resources and Skills Development Canada)
    that funds evaluations of social service programs has specified in guidelines that randomized
    experiments are ideal for evaluations, but at minimum, evaluation designs must be based on
    quasi-experimental research designs that include comparison groups that permit before-and-
    after assessments of program effects between the program and the control groups (Human
    Resources Development Canada, 1998).
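
    As an illustration of the before-and-after, program-versus-comparison-group logic these guidelines call for, the following sketch computes a simple difference-in-differences estimate. The group means are invented for illustration; a real quasi-experimental analysis would also have to address the non-equivalence of the two groups.

```python
# Minimal sketch with hypothetical group means: a before-and-after comparison
# between a program group and a non-equivalent comparison group. The
# difference-in-differences estimate nets out change common to both groups.
program_before, program_after = 54.0, 63.0        # mean outcome, program group
comparison_before, comparison_after = 55.0, 58.0  # mean outcome, comparison group

program_change = program_after - program_before            # 9.0
comparison_change = comparison_after - comparison_before   # 3.0
did_estimate = program_change - comparison_change          # 6.0

print(f"Difference-in-differences estimate of the program effect: {did_estimate:.1f}")
```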

    Problems With Experimentation as a Criterion for Good Methodologies

    In the United States, privileging experimental and quasi-experimental designs for
    evaluations in the federal government has had a significant impact on the evaluation
    community. Although the “paradigm wars” were thought to have ended or at least been set
    aside in the 1990s (Patton, 1997), they were resurrected as U.S. government policies
    emphasizing the importance of “scientifically based research” were implemented. The merits
    of randomized controlled trials (RCTs) as the benchmark for high-quality evaluation designs
    have again been debated in conferences, evaluation journals, and Internet listserv discussions.

    Continued disagreements among evaluators about the best or most appropriate ways of
    assessing program effectiveness will affect the likelihood that evaluation will emerge as a
profession. Historically, advocates for experimental approaches have argued that their position
    is superior in part because sound, internally valid research designs obviate the need for the
    evaluator to “fill in the blanks” with information that is not gleaned
    directly from the (usually) quantitative comparisons built into the designs. The results of a
    good experimental design are said to be more valid and credible, and therefore more defensible
    as a basis for supporting decisions about a program or policy. Experimentation has also been
    linked to fostering learning cultures where new public policies are assessed incrementally and
    rationally. Donald Campbell (1991) was among the first to advocate for an “experimenting
    society.”

    Although experimental evaluations continue to be important (Ford, Gyarmati, Foley,
    Tattrie, & Jimenez, 2003; Gustafson, 2003) and are central to both the Cochrane Collaboration
    (Higgins & Green, 2011) in health-related fields and the Campbell Collaboration (2010) in
    social program fields as an essential basis for supporting the systematic reviews that are the
    mainstay of these collaborations, the view that experiments are the “gold standard” does not
    dominate the whole evaluation field. The experiences with large-scale evaluations of social
    programs in the 1970s, when enthusiasm for experimental research designs was at its highest,
    suggested that implementing large-scale RCTs was problematical (Pawson & Tilley, 1997).

    Social experiments tended to be complex and were often controversial as evaluations. The
    Kansas City Preventive Patrol Experiment (Kelling, 1974a, 1974b) was an example of a major
    evaluation that relied on an experimental design that was intended to resolve a key policy
    question: whether the level of routine preventive patrol (assigned randomly to samples of
    police patrol beats in Kansas City, Missouri) made any differences to the actual and perceived
    levels of crime and safety in the selected patrol districts of Kansas City. Because the patrol
    levels were kept “secret” to conceal them from the citizens (and presumably potential law
    breakers), the experimental results encountered a substantial external validity problem—even
    if the findings supported the hypothesis that the level of routine preventive patrol had no
    significant impacts on perceived levels of safety and crime, or on actual levels of crime, how
    could any other police department announce that it was going to reduce preventive patrol
without jeopardizing citizens’ (and politicians’) confidence? Even in the experiment itself, there
    was evidence that the police officers who responded to calls for service in the reduced patrol
    beats did so with more visibility (lights and sirens) than normal—suggesting that they wanted
    to establish their visibility in the low-patrol beats (Kelling, 1974a; Larson, 1982). In other
    words, there was a construct validity threat that was not adequately controlled—compensatory
    rivalry was operating, whereby the patrol officers in the low-patrol beats acted to “beef up” the
    perceived level of patrol in their beats.

    The Importance of Causality: The Core of the Evaluation Enterprise

    Picciotto (2011) points to the centrality of program effectiveness as a core issue for
    evaluation as a discipline/profession:

    What distinguishes evaluation from neighboring disciplines is its unique role in bridging
    social science theory and policy practice. By focusing on whether a policy, a programme
    or project is working or not (and unearthing the reasons why by attributing outcomes)
    evaluation acts as a transmission belt between the academy and the policy-making. (p.
    175)

    Advocates for experimental research designs point out that since most evaluations include,
    as a core issue, whether the program was effective, experimental designs are the best and least
    ambiguous way to answer these causal questions.

    Michael Scriven (2008) has taken an active role, since the changes in U.S. policies have
    privileged RCTs, in challenging the primacy of experimental designs as the methodological
    backbone of evaluations; in the introduction to a paper published in 2008, he asserts, “The
    causal wars are still raging, and the amount of collateral damage is increasing” (p. 11). In a
    series of publications (Cook, Scriven, Coryn, & Evergreen, 2010; Scriven, 2008), he has
    argued that it is possible to generate valid causal knowledge in many other ways, and argues
    for a pluralism of methods that are situationally appropriate in the evaluation of programs. In
    one of his presentations, a key point he makes is that human beings are “hardwired” to look for
    causal relationships in the world around them. In an evolutionary sense, we have a built-in
    capacity to observe causal connections. In Scriven’s (2004) words,

    Our experience of the world and our part in it, is not only well understood by us but
    pervasively, essentially, perpetually a causal experience. A thousand times a day we
    observe causation, directly and validly, accurately and sometimes precisely. We see
    people riding bicycles, driving trucks, carrying boxes up stairs, turning the pages of
    books, picking goods off shelves, calling names, and so on. So, the basic kind of causal
    data, vast quantities of highly reliable and checkable causal data, comes from observation,
    not from elaborate experiments. Experiments, especially RCTs, are a marvelously
    ingenious extension of our observational skills, enabling us to infer to causal conclusions
    where observation alone cannot take us. But it is surely to reverse reality to suppose that
    they are the primary or only source of reliable causal claims: they are, rather, the realm of
    flight for such claims, where the causal claims of our everyday lives are the ground traffic
    of them. (pp. 6–7)

    Alternative Perspectives on the Evaluation Enterprise

    At the other end of the spectrum are approaches that eschew positivistic or post-positivistic
    approaches to evaluation and advocate methodologies that are rooted in anthropology or
    subfields of sociology or other disciplines. Advocates of these interpretivist (generally
    qualitative) approaches have pointed out that the positivist view of evaluation is itself based on
    a set of beliefs about observing and measuring patterns of human interactions. We introduced
    different approaches to qualitative evaluation in Chapter 5 and pointed out that in the 1980s a
    different version of “paradigm wars” happened in the evaluation field—that “war” was
    between the so-called quals and the quants.

    Quantitative methods cannot claim to eliminate the need for evaluators to use professional
    judgment. Smith (1994) argues that quantitative methods necessarily involve judgment calls:

    Decisions about what to examine, which questions to explore, which indicators to choose,
    which participants and stakeholders to tap, how to respond to unanticipated problems in
    the field, which contrary data to report, and what to do with marginally significant
    statistical results are judgment calls. As such they are value-laden and hence subjective.…
    Overall the degree of bias that one can control through random assignment or blinded
    assessment is a minute speck in the cosmos of bias. (pp. 38–39)

    Moreover, advocates of qualitative approaches argue that quantitative methods miss the
    meaning of much of human behavior. Understanding intentions is critical to getting at what the
    “data” really mean, and the only way to do that is to embrace methods that treat individuals as
    unique sense-makers who need to be understood on their own terms (Schwandt, 2000).

    Kundin (2010) advocates the use of qualitative, naturalistic approaches to understand how
    evaluators use their “knowledge, experience and judgment to make decisions in their everyday
    work” (p. 350). Her view is that the methodology-focused logic of inquiry that is often
    embedded in textbooks does not reflect actual practice:

    Although this logic is widely discussed in the evaluation literature, some believe it is
    rarely relied upon in practice. Instead researchers … suggest that evaluators use their
    intuition, judgment, and experience to understand the evaluand, and by doing so, they
    understand its merits through an integrated act of perceiving and valuing. (p. 352)

    More recently, evaluators who have focused on getting evaluations used have tended to
    take a pragmatic stance in their approaches, mixing qualitative and quantitative
    methodologies in ways that are intended to be situationally appropriate (Patton, 2008). They
    recognize the value of being able to use structured designs where they are feasible and
    appropriate, but also recognize the value of employing a wide range of complementary (mixed
    methods) approaches in a given situation to create information that is credible, and hence more
    likely to be used. We discussed mixed-methods designs in Chapter 5.

    Reconciling Evaluation Theory With the Diversity of Practice

    The practice of program evaluation is even more diverse than the range of normative
    approaches and perspectives that populate the textbook and coursework landscape.
    Experimental evaluations continue to be done and are still viewed by many practitioners as
    exemplars (Chen, Donaldson, & Mark, 2011; Henry & Mark, 2003). Substantial investments in
    time and resources are typically required, and this limits the number and scope of evaluations
    that are able to randomly assign units of analysis (usually people) to program and control
    conditions.

Conducting experimental evaluations entails creating a structure that may produce
    statistical conclusion and internal validity yet fail to inform decisions about implementing
    the program or policy in non-experimental settings (the Kansas City Preventive Patrol
    Experiment is an example of that) (Kelling, 1974a; Larson, 1982). Typically, experiments are
    time limited, and as a result participants can adjust their behaviors to their expectations of how
    long the experiment will last, as well as to what their incentives are as it is occurring. Cronbach
    (1982) was eloquent in his criticisms of the (then) emphasis on internal validity as the central
    criterion for judging the quality of research designs for evaluations of policies and programs.
    He argued that the uniqueness of experimental settings undermines the extent to which well-
    constructed (internally valid) experiments can be generalized to other units of analysis,
    treatments, observing operations, and settings (UTOS). Cronbach argued for the primacy of
    external validity of evaluations to make them more relevant to policy settings. Shadish, Cook,
    and Campbell’s (2002) book can be seen in part as an effort to address Cronbach’s criticisms
    of the original Cook and Campbell (1979) book on experimental and quasi-experimental
    research designs.

    The existence of controversies over the construction, execution, and interpretation of many
    of the large-scale social experiments that were conducted in the 1970s to evaluate programs
and policies suggests that very few methodologies are unassailable—even experimental
    research designs (Basilevsky & Hum, 1984). The craft of evaluation research, even research
    that is based on randomly assigned treatment and control conditions, is such that its
    practitioners do not agree about what exemplary practice is, even in a given situation.

    In the practice of evaluation, it is rare to have the resources and the control over the
    program setting needed to conduct even a quasi-experimental evaluation. Instead, typical
    evaluation settings are limited by significant resource constraints and the expectation that the
    evaluation process will somehow fit into the existing administrative process that has
    implemented a policy or a program. The widespread interest in performance measurement as
    an evaluation approach tends to be associated with an assumption that existing managerial and
information technology resources will be sufficient to implement performance measurement
    systems, and produce information for formative and summative evaluative purposes. We have
    pointed out the limitations of substituting performance measurement for program evaluation in
    Chapters 1 and 8 of this textbook.

    Working in the Swamp: The Real World of Evaluation Practice

    Typical program evaluation methodologies rely on multiple, independent data sources to
    “strengthen” research designs that are case study or implicit designs (diagrammed in Chapter 3
    as XO designs, where X is the program and O is the set of observations on the outcomes that
    are expected to be affected by the program). The program has been implemented at some time
    in the past, and now, the evaluator is expected to assess program effectiveness. There is no
    pretest and no control group; there are insufficient resources to construct these comparisons,
    and in most situations, comparison groups would not exist. Although multiple data sources
    permit triangulation of findings, that does not change the fact that the basic research design is
    the same; it is simply repeated for each data source (which is a strength since measurement
    errors would likely be independent) but is still subject to all the weaknesses of that design. In
    sum, typical program evaluations are conducted after the program is implemented, in settings
    where the evaluation team has to rely on evidence about the program group alone (i.e., there is
    no control group). In most evaluation settings, these designs rely on mixed qualitative and
    quantitative lines of evidence.
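
    A small simulation can illustrate what triangulation across independent lines of evidence does, and does not, buy in an XO design. Everything here is hypothetical: the point is only that averaging independent, noisy measurements of the same outcome reduces measurement error while leaving the single-group design, and its internal validity threats, unchanged.

```python
# Minimal sketch (hypothetical values): three independent lines of evidence in an
# X-O design, each measuring the same post-program outcome with its own error.
import numpy as np

rng = np.random.default_rng(7)
true_outcome = 10.0

# e.g., a client survey, a file review, and key-informant interviews
lines_of_evidence = true_outcome + rng.normal(0.0, 2.0, size=3)

triangulated = lines_of_evidence.mean()
print("Individual estimates:", np.round(lines_of_evidence, 2))
print(f"Triangulated estimate: {triangulated:.2f}")
# Triangulation tightens the measurement, but no counterfactual has been added:
# the design is still X-O, so causal claims still rest on professional judgment.
```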

    In such situations, some evaluators would advocate not using the evaluation results to make
    any causal inferences about the program. In other words, it is argued that such evaluations
    ought not to be used to try to address the question: “Did the program make a difference, and if
    so, what difference(s) did it make?” Instead the evaluation should be limited to addressing the
    question of whether intended outcomes were actually achieved, regardless of whether the
    program “produced” those outcomes. That is essentially what performance measurement
    systems do.

But many evaluations are commissioned because stakeholders need to know whether the program
    worked and why. Even formative evaluations often include questions about the effectiveness of
    the program (Cronbach, 1980; Weiss, 1998). Answering the “why” question entails looking at
    causes and effects.

    In situations where a client wants to know if and why the program was effective, and there
    is clearly insufficient time, money, and control to construct an evaluation design that meets
    criteria that are textbook-appropriate for answering those questions using an experimental
design, evaluators have a choice. They can advise their client that answering whether the
    program or policy worked—and why—is perhaps not feasible, or they can proceed,
    understanding that their work may not be as defensible as some textbooks would advocate.

    Usually, some variation of the work proceeds. Although comparisons between program and
    no-program groups are not possible, comparisons among program recipients, comparisons over
    time for program recipients who have participated in the program, and comparisons among the
    perspectives of other stakeholders are all possible. We maintain that the way to answer causal
    questions without research designs that can categorically rule out rival hypotheses is to

    acknowledge that in addressing issues like program effectiveness (which we take to be the
    central question in most evaluations), we cannot offer definitive findings or conclusions.
    Instead, our findings, conclusions, and our recommendations, supported by the evidence at
    hand and by our professional judgment, will reduce the uncertainty associated with the
    question.

    In this textbook we have argued that in all evaluations, regardless of how sophisticated they
    are, evaluators use one form or another of professional judgment. The difference between the
    most sophisticated experimentally designed evaluation and an evaluation based on a case
    study/implicit design is the amount and the kinds of professional judgments that are
    entailed—not that the former is appropriate for assessing program effectiveness and the latter
    is not. Unlike some who have commented on the role of professional judgment in program
    evaluations and see making judgments as a particular phase in the evaluation process (Skolits,
    Morrow, & Burr, 2009), we see professional judgment being exercised throughout the entire
    evaluation process.

    Where a research design is (necessarily) weak, we introduce to a greater extent our own
    experience and our own assessments, which in turn are conditioned by our values, beliefs, and
    expectations. These become part of the basis on which we interpret the evidence at hand and
    are also a part of the conclusions and the recommendations. This professional judgment
    component in every evaluation means that we should be aware of what it is, and learn how to
    cultivate sound professional judgment.

    Common Ground Between Program Evaluators and Program Managers

    The view that all evaluations incorporate professional judgments to a greater or lesser
    extent means that evaluators have a lot in common with program managers. Managers often
    conduct assessments of the consequences of their decisions—informal evaluations, if you will.
These assessments are not usually based on a systematic gathering of information, but instead
    often rely on a manager’s own observations, values, beliefs, expectations, and
    experiences—their professional judgment. That these assessments are done “on the fly” and
    are often based on information that is gathered using research designs that do not warrant
    causal conclusions does not vitiate their being the basis for good management practice. Good
    managers become skilled at being able to recognize patterns in the complexity of their
    environments. Inferences from observed or sensed patterns (Mark, Henry, & Julnes, 2000) to
    causal linkages are informed by their experience and judgment, are tested by observing and
    often participating in the consequences of a decision, and in turn add to the fund of knowledge
    and experience that contributes to their capacity to make sound professional judgments.

    Situating Professional Judgment in Program Evaluation Practice

    Scriven (1994) emphasizes the centrality of judgment (judgments of merit and worth) in
    the synthesis of evaluation findings/lines of evidence to render an overall assessment of a
    program. For Scriven, the process of building toward and then rendering a holistic evaluation
    judgment is a central task for evaluators, a view reflected by others (Skolits et al., 2009).
    Judgments can be improved by constructing rules or decision processes that make explicit how
    evidence will be assessed and weighted. Scriven (1994) suggests that, generally, judgments
    supported by decision criteria are superior to those that are intuitive.

    Although evaluations typically use several different lines of evidence to assess a program’s
    effectiveness and, in the process, have different measures of effectiveness, methodologies such

    as cost–utility analysis exist for weighting and amalgamating findings that combine multiple
    measures of program effectiveness (Levin & McEwan, 2001). However, they are data
    intensive, and apart from the health sector, they are not widely used. The more typical situation
    is described by House and Howe (1999). They point out that the findings from various lines of
    evidence in an evaluation may well contain conflicting information about the worth of a
    program. In this situation, evaluators use their professional judgment to produce an overall
    conclusion. The process of rendering such a judgment engages the evaluator’s own knowledge,
    values, beliefs, and expectations. House and Howe (1999) describe this process:

    The evaluator is able to take relevant multiple criteria and interests and combine them into
    all-things-considered judgments in which everything is consolidated and related.… Like a
    referee in a ball game, the evaluator must follow certain sets of rules, procedures, and
    considerations—not just anything goes. Although judgment is involved, it is judgment
    exercised within the constraints of the setting and accepted practice. Two different
    evaluators might make different determinations, as might two referees, but acceptable
    interpretations are limited. In the sense that there is room for the evaluator to employ
    judgment, the deliberative process is individual. In the sense that the situation is
    constrained, the judgment is professional. (p. 29)
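
    Returning to the point above about weighting and amalgamating multiple measures of effectiveness, the following sketch shows one simple way an evaluation team might make its weighting explicit before synthesizing lines of evidence. The criteria, scores, and weights are invented for illustration and this is not a cost–utility analysis in the formal sense.

```python
# Minimal sketch (hypothetical criteria, scores, and weights): an explicit,
# transparent weighting of several lines of evidence into one overall rating.
findings = {
    "client outcomes":          {"score": 70, "weight": 0.5},
    "implementation fidelity":  {"score": 85, "weight": 0.3},
    "stakeholder satisfaction": {"score": 60, "weight": 0.2},
}

assert abs(sum(f["weight"] for f in findings.values()) - 1.0) < 1e-9  # weights sum to 1

overall = sum(f["score"] * f["weight"] for f in findings.values())
print(f"Weighted overall assessment: {overall:.1f} / 100")  # 72.5
```

    Making the weights explicit does not remove judgment from the synthesis; it moves the judgment into the choice of criteria and weights, where it can be examined and debated.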

    There are also many situations where evaluators must make judgments in the absence of
    clear methodological constraints or rules to follow. House and Howe (1999) go on to point out
    that

    for evaluators, personal responsibility is a cost of doing business, just as it is for
    physicians, who must make dozens of clinical judgments each day and hope for the best.
    The rules and procedures of no profession are explicit enough to prevent this. (p. 30)

    Although House and Howe point out that evaluators must make judgments, the process by
    which judgments are made is nevertheless not well understood. Hurteau, Houle, and Mongiat
    (2009), in a meta-analysis of 50 evaluation studies, examined the ways that judgments are
    evidenced and found that in only 20 of those studies had the evaluator(s) made a judgment
based on the findings. In addition, in none of those 20 studies did the evaluators describe the
    process that they used to render the judgment(s).

    Program evaluators are currently engaged in debates around the issue of professionalizing
    evaluation. One element of that debate is whether and how our knowledge and our practice can
    be codified so that evaluation is viewed as a coherent body of knowledge and skills, and
practitioners are seen to have a consistent set of competencies (King, Stevahn, Ghere, &
    Minnema, 2001; Stevahn, King, Ghere, & Minnema, 2005b). This debate has focused in part
    on what is needed to be an effective evaluator—core competencies that provide a framework
    for assessing the adequacy of evaluation training as well as the adequacy of evaluator practice.

    In a study of 31 evaluation professionals in the United States, practitioners were asked to
    rate the importance of 49 evaluator competencies (King et al., 2001) and then try to come to a
    consensus about the ratings, given feedback on how their peers had rated each item. The 49
    items were grouped into four broad clusters of competencies: (1) systematic inquiry (most
    items were about methodological knowledge and skills), (2) competent evaluation practice
    (most items focused on organizational and project management skills), (3) general skills for
    evaluation practice (most items were on communication, teamwork, and negotiation skills),
    and (4) evaluation professionalism (most items focused on self-development and training,
    ethics and standards, and involvement in the evaluation profession).

    Among the 49 competencies, one was “making judgments” and referred to an overall
    evaluative judgment, as opposed to recommendations, at the end of an evaluation (King et al.,
    2001, p. 233). It was rated the second lowest on average among all the competencies. This
    finding suggests that judgment, comparatively, is not perceived to be that important (although
    the item average was still 74.68 out of 100 possible points). King et al. (2001) suggested that
    “some evaluators agreed with Michael Scriven that to evaluate is to judge; others did not” (p.
    245). The “reflects on practice” item, however, was given an average rating of 93.23—a
    ranking of 17 among the 49 items. Schön (1987) makes reflection on one’s practice the key
    element in being able to develop sound professional judgment. For both of these items, there
    was substantial variation among the practitioners about their ratings, with individual ratings
    ranging from 100 (highest possible score) to 20. The discrepancy between the low overall
    score for “making judgments” and the higher score for “reflects on practice” may be related to
    the difference between making a judgment, as an action, and reflecting on practice, as a
    personal quality.

    We see professional judgment being a part of the whole process of working with clients,
    framing evaluation questions, designing and conducting evaluation research, analyzing and
    interpreting the information, and communicating the findings, conclusions, and
    recommendations to stakeholders. If you go back to the outline of a program evaluation
    process offered in Chapter 1, or the outline of the design and implementation of a performance
    measurement system offered in Chapter 9, you will see professional judgment is a part of all
    the steps in both processes. Furthermore, we see different kinds of professional judgment being
    more or less important at different stages in evaluation processes. We will come back to the
    relationships between evaluation competencies and professional judgment later in this chapter.

    ACQUIRING KNOWLEDGE AND SKILLS FOR EVALUATION
    PRACTICE

    The idea that evaluation is a profession, or aspires to be a profession, is an important part of
    contemporary discussions of the scope and direction of the enterprise (Altschuld, 1999).
    Modarresi, Newman, and Abolafia (2001) quote Leonard Bickman (1997), who was president
    of the American Evaluation Association (AEA) in 1997, in asserting that “we need to move
    ahead with professionalizing evaluation or else we will just drift into oblivion” (p. 1). Bickman
    and others in the evaluation field were aware that other related professions continue to carve
    out territory, sometimes at the expense of evaluators. Picciotto (2011) points out, however, that
    “heated doctrinal disputes within the membership of the AEA have blocked progress [towards
    professionalization] in the USA” (p. 165).

    What does it mean to be a professional? What distinguishes a profession from other
    occupations? Eraut (1994) suggests that professions are characterized by the following: a core
    body of knowledge that is shared through the training and education of those in the profession;
    some kind of government-sanctioned license to practice; a code of ethics and standards of
    practice; and self-regulation (and sanctions for wrongdoings) through some kind of
    professional association to which members of the practice community must belong.

    Professional Knowledge as Applied Theory

    The core body of knowledge that is shared among members of a profession can be
    characterized as knowledge that is codified, publicly available (taught for and learned by

    aspiring members of the profession), and supported by validated theory (Eraut, 1994). One
    view of professional practice is that competent members of a profession apply this validated
    theoretical knowledge in their work. Competent practitioners are persons who have the
    credentials of the profession (including evidence that they have requisite knowledge) and have
    established a reputation for being able to translate theoretical knowledge into sound practice.

    Professional Knowledge as Practical Know-How

    An alternative view of professional knowledge is that it is the application of practical
    know-how to particular situations. The competent practitioner uses his or her experiential and
    intuitive knowledge to assess a situation and offer a diagnosis (in the health field) or a decision
    in other professions (Eraut, 1994). Although theoretical knowledge is a part of what competent
    practitioners rely on in their work, practice is seen as more than applying theoretical
    knowledge. It includes a substantial component that is learned through practice itself. Although
    some of this knowledge can be codified and shared (Schön, 1987; Tripp, 1993), part of it is
    tacit, that is, known to individual practitioners, but not shareable in the same ways that we
    share the knowledge in textbooks, lectures, or other publicly accessible learning and teaching
    modalities (Schwandt, 2008).

    Polanyi (1958) described tacit knowledge as the capacity we have as human beings to
    integrate “facts” (data and perceptions) into patterns. He defined tacit knowledge in terms of
    the process of discovering theory: “This act of integration, which we can identify both in the
    visual perception of objects and in the discovery of scientific theories, is the tacit power we
    have been looking for. I shall call it tacit knowing” (Polanyi & Grene, 1969, p. 140).

    For Polanyi, tacit knowledge cannot be communicated directly. It has to be learned through
    one’s own experiences—it is by definition personal knowledge. Knowing how to ride a
    bicycle, for example, is in part tacit. We can describe to others how the physics and the
mechanics of getting onto a bicycle and riding it work, but the experience of getting onto the
    bicycle, pedaling, and getting it to stay up is quite different from being told how to do so.

    One implication of acknowledging that what we know is in part personal is that we cannot
    teach everything that is needed to learn a skill. The learner can be guided with textbooks, good
    examples, and even demonstrations, but that knowledge (Polanyi calls it impersonal
    knowledge) must be combined with the learner’s own capacity to tacitly know—to experience
    the realization (or a series of them) that he or she understands how to use the skill.

    Clearly, from this point of view, practice is an essential part of learning. One’s own
    experience is essential for fully integrating impersonal knowledge into working knowledge.
    But because the skill that has been learned is in part tacit, when the learner tries to
    communicate it, he or she will discover that, at some point, the best advice is to suggest that
    the new learner try it and “learn by doing.” This is a key part of craftsmanship.

    Balancing Theoretical and Practical Knowledge in Professional Practice

    The difference between the applied theory and the practical know-how views of
    professional knowledge (Fish & Coles, 1998; Schwandt, 2008) has been characterized as the
    difference between knowing that (publicly accessible, propositional knowledge and skills) and
    knowing how (practical, intuitive, experientially grounded knowledge that involves wisdom, or
    what Aristotle called praxis) (Eraut, 1994).

    These two views of professional knowledge highlight different views of what professional
    practice is and indeed ought to be. The first view can be illustrated with an example. In the

    field of medicine, the technical/rational view of professional knowledge and professional
    practice continues to support efforts to construct and use expert systems—software systems
    that can offer a diagnosis based on a logic model that links combinations of symptoms in a
    probabilistic tree to possible diagnoses (Fish & Coles, 1998). By inputting the symptoms that
    are either observed or reported by the patient, the expert system (embodying the public
    knowledge that is presumably available to competent practitioners) can treat the diagnosis as a
    problem to solve. Clinical decision making employs algorithms that produce a probabilistic
    assessment of the likelihood that symptom, drug, and other technical information will support
one or another alternative diagnosis.
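
    As a toy illustration of the expert-system idea described above, the sketch below scores two invented diagnoses against a set of observed symptoms using made-up prior and conditional probabilities (a naive Bayes-style calculation). It is a caricature for teaching purposes, not a clinical tool and not the actual systems referred to in the text.

```python
# Minimal sketch (all probabilities invented): mapping a combination of symptoms
# to a probabilistic ranking of diagnoses, in the spirit of an expert system.
priors = {"flu": 0.05, "common_cold": 0.20}   # P(diagnosis)
likelihoods = {                               # P(symptom present | diagnosis)
    "flu":         {"fever": 0.90, "cough": 0.80, "fatigue": 0.85},
    "common_cold": {"fever": 0.10, "cough": 0.70, "fatigue": 0.40},
}

def rank_diagnoses(observed_symptoms):
    scores = {}
    for diagnosis, prior in priors.items():
        score = prior
        for symptom in observed_symptoms:
            score *= likelihoods[diagnosis].get(symptom, 0.01)
        scores[diagnosis] = score
    total = sum(scores.values())
    return {d: round(s / total, 3) for d, s in scores.items()}  # normalized probabilities

print(rank_diagnoses(["fever", "cough"]))  # e.g., {'flu': 0.72, 'common_cold': 0.28}
```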

    The second view of professional knowledge as practical know-how embraces a view of
    professional practice as craftsmanship and artistry. Although it acknowledges the importance
    of experience in becoming a competent practitioner, it also complicates our efforts to
    understand the nature of professional practice. If practitioners know things that they cannot
    share and their knowledge is an essential part of sound practice, how do professions find ways
    of ensuring that their members are competent?

    Schwandt (2008) recognizes the importance of balancing applied theory and practical
    knowledge in evaluation. His concern is with the tendency, particularly in performance
    management systems where practice is circumscribed by a focus on outputs and outcomes, to
    force “good practice” to conform to some set of performance measures and performance
    results:

    The fundamental distinction between instrumental reason as the hallmark of technical
    knowledge and judgment as the defining characteristic of practical knowledge is
    instinctively recognizable to many practitioners … Yet the idea that “good” practice
    depends in a significant way on the experiential, existential, knowledge we speak of as
    perceptivity, insightfulness, and deliberative judgment is always in danger of being
    overrun by (or at least regarded as inferior to) an ideal of “good” practice grounded in
    notions of objectivity, control, predictability, generalizability beyond specific
    circumstances, and unambiguous criteria for establishing accountability and success. This
    danger seems to be particularly acute of late, as notions of auditable performance, output
    measurement, and quality assurance have come to dominate the ways in which human
    services are defined and evaluated. (p. 37)

    The idea of balance is further explored in the section below, where we discuss various
    aspects of professional judgment.

    UNDERSTANDING PROFESSIONAL JUDGMENT

    What are the different kinds of professional judgment? How does professional judgment
    impact the range of decisions that evaluators make? Can we construct a model of how
    professional judgment affects evaluation-related decisions?

    Fish and Coles (1998) have constructed a typology of four kinds of professional judgment
    in the health care field. We believe that these can be generalized to the evaluation field. Each
    builds on the previous one; the extent and kinds of judgment differ across the four kinds. At
    one end of the continuum, practitioners apply technical judgments that are about specific
    issues involving routine tasks. Typical questions include the following: What do I do now?
    How do I apply my existing knowledge and skills to do this routine task? In an evaluation, an

    example of this kind of judgment would be how to select a random sample from a population
    of case files in a social service agency.
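
    A technical judgment of this kind is mostly about executing a routine procedure correctly. As a minimal sketch, with hypothetical file identifiers and sample size, drawing such a random sample might look like this:

```python
# Minimal sketch (hypothetical IDs and sample size): a simple random sample of
# case files from an agency's caseload, with a fixed seed so it can be reproduced.
import random

random.seed(2024)
case_file_ids = [f"CF-{i:05d}" for i in range(1, 1201)]  # 1,200 case files on record
sample = random.sample(case_file_ids, k=100)             # simple random sample of 100 files

print(sample[:5])
```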

    The next level is procedural judgment, which focuses on procedural questions and
    involves the practitioner comparing the skills/tools that he or she has available to accomplish a
    task. Practitioners ask questions like “What are my choices to do this task?” “From among the
    tools/knowledge/skills available to me, which combination works best for this task?” An
    example from an evaluation would be deciding how to contact clients in a social service
    agency—whether to use a survey (and if so, mailing, telephone, interview format, or some
    combination) or use focus groups (and if so, how many, where, how many participants in each,
    how to gather them).

    The third level of professional judgment is reflective. It again assumes that the task or the
    problem is a given, but now the practitioner is asking the following questions: How do I tackle
    this task? Given what I know, what are the ways that I could proceed? Are the tools that are
    easily within reach adequate, or instead, should I be trying some new combination or perhaps
    developing some new ways of dealing with this task or problem? A defining characteristic of
    this third level of professional judgment is that the practitioner is reflecting on his or her
    practice and seeking ways to enhance his or her practical knowledge and skills and perhaps
    innovate to address a given situation.

    An example from a needs assessment for child sexual abuse prevention programs in an
    urban school district serves to illustrate reflective judgment on the part of the evaluator in
    deciding on the research methodology. Classes from an elementary school are invited to attend
    a play acted by school children of the same ages as the audience. The play is called “No More
    Secrets” and is about an adult–child relationship that involves touching and other activities. At
    one point in the play, the “adult” tells the “child” that their touching games will be their secret.
    The play is introduced by a professional counselor, and after the play, children are invited to
    write questions on cards that the counselor will answer. The children are told that if they have
    questions about their own relationships with adults, these questions will be answered
    confidentially by the counselor. The evaluator, having obtained written permissions from the
    parents, contacts the counselor, who, without revealing the identities of any of the children,
    indicates to the evaluator the number of potentially abusive situations among the students who
    attended the play. Knowing the proportion of the school district's students that attended the play,
    the evaluator is able to roughly estimate the incidence of potentially abusive situations in that
    school-age population.
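    As a rough illustration of the extrapolation involved, consider the following sketch, which assumes (hypothetically) that the attending classes are reasonably representative of the district's school-age population; all numbers are invented for illustration:

        # Hypothetical figures supplied by the counselor and the school district
        flagged_questions = 14      # potentially abusive situations identified after the play
        students_attending = 350    # students who attended the play
        district_population = 4200  # school-age students in the district

        # Rate among attendees, extrapolated to the district as a whole
        rate = flagged_questions / students_attending
        estimated_cases = rate * district_population
        print(round(estimated_cases))  # prints 168, a rough district-wide estimate

    The reflective judgment lies less in the arithmetic than in deciding whether the assumptions behind it (representativeness, the candor of the children, confidentiality constraints) make the estimate defensible.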

    The fourth level of professional judgment is deliberative—it explicitly involves a
    practitioner’s own values. Here the practitioner is asking the following question: What ought I
    to be doing in this situation? No longer are the ends or the tasks fixed, but instead the
    professional is taking a broad view that includes the possibility that the task or problem may or
    may not be an appropriate one to pursue. Professionals at this level are asking questions about
    the nature of their practice and connecting what they do as professionals with their broader
    values and moral standards. We discuss evaluation ethics later in this chapter. The case study
    in Appendix A of this chapter is an example of a situation that would involve deliberative
    judgment.

    A Modeling of the Professional Judgment Process

    Since professional judgment spans the evaluation process, it will influence a wide range of
    decisions that evaluators make in their practice. The four types of professional judgment that
    Fish and Coles (1998) describe suggest decisions of increasing complexity from discrete
    technical decisions to global decisions that can affect an evaluator's present and future roles as an evaluation practitioner. Figure 12.1 displays a model of the way that professional judgment
    is involved in evaluator decision making. The model focuses on a single decision—a typical
    evaluation would involve many such decisions of varying complexity. In the model, evaluator
    values, beliefs, and expectations, together with both shareable and practical (tacit) knowledge, combine to create a fund of experience that is tapped for professional judgments. In turn,
    professional judgments influence the decision at hand.

    We will present the model and then discuss it, elaborating on the meanings of the key
    constructs in the model.

    Evaluator decisions have consequences. They may be small—choosing a particular alpha
    (α) level for tests of statistical significance will have an impact on which findings are
    noteworthy, given a criterion that significant findings are worth reporting; or they may be
    large—deciding not to conduct an evaluation in a situation where the desired findings are being
    specified in advance by a key stakeholder could affect the career of an evaluation practitioner.
    These consequences feed back to our knowledge (both our shareable and our practical know-
    how), values, beliefs, and expectations. Evaluators have an opportunity to learn from each
    decision, and one of our challenges as professionals is to increase the likelihood that we take
    advantage of such learning opportunities. We will discuss reflective practice later in this
    chapter.
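    A small sketch can illustrate the first kind of consequence mentioned above: how the choice of an alpha level changes which findings are flagged as noteworthy. The p-values and outcome names here are hypothetical:

        # Hypothetical p-values from a set of outcome comparisons in an evaluation
        p_values = {"employment": 0.03, "earnings": 0.008, "well_being": 0.07}

        for alpha in (0.05, 0.01):
            flagged = [name for name, p in p_values.items() if p < alpha]
            print(f"alpha = {alpha}: reportable findings -> {flagged}")

        # At alpha = 0.05 two findings clear the bar; at alpha = 0.01 only one does,
        # so the same data yield a different story depending on the evaluator's choice.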

    Figure 12.1 The Professional Judgment Process

    The model can be unpacked by discussing key constructs in it. Some constructs have been
    elaborated in this chapter already (shareable knowledge, practical know-how, and professional
    judgment), but it is worthwhile to define each one explicitly in one table. Table 12.1
    summarizes the constructs in Figure 12.1 and offers a short definition of each. Several of the
    constructs will then be discussed further to help us understand what roles they play in the
    process of forming and applying professional judgment.

    Table 12.1 Definitions of Constructs in the Model of the Professional Judgment Process

    Values: Values are statements about what is desirable, what ought to be, in a given situation.

    Beliefs: Beliefs are about what we take to be true—our assumptions about how we know what we know (our epistemologies are examples of beliefs).

    Expectations: Expectations are assumptions that are typically based on what we have learned and what we have come to accept as normal. Expectations can limit what we are able to "see" in particular situations.

    Shareable knowledge: Knowledge that is typically found in textbooks or other such media; knowledge that forms the core of the formal training and education of professionals in a field.

    Practical know-how: Practical know-how is the knowledge that is gained through practice. It complements shareable knowledge and can be tacit—that is, acquired from one's professional practice and not shareable.

    Experience: Experience is an amalgam of our knowledge, values, beliefs, expectations, and practical know-how. For a given decision, we have a "fund" of experience that we can draw from. We can augment that fund with learning, and from the consequences of the decisions we make as professionals.

    Professional judgment: Professional judgment is a process that relies on our experience and ranges from technical judgments to deliberative judgments.

    Decision: In a typical evaluation, evaluators make hundreds of decisions that collectively define the evaluation process. Decisions are choices—a choice made by an evaluator about everything from discrete methodological issues to global values-based decisions that affect the whole evaluation (and perhaps future evaluations).

    Consequences: Each decision has consequences—for the evaluator and for the evaluation process. Consequences can range from discrete to global, commensurate with the scope and implications of the decision.

    Decision environment: The decision environment is the set of factors that influences the decision-making process, including the stock of knowledge that is available to the evaluator. Among the factors that could impact an evaluator decision are professional standards, resources (including time and data), incentives (perceived consequences that induce a particular pattern of behavior), and constraints (legal, institutional, and regulatory requirements that specify the ways that evaluator decisions must fit a decision environment).

    The Decision Environment

    The particular situation or problem at hand influences how a program evaluator’s
    professional judgment will be exercised. Each opportunity for professional judgment will have
    unique characteristics that will demand that it be approached in particular ways. For example, a
    methodological issue will require a different kind of professional judgment from one that
    centers on an ethical issue. Even two cases involving a similar question of methodological
    choice will have facts about each of them that will influence the professional judgment
    process. We would agree with evaluators who argue that methodologies need to be
    situationally appropriate, avoiding a one-size-fits-all approach (Patton, 2008). The extent to which the relevant information about a particular situation is known or understood by the evaluator will affect the professional judgment process.

    The decision environment includes constraints and incentives and costs and benefits, both
    real and perceived, that affect professional judgment. Some examples include the expectations
    of the client, the professional’s lines of accountability, tight deadlines, complex and conflicting
    objectives, and financial constraints. For people working within an organization—for example,
    internal evaluators—the organization also presents a significant set of environmental factors, in
    that its particular culture, goals, and objectives may have an impact on the way the professional
    judgment process operates.

    Relevant professional principles and standards such as the AEA’s (2004) “Guiding
    Principles for Evaluators” also form part of the judgment environment because, to some extent,
    they interact with and condition the free exercise of judgment by professionals and replace
    individual judgment with collective judgment (Gibbins & Mason, 1988). We will come back to
    evaluation standards later in this chapter.

    Values, Beliefs, and Expectations

    Professional judgment is influenced by personal characteristics of the person exercising it.
    It must always be kept in mind that “judgment is a human process, with logical, psychological,
    social, legal, and even political overtones” (Gibbins & Mason, 1988, p. 18). Each of us has a
    unique combination of values, beliefs, and expectations that make us who we are, and each of
    us has internalized a set of professional norms that make us the kind of practitioner that we are.
    These personal factors can lead two professionals to make quite different professional
    judgments about the same situation (Tripp, 1993).

    Among the personal characteristics that can influence one’s professional judgment,
    expectations are among the most important. Expectations have been linked to paradigms: perceptual and theoretical structures that function as frameworks for organizing one's perspectives, even one's beliefs about what is real and what is taken to be factual. Kuhn (1962)
    has suggested that paradigms are formed through our education and training. Eraut (1994) has
    suggested that the process of learning to become a professional is akin to absorbing an
    ideology.

    Our past experiences (including the consequences of previous decisions we have made in
    our practice) predispose us to understand or even expect some things and not others, to
    interpret situations, and consequently to behave in certain ways rather than in others. As
    Abercrombie (1960) argues, “We never come to an act of perception with an entirely blank
    mind, but are always in a state of preparedness or expectancy, because of our past experiences”
    (p. 53). Thus, when we are confronted with a new situation, we perceive and interpret it in
    whatever way makes it most consistent with our existing understanding of the world, with our
    existing paradigms. For the most part, we perform this act unconsciously. We are not aware of
    how our particular worldview influences how we interpret and judge the information we
    receive on a daily basis in the course of our work or how it affects our subsequent behavior.

    How does this relate to our professional judgment? Our expectations lead us to see things
    we are expecting to see, even if they are not actually there, and to not see things we are not
    expecting, even if they are there. Abercrombie (1960) calls our worldview our “schemata” and
    illustrates its power over our judgment process with the following figure (Figure 12.2).

    Figure 12.2 The Three Triangles

    In most cases, when we first read the phrases contained in the triangles, we do not see the
    extra words. As Abercrombie (1960) points out, “it’s as though the phrase ‘Paris in the Spring,’
    if seen often enough, leaves a kind of imprint on the mind’s eye, into which the phrase in the
    triangle must be made to fit” (p. 35). She argues that “if [one’s] schemata are not sufficiently
    ‘living and flexible,’ they hinder instead of help [one] to see” (p. 29). Our tendency is to ignore
    or reject what does not fit our expectations. Thus, similar to the way we assume the phrases in
    the triangles make sense and therefore unconsciously ignore the extra words, our professional
    judgments are based in part on our preconceptions and thus may not be appropriate for the
    situation.

    Expectations can also contribute to improving our judgment by allowing us to
    unconsciously know how best to act in a situation. When the consequences of such a decision
    are judged to be salutary, our expectations are reinforced.

    Acquiring Professional Knowledge

    Our professional training and education are key influences; they affect professional judgment in positive ways, allowing us to understand and address problems in a manner that those without the same education could not, but they also predispose us to interpret situations in particular ways. Indeed, professional education is often one of the most
    pervasive reasons for our acceptance of “tried and true” ways of approaching problems in
    professional practice. As Katz (1988) observes, “Conformity and orthodoxy, playing the game
    according to the tenets of the group to which students wish to belong, are encouraged in … all
    professional education” (p. 552). Thus, somewhat contrary to what would appear to be
    common sense, professional judgment does not necessarily improve in proportion to increases
    in professional training and education. Similarly, professional judgment does not necessarily
    improve with increased professional experience, if such experience does not challenge but only
    reinforces already accepted ideologies. Ayton (1998) makes the point that even experts in a
    profession are not immune to poor professional judgment:

    One view of human judgment is that people—including experts—not only suffer various
    forms of myopia but are somewhat oblivious of the fact.… Experts appear to have very
    little insight into their own judgment.… This oblivion in turn might plausibly be
    responsible for further problems, e.g. overconfidence … attributed, at least in part, to a
    failure to recognize the fallibility of our own judgment. (pp. 238–239)

    On the other hand, Mowen (1993) notes that our experience, if used reflectively and
    analytically to inform our decisions, can be an extremely positive factor contributing to good
    professional judgment. Indeed, he goes so far as to argue that “one cannot become a peerless
    decision maker without that well-worn coat of experience … the bumps and bruises received
    from making decisions and seeing their outcomes, both good or bad, are the hallmark of
    peerless decision makers” (p. 243).

    IMPROVING PROFESSIONAL JUDGMENT IN EVALUATION
    THROUGH REFLECTIVE PRACTICE

    Having reviewed the ways that professional judgment is woven through the fabric of
    evaluation practice and having shown how professional judgment plays a part in our decisions
    as evaluation practitioners, we can turn to discussing ways of self-consciously improving our
    professional judgment. Key to this process is becoming aware of one’s own decision-making
    processes.

    Guidelines for the Practitioner

    Epstein (1999) suggests that a useful stance for professional practice is mindfulness.
    Krasner et al. (2009) define mindfulness this way:

    The term mindfulness refers to a quality of awareness that includes the ability to pay
    attention in a particular way: on purpose, in the present moment, and nonjudgmentally.
    Mindfulness includes the capacity for lowering one’s own reactivity to challenging
    experiences; the ability to notice, observe, and experience bodily sensations, thoughts, and
    feelings even though they may be unpleasant; acting with awareness and attention (not
    being on autopilot); and focusing on experience, not on the labels or judgments applied to
    them. (p. 1285)

    Epstein and others have developed programs to help medical practitioners become more
    mindful (Krasner et al., 2009). In a study involving 70 primary care practitioners in Rochester,
    New York, participants were trained through an 8-week combination of weekly sessions and an
    all-day session to become more self-aware. The training was accompanied by opportunities to
    write brief stories to reflect on their practice and to use appreciative inquiry to identify ways
    that they had been successful in working through challenging practice situations.

    The before-versus-after results suggested that for the doctors “increases in mindfulness
    correlated with reductions in burnout and total mood disturbance. The intervention was also
    associated with increased trait emotional stability (i.e. greater resilience)” (p. 1290).

    Mindfulness is the cultivation of a capacity to observe, in a nonjudgmental way, one’s own
    physical and mental processes during and after tasks. In other words, it is the capacity for self-
    reflection that facilitates bringing to consciousness our values, assumptions, expectations,
    beliefs, and even what is tacit in our practice. Epstein (1999) suggests, “Mindfulness informs
    all types of professionally relevant knowledge, including propositional facts, personal
    experiences, processes, and know-how each of which may be tacit or explicit” (p. 833).

    Although mindfulness can be linked to religious and philosophical traditions, it is a secular
    way of approaching professional practice that offers opportunities to continue to learn and
    improve (Epstein, 2003). A mindful practitioner is one who has cultivated the art of self-
    observation (cultivating the compassionate observer). Epstein characterizes mindful practice
    this way:

    When practicing mindfully, clinicians approach their everyday tasks with critical
    curiosity. They are present in the moment, seemingly undistracted, able to listen before
    expressing an opinion, and able to be calm even if they are doing several things at once.
    These qualities are considered by many to be prerequisite for compassionate care. (p. 2)

    The objective of mindfulness is to see what is, rather than what one wants to see or even
    expects to see. Mindful self-monitoring involves several things: “access to internal and
    external data; lowered reactivity to inner experiences such as thoughts and emotions; active
    and attentive observation of sensations, images, feelings, and thoughts; curiosity; adopting a
    nonjudgmental stance; presence, [that is] acting with awareness …; openness to possibility;
    adopting more than one perspective; [and] ability to describe one’s inner experience” (Epstein,
    Siegel, & Silberman, 2008, p. 10).

    Epstein (1999) suggests that there are at least three ways of nurturing mindfulness: (1)
    mentorships with practitioners who are themselves well regarded in the profession; (2)
    reviewing one’s own work, taking a nonjudgmental stance; and (3) meditation to cultivate a
    capacity to observe one’s self.

    In order to cultivate the capacity to make sound professional judgments, it is essential to become aware of the unconscious values and other personal factors that may be influencing one's professional judgment. Only by coming to realize how much our professional judgments are influenced by these personal factors can we become more self-aware and work toward extending our conscious control over them and their impact on our judgment. As Tripp
    (1993) argues, “Without knowing who we are and why we do things, we cannot develop
    professionally” (p. 54). By increasing our understanding of the way we make professional
    judgments, we improve our ability to reach deliberate, fully thought-out decisions rather than
    simply accepting as correct the first conclusion that intuitively comes to mind.

    But how can we, as individuals, learn what factors are influencing our own personal
    professional judgment? One way is to conduct a systematic questioning of professional
    practice (Fish & Coles, 1998). Professionals should consistently reflect on what they have done
    in the course of their work and then investigate the issues that arise from this review.
    Reflection should involve articulating and defining the underlying principles and rationale
    behind our professional actions and should focus on discovering the “intuitive knowing
    implicit in the action” (Schön, 1988, p. 69).

    Tripp (1993) suggests that this process of reflection can be accomplished by selecting and
    then analyzing critical incidents that have occurred during our professional practice in the past
    (critical incident analysis). A critical incident can be any incident that occurred in the course of our practice that sticks in our mind and hence provides an opportunity to learn. What makes it
    critical is the reflection and analysis that we bring to it. Through the process of critical incident
    analysis, we can gain an increasingly better understanding of the factors that have influenced
    our professional judgments. As Fish and Coles (1998) point out,

    Any professional practitioner setting out to offer and reflect upon an autobiographical
    incident from any aspect of professional practice is, we think, likely to come sooner or
    later to recognize in it the judgments he or she made and be brought to review them. (p.
    254)

    For it is only in retrospect, in analyzing our past decisions, that we can see the complexities
    underlying what at the time may have appeared to be a straightforward, intuitive professional
    judgment. “By uncovering our judgments … and reflecting upon them,” Fish and Coles (1998)
    maintain, “we believe that it is possible to develop our judgments because we understand more
    about them and about how we as individuals come to them” (p. 285).

    Jewiss and Clark-Keefe (2007) connect reflective practice for evaluators to developing cultural competence. The Guiding Principles for Evaluators (AEA, 2004) makes cultural competence, "seeking awareness of their own culturally-based assumptions, their understanding of the worldviews of culturally-different participants and stakeholders in the evaluation," one of the core competencies for evaluators. Jewiss and Clark-Keefe believe that to become more culturally competent, evaluators would benefit from taking a constructivist stance in reflecting on their own practice:

    Constructivism has indeed helped signal evaluators’ responsibility for looking out[ward]:
    for attending to and privileging program participants’ expressions as the lens through
    which to learn about, change, and represent programs. Just as important, constructivism
    conveys evaluators’ responsibilities for looking in[ward]: for working to develop and
    maintain a critically self-reflective stance to examine personal perspectives and to monitor
    bias. (p. 337)

    Self-consciously challenging the routines of our practice, the “high hard ground” that
    Schön refers to in the quote at the outset of this chapter, is an effective way to begin to develop
    a more mindful stance. In our professional practice, each of us will have developed routines for
    addressing situations that occur frequently. As Tripp (1993) points out, although routines

    may originally have been consciously planned and practiced, they will have become
    habitual, and so unconscious, as expertise is gained over time. Indeed, our routines often
    become such well-established habits that we often cannot say why we did one thing rather
    than another, but tend to put it down to some kind of mystery such as “professional
    intuition.” (p. 17)

    Another key way to critically reflect on our professional practice and understand what
    factors influence the formation of our professional judgments is to discuss our practice with
    our colleagues. Colleagues, especially those who are removed from the situation at hand or
    under discussion, can act as “critical friends” and can help in the work of analyzing and
    critiquing our professional judgments with an eye to improving them. With different education,
    training, and experience, our professional peers often have different perspectives from us.
    Consequently, involving colleagues in the process of analyzing and critiquing our professional
    practice allows us to compare with other professionals our ways of interpreting situations and
    choosing alternatives for action. Moreover, the simple act of describing and summarizing an issue so that our colleagues can understand it can provide much insight into the professional judgments we have incorporated.

    The Range of Professional Judgment Skills

    There is considerable interest in the evaluation field in outlining competencies that define
    sound practice (Ghere, King, Stevahn, & Minnema, 2006; King et al., 2001; Stevahn, King,
    Ghere, & Minnema, 2005a). Although there are different versions of what these competencies
    are, there is little emphasis on acquiring professional judgment skills as a distinct competency.
    Efforts to establish whether practitioners themselves see judgment skills as being important
    indicate a broad range of views, reflecting some important differences as to what evaluation
    practice is and ought to be (King et al., 2001).

    If we consider linkages between types of professional judgment and the range of activities
    that comprise evaluation practice, we can see that some kinds of professional judgment are
    more important for some clusters of activities than others. But for many evaluation activities,
    several different kinds of professional judgment can be relevant. Table 12.2 summarizes
    clusters of activities that reflect the design and implementation of a typical program evaluation
    or a performance measurement system. These clusters are based on the outlines for the design and implementation of program evaluations and performance measurement systems included in Chapters 1 and 9, respectively. Although they are not comprehensive, that is, they do not capture the detailed range of activities discussed earlier in this textbook, they illustrate the ubiquity of professional judgment in all areas of our practice.

    Table 12.2 suggests that for most clusters of evaluation activities, several different types of
    professional judgment are in play. The notion that somehow we could practice by exercising
    only technical and procedural professional judgment, or confining our judgment calls to one
    part of the evaluation process, is akin to staying on Schön’s (1987) “high hard ground.”

    Ways of Improving Sound Professional Judgment Through Education and Training-
    Related Activities

    Developing sound professional judgment depends substantially on being able to develop
    and practice the craft of evaluation. Schön (1987) and Tripp (1993), among others, have
    emphasized the importance of practice as a way of cultivating sound professional judgment.
    Although textbook knowledge ("knowing what") is also an essential part of every evaluator's toolkit, a key part of evaluation curricula is providing opportunities to acquire experience.

    Table 12.2 Types of Professional Judgment That Are Relevant to Program Evaluation and
    Performance Measurement

    There are at least six complementary ways that evaluation curricula can be focused to provide opportunities for students to develop their judgment skills. Some activities are discrete, that is, relevant for developing specific skills; these are generally limited to a single course or even part of a course. Others are more generic, offering opportunities to acquire experience that spans entire evaluation processes; these are typically activities that integrate coursework with real work experiences. Table 12.3 summarizes ways that academic
    programs can inculcate professional judgment capacities in their students.

    The types of learning activities in Table 12.3 are typical of many programs that train
    evaluators, but what is important is realizing that each of these kinds of activities contributes
    directly to developing a set of skills that all practitioners need and will use in all their
    professional work. In an important way, identifying these learning activities amounts to
    making explicit what has largely been tacit in our profession.

    Table 12.3 Learning Activities to Increase Professional Judgment Capacity in Novice
    Practitioners

    Course-based activities

    Problem/puzzle solving: Develop a coding frame and test the coding categories for intercoder reliability for a sample of open-ended responses to an actual client survey that the instructor has provided (a brief illustrative sketch of this exercise follows the table).

    Case studies: Make a decision for an evaluator who finds himself or herself caught between the demands of his or her superior (who wants evaluation interpretations changed) and the project team, who see no reason to make any changes.

    Simulations: Using a scenario and role playing, negotiate the terms of reference for an evaluation.

    Course projects: Students are expected to design a practical, implementable evaluation for an actual client organization.

    Program-based activities

    Apprenticeships/internships/work terms: Students work as apprentice evaluators in organizations that design and conduct evaluations, for extended periods of time (at least 4 months).

    Conduct an actual program evaluation: Working with a client organization, develop the terms of reference for a program evaluation, conduct the evaluation, including preparation of the evaluation report, deliver the report to the client, and follow up with appropriate dissemination activities.
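    As a brief illustration of the intercoder reliability exercise in Table 12.3, the sketch below computes simple percent agreement and Cohen's kappa for two raters; the codes and responses are hypothetical, and in a real course exercise the coding frame would come from the client survey itself:

        # Hypothetical codes assigned by two raters to the same ten open-ended responses
        rater_a = ["barrier", "barrier", "benefit", "other", "benefit",
                   "barrier", "other", "benefit", "barrier", "other"]
        rater_b = ["barrier", "benefit", "benefit", "other", "benefit",
                   "barrier", "barrier", "benefit", "barrier", "other"]

        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

        # Chance agreement for Cohen's kappa: sum over codes of the two raters' marginal proportions
        codes = set(rater_a) | set(rater_b)
        expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in codes)

        kappa = (observed - expected) / (1 - expected)
        print(f"Observed agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")  # 0.80 and ~0.70

    The procedural judgment in such an exercise is deciding whether the resulting level of agreement is good enough for the purpose at hand, or whether the coding frame needs to be revised and retested.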

    Teamwork and Professional Judgment

    Evaluators and managers often work in organizational settings where teamwork is
    expected. Successful teamwork requires establishing norms and expectations that encourage
    good communication, sharing of information, and a joint commitment to the task at hand.
    Being able to select team members and foster a work environment wherein people are willing
    to trust each other, and be open and honest about their own views on issues, is conducive to
    generating information that reflects a diversity of perspectives. Even though there will still be individual biases, the views expressed are more likely to be valid than simply the perceptions
    of a dominant individual or coalition in the group. An organizational culture that emulates
    features of learning organizations (Garvin, 1993; Mayne, 2008) will tend to produce
    information that is more valid as input for making decisions and evaluating policies and
    programs.

    Managers and evaluators who have the skills and experience to be able to call on others
    and, in doing so, be reasonably confident that honest views about an issue are being offered,
    have a powerful tool to complement their own knowledge and experience and their systematic
    inquiries. Good professional judgment, therefore, is partly about selecting and rewarding
    people who themselves have demonstrated a capacity to deliver sound professional judgment.

    Evaluation as a Craft: Implications for Learning to Become an Evaluation Practitioner

    Evaluation has both a methodological aspect, where practitioners are applying tools, albeit
    with the knowledge that the tools may not fit the situation exactly, and an aesthetic aspect,
    which entails developing an appreciation for the art of design, the conduct of evaluation-related
    research, and the interpretation of results. As Berk and Rossi (1999) contend, mastering a craft
    involves more than learning the techniques and tools of the profession; it involves developing
    “intelligence, experience, perseverance, and a touch of whimsy” (p. 99), which all form part of
    professional judgment. Traditionally, persons learning a craft apprenticed themselves to more
    senior members of the trade. They learned by doing, with the guidance and experience of the
    master craftsperson.

    We have come to think that evaluation can be taught in classrooms, often in university
    settings or in professional development settings. Although these experiences are useful, they
    are no substitute for learning how evaluations are actually done. Apprenticing to a person or
    persons who are competent senior practitioners is an important part of becoming a practitioner
    of the craft. Some evaluators apprentice themselves in graduate programs, preparing master’s
    or doctoral theses with seasoned practitioners. Others work with practitioners in work
    experience settings (e.g., co-op placements). Still others join a company or organization at a
    junior level and, with time and experience, assume the role of full members of the profession.

    Apprenticeship complements what can be learned in classrooms, from textbooks and other
    such sources. Schön (1987) points out that an ideal way to learn a profession is to participate in
    practical projects wherein students design for actual situations, under the guidance of
    instructors or coaches who are themselves seasoned practitioners. Students then learn by doing
    and also have opportunities, with the guidance of coaches, to critically reflect on their practice.

    In evaluation, an example of such an opportunity might be a course that is designed as a
    hands-on workshop to learn how to design and conduct a program evaluation. Cooksy (2008)
    describes such a course at Portland State University. Students work with an instructor who
    arranges for client organizations, who want evaluations done, to participate in the course.
    Students work in teams, and teams are matched with clients. As the course progresses, each
    team is introduced to the skills that are needed to meet client and instructor expectations for
    that part of the evaluation process. There are tutorials to learn skills that are needed for the
    teams’ work, and opportunities for teams to meet as a class to share their experiences and learn
    from each other and the instructor. Clients are invited into these sessions to participate as
    stakeholders and provide the class and the instructor with relevant and timely feedback. The
    teams are expected to gather relevant lines of evidence, once their evaluation is designed, and
    analyze the evidence. Written reports for the clients are the main deliverables for the teams,
    together with oral presentations of the key results and recommendations in class, with the
    clients in attendance.

    ETHICS FOR EVALUATION PRACTICE

    The Development of Ethics for Evaluation Practice

    In this chapter we have alluded to ethical decision-making as a part of the work evaluators
    do; it is a consideration in how they exercise professional judgment. The evaluation guidelines,
    standards, and principles that have been developed for the evaluation profession all speak, in
    different ways, to ethical practice. Although evaluation practice is not guided by a set of
    professional norms that are enforceable (Rossi, Lipsey, & Freeman, 2004), ethical guidelines
    are an important reference point for evaluators. Increasingly, organizations that involve people
    (e.g., clients or employees) in research are expected to take into account the rights of their
    participants across the stages of the evaluation: as the study objectives are framed, measures
    and data collection are designed and implemented, results are interpreted, and findings are
    disseminated. In universities, human research ethics committees routinely scrutinize research
    plans to ensure that they do not violate the rights of participants. In both the United States and
    Canada, there are national policies or regulations that are intended to protect the rights of
    persons who are participants in research (Canadian Institutes of Health Research, Natural
    Sciences and Engineering Research Council of Canada, & Social Sciences and Humanities
    Research Council of Canada, 2010; U.S. Department of Health and Human Services, 2009).

    The past quarter century has witnessed major developments in the domain of evaluation
    ethics. These include publication of the original and revised versions of the Guiding Principles
    for Evaluators (AEA, 1995, 2004), and the second and third editions of the Program Evaluation
    Standards (Sanders, 1994; Yarbrough, Shulha, Hopson, & Caruthers, 2011). Two examples of
    books devoted to program evaluation ethics (Morris, 2008; Newman & Brown, 1996) as well
    as chapters on ethics in handbooks in the field (Sieber, 2009; Simons, 2006) are additional
    resources. The AEA is active in promoting evaluation ethics with the creation of the Ethical
    Challenges section of the American Journal of Evaluation (Morris, 1998) and the addition of
    an ethics training module to the website of the AEA, as described in Morris’s The Good, the
    Bad, and the Evaluator: 25 Years of AJE Ethics (Morris, 2011).

    Morris (2011) has followed the development of evaluation ethics over the past quarter
    century and notes that there are few empirical studies that focus on evaluation ethics to date.
    Additionally, he argues that “most of what we know (or think we know) about evaluation
    ethics comes from the testimonies and reflections of evaluators”—leaving out the crucial
    perspectives of other stakeholders in the evaluation process (p. 145). Textbooks on the topic of
    evaluation range in the amount of attention that is paid to evaluation ethics—in some
    textbooks, it is the first topic of discussion on which the rest of the chapters rest, as in, for
    example, Qualitative Researching by Jennifer Mason (2002). In others, the topic arises later, or
    in some cases it is left out entirely.

    Newman and Brown (1996) have undertaken an extensive study of evaluation practice to
    establish ethical principles that are important for evaluators in the roles they play. Underlying
    their work are principles that they trace to Kitchener's (1984) discussions of ethical norms. Table 12.4 summarizes ethical principles that are taken in part from Newman and Brown (1996) and from the Tri-Council Policy on the Ethical Conduct for Research Involving Humans (Canadian Institutes of Health Research et al., 2010), and shows how they correspond to
    the AEA’s Guiding Principles for Evaluators (AEA, 2004) and the Canadian Evaluation
    Society (CES) Guidelines for Ethical Conduct (CES, 2012a).

    The ethical principles summarized in Table 12.4 are not absolute and arguably are not
    complete. Each one needs to be weighed in the context of a particular evaluation project and balanced with other ethical considerations. For example, the "keeping promises" principle
    suggests that contracts, once made, are to be honored, and normally that is the case. But
    consider the following example: An evaluator makes an agreement with the executive director
    of a nonprofit agency to conduct an evaluation of a major program that is delivered by the
    agency. The contract specifies that the evaluator will deliver three interim progress reports to
    the executive director, in addition to a final report. As the evaluator begins the work, several agency managers reveal that the executive director has been redirecting money from the project budget to office furniture, equipment, and personal travel expenses—none of these being connected with the program that is being evaluated. In the first interim report, the evaluator brings these concerns to the attention of the executive director, who denies any wrongdoing and makes it clear that the interim reports are not to be shared with anyone else. The evaluator discusses the situation with colleagues in the firm and decides to inform the chair of the agency's board of directors. The evaluator has broken the contract but has called on a broader principle that speaks to the honesty and integrity of the evaluation process.

    Table 12.4 Relationships Between the American Evaluation Association Principles,
    Canadian Evaluation Society Guidelines for Ethical Conduct, and Ethical Principles for
    Evaluators

    In Appendix A, we have included a case that provides you with an opportunity to make a choice for an evaluator who works in a government department. The evaluator is in a difficult situation and has to decide what to do, balancing ethical principles and his or her own well-being as the manager of an evaluation branch in that department. There is no right answer to this case. Instead, it shows how challenging ethical decision making can be and gives you an opportunity to make a choice and build a rationale for it. The case is a good example of what is involved in exercising deliberative judgment.

    Ethical Evaluation Practice

    Ethical behavior is not so much a matter of following principles as of balancing
    competing principles. (Stake & Mabry, 1998, p. 108)

    Ethical practice in program evaluation is situation specific and can be challenging. The
    guidelines and principles discussed earlier are general. Sound ethical evaluation practice is
    circumstantial, much like sound professional judgment. Practice with ethical decision making is essential, and dialogue is a key part of learning how ethical principles apply in practice and how they are experienced subjectively.

    How do we define ethical evaluation? Several definitions of sound ethical practice exist.
    Schwandt (2007, p. 401) refers to a “minimalist view,” in which evaluators develop sensitivity,
    empathy, and respect for others, and a “maximalist view,” which includes specific guidelines
    for ethical practice including “[recording] all changes made in the originally negotiated project
    plans, and the reasons why the changes were made” (AEA, 2004). Stake and Mabry (1998)
    define ethics as “the sum of human aspiration for honor in personal endeavor, respect in
    dealings with one another, and fairness in the collective treatment of others” (p. 99).
    Schweigert (2007) defines stated ethics in program evaluation as “limits or standards to
    prohibit intentional harms and name the minimum acceptable levels of performance” (p. 394).
    Ethical problems in evaluation are often indistinct, pervasive, and difficult to resolve with
    confidence (Stake & Mabry, 1998).

    Although guidelines and professional standards can help guide the evaluator toward more
    ethical decisions, they have been criticized as lacking enforceability and failing to anticipate
    the myriad situations inevitable in practice (Bamberger, Rugh, & Mabry, 2012)—hence the
    call for cultivating sound professional judgment (through reflective practice) in applying the
    principles and guidelines.

    Like other professional judgment decisions, appropriate ethical practice occurs throughout
    the evaluation process. It usually falls to the evaluator to lead by example, ensuring that ethical
    principles are adhered to and are balanced with the goals of the stakeholders. Brandon, Smith,
    and Hwalek (2011), in discussing a successful private evaluation firm, describe the process this
    way:

    Ethical matters are not easily or simply resolved but require working out viable solutions
    that balance professional independence with client service. These are not technical matters
    that can be handed over to well-trained staff or outside contractors, but require the
    constant, vigilant attention of seasoned evaluation leaders. (p. 306)

    In contractual engagements, the evaluator has to make a decision to move forward with a
    contract or, as Smith (1998) describes it, to determine if an evaluation contract may be “bad for
    business” (p. 178). Smith goes on to recommend declining a contract if the desired work is not
    possible at an “acceptable level of quality” (Smith, 1998, p. 178). For internal evaluators,
    turning down an evaluation contract may have career implications. The case study at the end of this chapter explores this dilemma. Smith (1998) cites Mabry (1997) in describing the
    challenge of adhering to ethical principles for the evaluator:

    Evaluation is the most ethically challenging of the approaches to research inquiry because
    it is the most likely to involve hidden agendas, vendettas, and serious professional and
    personal consequences to individuals. Because of this feature, evaluators need to exercise
    extraordinary circumspection before engaging in an evaluation study. (Mabry, 1997, p. 1,
    cited in Smith, 1998, p. 180)

    Cultural Competence in Evaluation Practice

    While issues of cultural sensitivity are addressed in Chapter 5, cultural sensitivity is as
    important for quantitative evaluation as it is for qualitative evaluation. We are including
    cultural competence in this section on ethics, as cultural awareness is an important feature of
    not only development evaluation, where we explicitly work across cultures, but also virtually any evaluation conducted in our increasingly multicultural society. Evaluations in the health,
    education, or social sectors, for example, would commonly require that the evaluator have
    cultural awareness and sensitivity.

    There is a growing recognition of the importance and relevance of cultural awareness in evaluations. Schwandt (2007) notes that "the Guiding
    Principles (as well as most of the ethical guidelines of academic and professional associations
    in North America) have been developed largely against the foreground of a Western
    framework of moral understandings” (p. 400) and are often framed in terms of individual
    behaviors, largely ignoring the normative influences of social practices and institutions. The
    AEA
