5D1-9 – Summative and Formative Evaluations – see details below. Please follow instructions given and answer all questions.

Discussion Instructions: 

Based upon the Program Evaluation and Performance Measurement text, Chapters 11 & 12 readings.

1. Explain why summative evaluations are more challenging to do than formative evaluations.

    4  The Relationship Between Formative and Summative Assessment—In the Classroom and Beyond

    Suggested Citation: "4 The Relationship Between Formative and Summative Assessment—In the Classroom and Beyond." National Research Council. 2001. Classroom Assessment and the National Science Education Standards. Washington, DC: The National Academies Press. doi: 10.17226/9847.

    This chapter discusses the relationships between formative and summative
    assessments—both in the classroom and externally. In addition to teachers, site- and
    district-level administrators and decision makers are target audiences. External test
    developers also may be interested.

    Teachers inevitably are responsible for assessment that requires them to report on
    student progress to people outside their own classrooms. In addition to informing and
    supporting instruction, assessments communicate information to people at multiple
    levels within the school system, serve numerous accountability purposes, and provide
    data for placement decisions. As they juggle these varied purposes, teachers take on
    different roles. As coach and facilitator, the teacher uses formative assessment to help
    support and enhance student learning. As judge and jury, the teacher makes summative
    judgments about a student’s achievement at a specific point in time for purposes of
    placement, grading, accountability, and informing parents and future teachers about
    student performance. Often in our current system, all of the purposes and elements of
    assessment are not mutually supportive, and can even be in conflict. What seems
    effective for one purpose may not serve, or even be compatible with, another. Review
    Table 2-1 in Chapter 2.

    The previous chapters have focused primarily on the ongoing formative assessment
    teachers and students engage in on a daily basis to enhance student learning. This
    chapter briefly examines summative assessment that is usually prescribed by a local,
    district, or state agency, as it occurs regularly in the classroom and as it occurs in large-scale
    testing. The chapter specifically looks at the relationship between formative and summative
    assessment and considers how inherent tensions between the different purposes of assessment
    may be mitigated.

    HOW CAN SUMMATIVE ASSESSMENT SERVE THE
    STANDARDS?

    The range of understanding and skill called for in the Standards acknowledges the
    complexity of what it means to know, to understand, and to be able to do in science.
    Science is not solely a collection of facts, nor is it primarily a package of procedural skills.
    Content understanding includes making connections among various concepts with
    which scientists work, then using that information in specific context. Scientific
    problem-solving skills and procedural knowledge require working with ideas, data, and
    equipment in an environment conducive to investigation and experimentation. Inquiry,
    a central component of the Standards, involves asking questions, planning, designing
    and conducting experiments, analyzing and interpreting data, and drawing conclusions.

    If the Standards are to be realized, summative as well as formative assessment must
    change to encompass these goals. Assessment for a summative purpose (for example,
    grading, placement, and accountability) should provide students with the opportunity to
    demonstrate conceptual understanding of the important ideas of science, to use
    scientific tools and processes, to apply their understanding of these important ideas to
    solve new problems, and to draw on what they have learned to explain new
    phenomena, think critically, and make informed decisions (NRC, 1996). The various
    dimensions of knowing in science will require equally varied assessment strategies, as
    different types of assessments capture different aspects of learning and achievement
    (Baxter & Glaser, 1998; Baxter & Shavelson, 1994; Herman, Gearhart, & Baker, 1993;
    Ruiz-Primo & Shavelson, 1996; Shavelson, Baxter, & Pine, 1991; Shavelson & Ruiz-Primo,
    1999).

    FORMS OF SUMMATIVE ASSESSMENT IN THE
    CLASSROOM

    As teachers fulfill their different roles as assessors, tensions between formative and
    summative purposes of assessment can be significant (Bol and Strange, 1996). However,
    teachers often are in the position of being able to tailor assessments for both
    summative and formative purposes.

    Performance Assessments

    Any activity undertaken by a student provides an opportunity for an assessment of the
    student’s performance. Performance assessment often implies a more formal
    assessment of a student as he or she engages in a performance-based activity or task.
    Students are often provided with apparatus and are expected to
    design and conduct an investigation and communicate findings during a specified period
    of time. For example, students may be given the appropriate material and asked to
    investigate the preferences of sow bugs for light and dark, and dry or damp
    environments (Shavelson, Baxter, & Pine, 1991). Or, a teacher could observe while
    students design and conduct water-quality tests on a given sample of water to

    https://www.nap.edu/read/9847/chapter/6

    https://www.nap.edu/read/9847/chapter/6

    determine what variables the students measure, and what those variables indicate to
    them, and how they explain variable interaction. Observations can be complemented by
    assessing the resultant products, including data sheets, graphs, and analysis. In some
    cases, computer simulations can replace actual materials and journals in which students
    include results, interpretations, and conclusions can serve as proxies for observers
    (Shavelson, Baxter, & Pine, 1991).

    By their nature, these types of assessments differ in a variety of ways from the
    conventional types of assessments. For one, they provide students with opportunities to
    demonstrate different aspects of scientific knowledge (Baxter & Shavelson, 1994;
    Baxter, Elder, & Glaser, 1996; Ruiz-Primo & Shavelson, 1996). In the sow bug
    investigation, for example, students have the opportunity to demonstrate their ability to
    design and conduct an experiment (Baxter & Shavelson, 1994). The investigation of
    water quality highlights procedural knowledge as well as the content knowledge
    necessary to interpret tests, recognize and explain relationships, and provide analysis.
    Because of the numerous opportunities to observe students at work and examine their
    products, performance assessments can be closely aligned with curriculum and
    pedagogy.

    Portfolios

    Duschl and Gitomer (1997) have conducted classroom-based research on portfolios as
    an assessment tool to document progress and achievement and to contribute to a
    supportive learning environment. They found that many aspects of the portfolio and the
    portfolio process provided assessment opportunities that contributed to improved work
    through feedback, conversations about content and quality, and other assessment-
    relevant discussions. The collection also can serve to demonstrate progress and inform
    and support summative evaluations. The researchers document the challenges as well
    as the successes of building a learning environment around portfolio assessment. They
    suggest that the relationship between assessment and instruction requires
    reexamination so that information gathered from student discussions can be used for
    instructional purposes. For this purpose, a teacher's conception and depth of
    subject-matter knowledge need to be
    developed and cultivated so that assessment criteria derive from what is considered
    important in the scientific field that is being studied, rather than from poorly connected
    pieces of discrete information.

    Researchers at Harvard’s Graduate School of Education (Seidel, Walters, Kirby, Olff,
    Powell, Scripp, & Veenema, 1997) suggest that the following elements be included in
    any portfolio system:

    • collection of student work that demonstrates what students have learned and understand;
    • an extended time frame to allow progress and effort to be captured;
    • structure or organizing principles to help organize as well as interpret and analyze; and
    • student involvement in not only the selection of the materials but also in the reflection and assessment.

    An example of the contents of a portfolio for a science project could be as follows:

    • the brainstorming notes that lead to the project concept;
    • the work plan that the student followed as a result of a time line;
    • the student log that records successes and difficulties;
    • review of actual research results;
    • photograph of finished project; and
    • student reflection on the overall project (p. 32).

    Using Traditional Tests Differently

    Certain kinds of traditional assessments that are used for summative purposes contain
    useful information for teachers and students, but these assessments are usually too
    infrequent, come too late for action, and are too coarse-grained. Some of the activities
    in these summative assessments provide questions and procedures that might, in a
    different context, be useful for formative purposes. For example, rescheduling
    summative assessments can contribute to their usefulness to teachers and students for
    formative purposes. Tests that are given before the end of a unit can provide both
    teacher and student with useful information on which to act while there is still
    opportunity to revisit areas where students were not able to perform well.
    Opportunities for revisions on tests or any other type of assessment give students
    another chance to work through, think about, and come to understand an area they did
    not fully understand or clearly articulate the previous time. In reviewing for a test, or
    preparing for essay questions, students can begin to make connections between aspects
    of subject matter that they may not have related previously to one another. Sharing
    designs before an experiment gets under way during a peer-assessment session gives
    each student a chance to comment on and to improve his or her own investigation as
    well as those of their classmates. When performed as a whole class, reviewing helps make
    explicit to all students the key concepts to be covered.


    Selected response and written assessments, homework, and classwork all serve as
    valuable assessment activities as part of a teacher's repertoire if used appropriately.
    The form that the assessment takes should coincide with careful consideration of the
    intended purpose. Again, the use of the data generated by and through the
    assessment is important so that it feeds back into the teaching and learning.

    As shown in Table 4-1, McTighe and Ferrara (1998) provide a useful framework for
    selecting assessment approaches and methods. The table accents the range of
    common assessments available to teachers. Although their framework serves all
    subject-matter areas, the wide variety of assessments and assessment-rich activities
    could be applicable for assessments in a science classroom.

    TABLE 4-1 Framework of Assessment Approaches and Methods

    HOW MIGHT WE ASSESS STUDENT LEARNING IN THE CLASSROOM?

    Selected-Response Format
    • Multiple-choice
    • True-false
    • Matching
    • Enhanced multiple choice

    Constructed-Response Format

    Brief Constructed Response
    • Fill in the blank (word(s), phrase(s))
    • Short answer (sentence(s), paragraphs)
    • Label a diagram
    • "Show your work"
    • Visual representation

    Performance-Based Assessment

    Product
    • Essay
    • Research paper
    • Story/play
    • Poem
    • Portfolio
    • Art exhibit
    • Science project
    • Model
    • Video/audiotape
    • Spreadsheet
    • Lab report

    Performance
    • Oral presentation
    • Dance/movement
    • Science lab demonstration
    • Athletic skill performance
    • Debate
    • Musical recital
    • Keyboarding
    • Teach-a-lesson
    • Dramatic reading
    • Enactment

    Process-Focused Assessment
    • Oral questioning
    • Observation ("kid watching")
    • Interview
    • Conference
    • Process description
    • "Think aloud"
    • Learning log

    SOURCE: McTighe and Ferrara (1998).

    GRADING AND COMMUNICATING ACHIEVEMENT

    One common summative purpose of assessment facing most teachers is the need to
    communicate information on student progress and achievement to parents, school
    board officials, members of the community, and college admissions officers. In addition to
    scores from externally mandated tests, teacher-assigned grades traditionally serve this
    purpose.

    A discussion in Chapter 2 defends the use of descriptive, criterion-based feedback as
    opposed to numerical scoring (8/10) or grades (B). A study cited (Butler, 1987) showed
    that the students who demonstrated the greatest improvement were the ones who
    received detailed comments (only) on their returned pieces of work. However, grading
    and similar practices are the reality for the majority of teachers. How might grading be
    used to best support student learning?


    Though they are the primary currency of our current summative-assessment system,
    grades typically carry little meaning because they reduce a great deal of information to a
    single letter. Furthermore, there is often little agreement about the difference between an A
    and a B, a B and a C, or a D and an F, or about what is required for a particular letter
    grade (Loyd & Loyd, 1997).

    Grades may symbolize achievement, yet they often incorporate other factors as well,
    such as work habits, which may or may not be related to level of achievement. They are
    often used to reward or motivate students to display certain behaviors (Loyd & Loyd,
    1997). Without a clear understanding of the basis for the grade, a single letter often will
    provide little information on how work can be improved. As noted previously, grades
    will only be as meaningful as the underlying criteria and the quality of assessment that
    produced them.

    A single-letter grade or the score on an end-of-unit test does not make student progress
    explicit, nor does either provide students and teachers with information that might
    further their understandings or inform their learning. A “C” on a project or on a report
    card indicates that a student did not do exemplary work, but beyond that, there is
    plenty of room for interpretation and ambiguity. Did the student show thorough
    content understanding but fall short in presentation? Did the student not convey clear
    ideas? Or did the student not provide adequate explanation of why a particular
    phenomenon occurred? Without any information about these other dimensions, a
    single-letter grade does not provide specific guidance about how work can be improved.


    Surrounded by ambiguity, a letter grade without discussion and an understanding of
    what it constitutes does little to provide useful information to the student, or even give
    an indication of the level of performance. Thus, when teachers establish criteria for
    individual assessments and make them explicit to students, they also need to do so for
    grading criteria. The criteria also should be clear to those who must interpret them,
    such as parents and future teachers, and incorporate priorities and goals important to
    science as a school subject area.

    Careful documentation can allow formative assessments to be used for summative
    purposes. The manner in which summative assessments are reported helps determine
    whether they can be easily translated for formative purposes—especially by the
    student, teacher, and parents. In the vignette in Chapter 3, a middle school science
    teacher confers with students as they engage in an ongoing investigation. She keeps
    written notes of these exchanges as well as from the observations she makes of the
    students at work. When it is time for this teacher to assign student grades for the
    project, she can refer to these notes to provide concrete examples as evidence. Using
    ongoing assessments to inform summative evaluations is particularly important for
    inquiry-based work, which cannot be captured in most one-time tests. Many teachers
    give students the opportunity to make test corrections or provide other means for
    students to demonstrate that they understand material previously not mastered.
    Documenting these types of changes over time will show progress and can be used as
    evidence of understanding for summative purposes.

    Teachers face the challenge of overcoming the common obstacle of assigning classroom
    grades and points in such a way that they drive classroom activity to the detriment of
    other, often more informative and useful, types of assessment that foster standards-
    based goals. Grading practices can be modified, however, so that they adhere to
    acceptable standards for summative assessments and at the same time convey
    important information that can be used to improve work in a way that is relatively easy
    to read and understand. Mark Wilson and colleagues at the University of California,
    Berkeley, have devised one such plan for the assessment system designed for the SEPUP
    (Science Education for Public Understanding Program) middle school science curriculum
    (Wilson & Sloane, 1999; Roberts, Wilson, & Draney, 1997; Wilson & Draney, 1997).

    The SEPUP assessment system serves as an example of possible alternatives to the
    traditional, current single-letter grade scheme. As shown in Table 4-2, the SEPUP
    assessment blueprint indicates that a single assessment will not capture all of the skills
    and content desired in any particular curricular unit. However, teachers do not need to
    be concerned about getting all the assessment information they need at a single time
    with any single assessment.


    TABLE 4-2 SEPUP Assessment Blueprint

    Teacher's Guide, Part 1: Water Usage and Safety (Sections A and B)

    The blueprint maps the unit's twelve activities (1 Drinking-Water Quality; 2 Exploring
    Sensory Thresholds; 3 Concentration; 4 Mapping Death; 5 John Snow; 6 Contaminated Water;
    7 Chlorination; 8 Chicken Little, Chicken Big; 9 Lethal Toxicity; 10 Risk Comparison;
    11 Injection Problem; 12 Peru Story) against five assessment variables and their elements:

    • Designing and Conducting Investigations: Designing Investigation; Selecting and Recording Procedures; Organizing Data; Analyzing and Interpreting Data
    • Evidence and Tradeoffs: Using Evidence; Using Evidence to Make Tradeoffs
    • Understanding Concepts: Recognizing Relevant Content; Applying Relevant Content
    • Communicating Scientific Information: Organization; Technical Aspects
    • Group Interaction: Time Management; Role Performance/Participation; Shared Opportunity

    For each activity, the blueprint flags the elements assessed at that point, along with any
    content concepts assessed (★), such as Measurement and Scale. For example, Activity 5
    (John Snow) assesses Using Evidence; Activity 6 (Contaminated Water), Designing
    Investigation; Activity 9 (Lethal Toxicity), Organizing Data; Activity 10 (Risk Comparison),
    Analyzing and Interpreting Data; and Activity 12 (Peru Story), Organizing Data and
    Analyzing and Interpreting Data.

    SOURCE: Science Education for Public Understanding Program (1995).

    By using the same scale for the entire unit, the SEPUP assessment system allows
    teachers to obtain evidence about the students’ progress. Without the context or
    criteria that the SEPUP scoring guide (Table 4-3) provides, a score of “2” on an
    assessment could be interpreted as inadequate, even though the scale is 0-4. However, as the
    scoring guide indicates, a "2" in this example represents a worthwhile step on the road
    to earning a score of "4". In practice, the specific areas that need additional attention
    are conveyed in the scoring guide, so a student who receives a "2" as feedback
    knows what he or she needs to do to improve the piece of work. The scoring guide also can
    provide summative assessments at any given point.

    TABLE 4-3 SEPUP Scoring Guide

    Scoring Guide: Evidence and Tradeoffs (ET) Variable

    Using Evidence: Response uses objective reason(s) based on relevant evidence to argue for or against a choice.
    Using Evidence to Make Tradeoffs: Response recognizes multiple perspectives of issue and explains each perspective using objective reasons, supported by evidence, in order to make a choice.

    Score 4
    • Using Evidence: Response accomplishes Level 3 AND goes beyond in some significant way, e.g., questioning or justifying the source, validity, and/or quantity of the evidence.
    • Using Evidence to Make Tradeoffs: Accomplishes Level 3 AND goes beyond in some significant way, e.g., suggesting additional evidence beyond the activity that would influence choices in specific ways, OR questioning the source, validity, and/or quantity of the evidence and explaining how it influences choice.

    Score 3
    • Using Evidence: Provides major objective reasons AND supports each with relevant and accurate evidence.
    • Using Evidence to Make Tradeoffs: Uses relevant and accurate evidence to weigh the advantages and disadvantages of multiple options, and makes a choice supported by the evidence.

    Score 2
    • Using Evidence: Provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
    • Using Evidence to Make Tradeoffs: States at least two options AND provides some objective reasons using some relevant evidence, BUT reasons or choices are incomplete and/or part of the evidence is missing; OR only one complete and accurate perspective has been provided.

    Score 1
    • Using Evidence: Provides only subjective reasons (opinions) for choice; uses unsupported statements; OR uses inaccurate or irrelevant evidence from the activity.
    • Using Evidence to Make Tradeoffs: States at least one perspective BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

    Score 0
    • Using Evidence: Missing, illegible, or offers no reasons AND no evidence to support choice made.
    • Using Evidence to Make Tradeoffs: Missing, illegible, or completely lacks reasons and evidence.

    Score X
    • Student had no opportunity to respond.

    SOURCE: Science Education for Public Understanding Program (1995).

    The SEPUP assessment system provides one such example, but teachers can employ
    other forms of assessment that capture progress as well as achievement at a specific
    point in time. Keyed to standards and goals, such systems can be strong on meaning for
    teachers and students and still convey information to different levels of the system in a
    relatively straightforward and plausible manner that is readily understood. Teachers can
    use the standards or goals to help guide their own classroom assessments and
    observations and also to help them support work or learning in a particular area where
    sufficient achievement has not been met.

    Devising a criterion-based scale to record progress and make summative judgments
    poses difficulties of its own. Settling on the level of specificity involved in subdividing a
    domain, so that the separate elements together represent the whole, is a crucial and
    demanding task (Wiliam, 1996). This becomes an issue whether considering
    performance assessments or ongoing assessment data and needs to be articulated in
    advance of when students engage in activities (Quellmalz, 1991; Gipps, 1994).


    Specific guidelines for the construction and selection of test items are not offered in this
    document. Test design and selection are certainly important aspects of a teacher’s
    assessment responsibility and can be informed by the guidelines and discussions
    presented in this document (see also Chapter 3). Item-writing recommendations and
    other test specifications are topics of a substantial body of existing literature (for
    practitioner-relevant discussions, see Airasian, 1991; Cangelosi, 1990; Cunningham,
    1997; Doran, Chan, and Tamir, 1998; Gallagher, 1998; Gronlund, 1998; Stiggins, 2001).
    Appropriate design, selection, interpretation and use of tests and assessment data were
    emphasized in the joint effort of the American Federation of Teachers (AFT), the
    National Council on Measurement in Education (NCME), and the National Education
    Association (NEA) to specify pedagogical skills necessary for effective assessment (AFT,
    NCME, & NEA, 1990).

    VALIDITY AND RELIABILITY IN SUMMATIVE ASSESSMENTS

    Regardless of what form a summative assessment takes or when it occurs, teachers
    need to keep in mind validity and reliability, two important technical elements of both
    classroom-level assessments and external or large-scale assessments (AERA, APA, &
    NCME, 1999). These concepts also are discussed in Chapter 3.

    Validity and reliability are judged using different criteria, although the two are related.
    Validity has different dimensions, including content (does the assessment measure the intended content
    area?), construct (does the assessment measure the intended construct or ability?) and
    instructional (was the material on the assessment taught?). It is important to consider
    the uses of assessment and the appropriateness of resulting inferences and actions as
    well (Messick, 1989). Reliability has to do with generalizing across tasks (is this a
    generalizable measure of student performance?) and can involve variability in
    performance across tasks, between settings, as well as in the consistency of scoring or
    grading.

    What these terms mean operationally varies slightly for the kinds of assessments that
    occur each day in the classroom and in the form of externally designed exams. For
    example, the ongoing classroom assessment that relies on immediate feedback provides
    different types of opportunities for follow-up when compared to a typical testing
    situation where follow-up questioning for clarification or to ensure proper
    interpretation on the part of the respondent usually is not possible (Wiliam & Black,
    1996). The dynamic nature of day-to-day teaching affords teachers opportunities to
    make numerous assessments, take relevant action, and amend decisions and
    evaluations if necessary and with time. Wiliam and Black (1996) write that "[in] the fluid action
    of the classroom, where rapid feedback is important, optimum validity depends upon
    the self-correcting nature of the consequent action" (pp. 539-540).

    With a single-test score, especially from a test administered at the end of the school
    year, a teacher does not have the opportunity to follow a response with another
    question, either to determine if the previous question had been misinterpreted or to
    probe misunderstandings for diagnostic reasons. With a standardized test, where on-
    the-spot interpretation of the student’s response by the teacher and follow-up action is
    impossible, the context in which responses are developed is ignored. Measures of
    validity are decontextualized, depending almost entirely on the collection and nature of
    the actual test items. More important, all users of assessment data (teachers,
    administrators and policy makers) need to be aware of what claims they make about a
    student’s understanding and the consequential action based on any one assessment.

    Relying on a variety of assessments, in both form and what is being assessed, will go a
    long way toward ensuring validity. Much of what is called for in the standards, such as
    inquiry, cannot be assessed in many of the multiple-choice, short-answer, or even two-
    hour performance assessments that are currently employed. Reliability, though more
    straightforward, may be more difficult to ensure than validity. On external tests, even
    when scorers are carefully calibrated (or scoring is done by a machine), variations in a
    student's performance from day to day, or from question to question, pose threats to reliability.

    Viable systems that command the same confidence as the current summative system
    but are free of many of the inherent conflicts and contradictions are necessary to make
    decisions psychometrically sound. The confidence that any assessment can demand will
    depend, in large part, on both reliability and validity (Baron, 1991; Black, 1997). As Box
    4-1 indicates, there are some basic questions to be asked of both teacher-made and
    published assessments. Teachers need to consider the technical aspect of the
    summative assessments they use in the classroom. They also should look for evidence
    that disproves earlier judgments and make necessary accommodations. Likewise, they
    should be looking for further assessment data that could help them to support their
    students ‘ learning.

    LARGE-SCALE, EXTERNAL ASSESSMENT—THE CURRENT
    SYSTEM AND NEED FOR REFORM

    Large-scale assessments at the district, state and national levels are conducted for
    different purposes: to formulate policy, monitor the effects of policies and enforce
    them, make comparisons, monitor progress towards goals, evaluate programs, and for
    accountability purposes (NRC, 1996). As a key element in the success of education-improvement
    systems, accountability has become one of the most important issues in educational
    policy today (NRC, 1999b). Accountability is a means by which policy makers at the state
    and district levels—and parents and taxpayers—monitor the performance of students
    and schools.

    BOX 4-1 Applying Validity and Reliability Concerns to Classroom Teaching

    • What am I interested in measuring? Does this assessment capture that?
    • Have the students experienced this material as part of their curriculum?
    • What can I say about a student's understandings based on the information generated from the assessment? Are those claims legitimate?
    • Are the consequences and actions that result from this performance justifiable?
    • Am I making assumptions or inferences about other knowledge, skills or abilities that this assessment did not directly assess?
    • Are there aspects of this assessment not relevant to what I am interested in assessing that may be influencing performance?
    • Have I graded consistently?
    • What could be unintended consequences associated with this assessment?

    Most states use external assessments for accountability purposes (Bernauer & Cress,
    1997). These standardized, externally designed tests are either norm-referenced tests (NRTs),
    criterion-referenced tests (CRTs), or some combination of the two. A “standardized” test
    is one that is to be carried out in the same way for all individuals tested, scored in the
    same way, and scores interpreted in the same way (Gipps, 1994). NRTs are developed
    by test publishers to measure student performance against the norm. Results from
    these tests describe what students can do relative to other students and are used for
    comparing groups of students. The norm is a rank, the 50th percentile. For national
    tests, the norm is constructed by testing students all over the country. (It also is the
    score that test-makers call “at grade level” [Bracey, 1998]). On a norm-referenced test,
    half of all students in the norm sample will score at or above the 50th percentile, or
    above grade level, and half will score below the 50th percentile, or below grade level.
    These tests compare students to other students, rather than measuring student mastery
    of content standards or curricular objectives (Burger, 1998).


    Increasingly, states and districts are moving towards criterion-referenced tests (CRTs),
    usually developed by state departments of education and districts, which compare
    student performance to a set of established criteria (for example, district, state or
    national standards) rather than comparing them to the performance of other students.
    CRTs allow all students who have acquired skills and knowledge to receive high scores
    (Burger, 1998).

    A well-designed and appropriately used standardized test can generate data that can be
    used to inform different parts of the system and to assess a range of understandings
    and skills. Currently, however, such tests generally concentrate on the knowledge most amenable to
    scoring in multiple-choice and short-answer formats. These formats most easily capture
    factual knowledge (Shavelson & Ruiz-Primo, 1999) and are the most inexpensive in
    terms of resources necessary for test development, administration, and scoring (Hardy,
    1995). Although many of the current standardized tests are intended to assess student
    achievement, too often they are used only to stimulate competition among students,
    teachers or schools, or to make other judgments that are not justified by student scores
    on such tests.

    The lack of coherence among the different levels of assessment within the system often
    leaves teachers, schools and districts torn between mandated external testing policies
    and practices, and the responsibilities of teachers to use assessment in the service of
    learning. These large-scale tests, which often command greater esteem than classroom
    assessments, create a tension for formative and summative assessment and a challenge
    for exemplary classroom practice (Black, 1997; Frederiksen, 1984; Smith & Rottenberg, 1991). Teachers are left
    facing serious dilemmas.

    BUILDING AN EXTERNAL STANDARDS-BASED
    SUMMATIVE ASSESSMENT SYSTEM

    The foundations for a standards-based summative assessment system are assessments
    that are systemically valid: aligned to the recommendations of the national standards,
    grounded in the educational system, and congruent with the educational goals for
    students. Alignment of assessment to curriculum and standards ensures that the
    assessments match the learning goals embodied in the standards and enables the
    students, parents, teachers and the public to determine student progress toward the
    standards (NRC, 1999b).

    Assessment and accountability systems cannot be isolated from their purpose: to
    improve the quality of instruction and ultimately the learning of students (NRC, 1999b).
    They also must be well understood by the interested parties and based on standards
    acceptable to all (Stecher & Herman, 1997).

    An effective system will provide students with the opportunity to demonstrate their
    understanding and skills in a variety of ways and formats. The form the assessment
    takes must follow its purpose. Multiple-choice tests are easy to grade and can quickly
    assess some forms of science-content knowledge. Other areas may be better tapped
    through open-ended questions or performance-based assessments, where students
    demonstrate their abilities and understandings such as with an actual hands-on
    investigation (Shavelson & Ruiz-Primo, 1999). Assessing inquiry skills may require
    extended investigations and can be documented through portfolios of work as it
    unfolds.

    Educators need to be cautious, deliberate, and aware of the strong influence of high-
    stakes, external tests on classroom practice, specifically the emphasis of instruction and its
    assessment (Frederiksen, 1984; Gifford & O’Connor, 1992; Goodlad, 1984; Popham,
    1992; Resnick & Resnick, 1991; Rothman, 1995; Shepard, 1995; Smith et al., 1992; Wolf
    et al., 1991) when considering, implementing, and evaluating large-scale assessment
    systems. No assessment form is immune from negative influences. Messick (1994)
    concludes

    It is not just that some aspects of multiple-choice testing may have adverse
    consequences for teaching and learning, but that some aspects of all testing, even
    performance testing, may have adverse as well as beneficial educational consequences.
    And if both positive and negative aspects, whether intended or unintended, are not
    meaningfully addressed in the validation process, then the concept of validity loses its
    force as a social value. (p. 22)


    Even well-designed assessments will need to be augmented by other assessments. Most
    criterion-referenced tests are multiple-choice or short-answer tests. Although they may
    align closely to a standards-based system, other assessment components, such as
    performance measures, where students demonstrate their understanding by doing
    something educationally desirable, also are necessary to measure standards-based
    outcomes. A long-term inquiry that constitutes a genuine scientific investigation, for
    example, cannot be captured in a single test or even in a performance assessment
    allotted for a single class period.

    LEARNING FROM CURRENT REFORM

    Beyond a Single Test


    Several states and districts are making strides in expanding external testing beyond
    traditional notions of testing to include more teacher involvement and to better align
    classroom and external summative assessments, so as to better support teaching and
    learning. The state of Vermont (VT) was one pioneer. The state sought to develop an
    assessment system that served accountability purposes as well as generated data that
    would inform instruction and improve individual achievement (Mills, 1996). The system
    had three components: Students and teachers gathered work for portfolios, teachers
    submitted a “best piece” sample for each student, and students took a standardized
    test. Scoring rubrics and exemplars were used by groups of teachers around the state to
    score the portfolios and student work samples. Despite the different pieces in place
    (which also included professional development), the VT experiment produced mixed results
    and is still evolving. The scoring of the portfolios and student work samples lacked
    adequate reliability (in the technical sense) to be used for accountability purposes
    (Koretz, Stecher, Klein, & McCaffrey, 1994). Many teachers saw a positive impact on
    student learning, due in part to the focus and feedback on specific pieces of student
    work that teachers provided to students during the collection and preparation process
    (Asp, 1998), but they also acknowledged the additional time needed for portfolio preparation
    (Koretz, Stecher, Klein, McCaffrey, & Deibert, 1993).

    Kentucky (KY) is another state that made changes to its system and faced similar
    challenges. The portfolio and performance-based assessment system in that state also
    did not achieve consistently reliable scores (Hambleton et al., 1995). Both states
    demonstrate that consistency across scores for samples of work requires training and
    time. Research on performance assessments in large-scale systems shows that
    variability in student performance across tasks also can be significant (Baron, 1991).


    Involving Teachers

    Teachers who are privy to student discussions and able to make ongoing observations
    are in the best position to assess many of the educational goals, including areas such as
    inquiry. Therefore, teachers need to become more involved in summative assessments
    for purposes beyond reporting on student progress and achievement to others in the
    system. Practices within the United States and in other countries provide us with
    possibilities of how to better tap into teachers' summative assessments to augment or
    complement external exams.

    In Queensland, Australia, for example, the state moved away from its state-wide
    examination and placed the certification of students in the hands of teachers (Butler,
    1995). Teachers meet in regional groups to exchange results and assessment methods
    with colleagues. They justify their assessments and deliberate with colleagues from
    other schools to help ensure that the different schools are holding their students to
    comparable standards and levels of achievement. Additional examples of the role of
    teacher judgment in external assessment in other countries are discussed in the next
    chapter.

    Accountability efforts that exclude teachers from assessing their students’ work are
    often justified on grounds that teachers could undermine the reliability by injecting
    undue subjectivity and personal bias. This argument has some support based on results
    of efforts in VT and KY. However, as the teachers in Queensland engage in deliberation
    and discussion (a procedure called moderation), steps are taken that mitigate the
    possible loss of reliability. To help ensure consistency among different teachers in
    moderation sessions, teachers exchange samples of student work and discuss their
    respective assessments of the work. These deliberations, in which the standards for
    judging quality work are discussed, have proved effective in developing consistency in
    scoring by the teachers. Moderation also serves as an effective form of professional
    development because teachers sharpen their perspectives about the quality of student
    work that might be expected, as is illustrated in the next chapter. In the United States,
    teacher-scoring committees for Advanced Placement exams follow this model.

    Moderation is expensive and not always practical. There are other ways to maintain
    reliability and involve teachers in summative assessments that serve accountability and
    reporting purposes. In Connecticut, the science portion of the state assessment system
    involves teachers selecting from a list of tasks and using them in conjunction with their
    own curriculum and contexts. The state provides the teachers with exemplars and
    criteria, and the teachers are responsible for scoring their own student work. Teachers can
    use the criteria in other areas of their curriculum
    throughout the year.

    Douglas County Schools in Colorado rely heavily on teacher judgments for accountability
    purposes (Asp, 1998). Teachers collect a variety of evidence of student progress towards
    district standards. Teacher-developed materials that include samples of work,
    evaluation criteria, and possible assessment tasks guide them. The county uses these
    judgments to communicate to parents and district-level monitors and decision makers.

    Examples and research can help inform large-scale assessment models so that systems
    produce useful data that inform the necessary purposes while not creating obstacles for
    quality teaching and learning. Policy and decision makers must look to and learn from
    reforms underway. After examining large scale testing practices, Asp (1998) offers keys
    to building compatibility between classroom and large-scale summative assessment
    systems. His recommendations include the following:


    • make large-scale assessment more accessible to classroom teachers;
    • embed large-scale assessment in the instructional program of the classroom in a meaningful way; and
    • use multiple measures at several levels within the system to assess individual student achievement (pp. 41-42).

    When data on individual achievement are not the aim (as is often the case when
    accountability concerns focus on an aggregate level, such as the school, district, or
    region), sampling procedures that test fewer students, and test them less frequently,
    can be options.

    The assessment systems and features discussed above are not flawless, yet there is
    much to learn from the experiences of these reforms. Current strategies and systems
    need to be modified without compromising the goal of a more aligned system. Changes
    of any kind will require support from the system and resources for designing and
    evaluating options, informing and training teachers and administrators, and educating
    the public.

    KEY POINTS

    • Tensions between formative and summative assessment do exist, but there are
    ways in which these tensions can be reduced. Some productive steps for reducing
    tensions include relying on a variety of assessment forms and measures and
    considering the purposes for the assessment and the subsequent form the
    assessment and its reporting takes.

    • Test results should be used appropriately, not to make other judgments that are
    not justified by student scores on such tests.


    • A testing program should include criterion-referenced exams and reflect the
    quality and depth of curriculum advocated by the standards.

    • For accountability purposes, external testing should not be designed in such a
    way as to be detrimental to learning, such as by limiting curricular and teaching
    activities.

    • A teacher’s position in the classroom provides opportunities to gain useful
    information for use in both formative and summative assessments. These teacher
    assessments need to be developed and tapped to best utilize the information
    that only teachers possess to augment even the best designed paper-and-pencil
    or performance-based test.

    • System-level changes are needed to reduce tensions between formative and
    summative assessments.


    failing to anticipate the myriad situations inevitable in practice (Bamberger, Rugh, & Mabry,
    2012)—hence the call for cultivating sound professional judgment (through reflective practice)
    in applying the principles and guidelines.

    Like other professional judgment decisions, appropriate ethical practice occurs throughout
    the evaluation process. It usually falls to the evaluator to lead by example, ensuring that ethical
    principles are adhered to and are balanced with the goals of the stakeholders. Brandon, Smith,
    and Hwalek (2011), in discussing a successful private evaluation firm, describe the process this
    way:

    Ethical matters are not easily or simply resolved but require working out viable solutions
    that balance professional independence with client service. These are not technical matters
    that can be handed over to well-trained staff or outside contractors, but require the
    constant, vigilant attention of seasoned evaluation leaders. (p. 306)

    In contractual engagements, the evaluator has to make a decision to move forward with a
    contract or, as Smith (1998) describes it, to determine if an evaluation contract may be “bad for
    business” (p. 178). Smith goes on to recommend declining a contract if the desired work is not
    possible at an “acceptable level of quality” (Smith, 1998, p. 178). For internal evaluators,
    turning down an evaluation contract may have career implications. The case study at the end of
    this chapter explores this dilemma. Smith (1998) cites Mabry (1997) in describing the
    challenge of adhering to ethical principles for the evaluator:

    Evaluation is the most ethically challenging of the approaches to research inquiry because
    it is the most likely to involve hidden agendas, vendettas, and serious professional and
    personal consequences to individuals. Because of this feature, evaluators need to exercise
    extraordinary circumspection before engaging in an evaluation study. (Mabry, 1997, p. 1,
    cited in Smith, 1998, p. 180)

    Cultural Competence in Evaluation Practice

    Although issues of cultural sensitivity are addressed in Chapter 5, cultural sensitivity is as
    important for quantitative evaluation as it is for qualitative evaluation. We include cultural
    competence in this section on ethics because cultural awareness is an important feature not
    only of development evaluation, where we explicitly work across cultures, but also of virtually
    any evaluation conducted in our increasingly multicultural society. Evaluations in the health,
    education, or social sectors, for example, commonly require that the evaluator have cultural
    awareness and sensitivity.

    There is growing recognition of the importance and relevance of cultural awareness in
    evaluation. Schwandt (2007) notes that “the Guiding
    Principles (as well as most of the ethical guidelines of academic and professional associations
    in North America) have been developed largely against the foreground of a Western
    framework of moral understandings” (p. 400) and are often framed in terms of individual
    behaviors, largely ignoring the normative influences of social practices and institutions. The
    AEA Guiding Principles for Evaluators include the following caveat to address the cross-
    cultural limitations of their principles:

    These principles were developed in the context of Western cultures, particularly the
    United States, and so may reflect the experiences of that context. The relevance of these
    principles may vary across other cultures, and across subcultures within the United States.
    (AEA, 2004)

    Schwandt (2007) notes that “in the Guiding Principles for evaluators, cultural competence
    is one dimension of a general principle (‘competence’) concerned with the idea of fitness or
    aptitude for the practice of evaluation” (p. 401); however, he challenges the adequacy of this
    dimension, asking “Can we reasonably argue for something like a cross cultural professional
    ethic for evaluators, and if so, what norms would it reflect?” (p. 401). Schwandt (2007) notes
    that the focus on cultural competence in evaluation has developed out of concern for “attending
    to the needs and interests of an increasingly diverse, multicultural society and the challenges of
    ensuring social equity in access to and quality of human service programs” (p. 401). In an
    imagined dialogue between two evaluators, Schwandt and Dahler-Larsen (2006) discuss
    resistance to evaluation and the practical implications for performing evaluation in
    communities. They conclude that “perhaps evaluators should listen more carefully and respond
    more prudently to voices in communities that are hesitant or skeptical about evaluation […]
    Evaluation is not only about goals and criteria, but about forms of life” (p. 504).

    THE PROSPECTS FOR AN EVALUATION PROFESSION

    In this chapter, we have emphasized the importance of acknowledging and cultivating sound
    professional judgment as part of what we believe is required to move evaluation in the
    direction of becoming a profession. In some professions, medicine being a good example, there
    is growing recognition that important parts of sound practice are tacitly learned, and that
    competent practitioners need to cultivate the capacity to reflect on their experience to develop
    an understanding of their own subjectivity and how their values, beliefs, expectations, and
    feelings affect the ways that they make decisions in their practice.

    Some evaluation associations, the Canadian Evaluation Society (CES) being the most
    prominent example, have embarked on a professionalization path that has included identifying
    core competencies for evaluators and offering members the option of applying for a
    professional designation. Knowledge (formal education), experience, and professional
    reputation are all included in the assessment process conducted by an independent panel, and
    successful applicants receive a Credentialed Evaluator designation (CES, 2012b).

    Other evaluation associations, with their emphasis on guidelines and standards for
    evaluation practice, are also embarking on a process that moves the field toward becoming
    more professional. Efforts are being made to identify core competencies (King et al., 2001),
    and discussions have outlined some of the core epistemological and methodological issues that
    would need to be addressed if evaluation is to move forward as a profession (Bickman, 1997;
    Patton, 2008). The evaluation field continues to evolve as academic and practice-based
    contributors offer new ideas, critique each other’s ideas, and develop new approaches.
    Becoming more like a profession will mean balancing the norms of professional practice (core
    body of knowledge, ethical standards, and perhaps even entry to practice requirements) with
    the ferment that continues to drive the whole field and makes it both challenging and exciting.

    Although many evaluators have made contributions that suggest we are moving toward
    making evaluation into a profession, we are not there yet. Picciotto (2011) concludes the
    following:

    Evaluation is not a profession today but could be in the process of becoming one. Much
    remains to be done to trigger the latent energies of evaluators, promote their expertise,
    protect the integrity of their practice and forge effective alliances with well wishers in
    government, the private sector and the civil society. It will take strong and shrewd
    leadership within the evaluation associations to strike the right balance between autonomy
    and responsiveness, quality and inclusion, influence and accountability. (p. 179)

    SUMMARY

    Program evaluation is partly about learning methods and how to apply them. But, because most
    evaluation settings offer only roughly appropriate opportunities to apply tools that are often
    designed for social science research settings, it is essential that evaluators learn the craft of
    working with square pegs for round holes. Evaluators and managers have in common the fact
    that they are often trained in settings that idealize the applications of the tools that they learn.
    When they enter the world of practice, they must adapt what they have learned. What works is
    determined by the context and their experiences. Experience becomes the foundation not only
    of when and how to apply tools but, more important, the essential basis for interpreting the
    information that is gathered in a given situation.

    Evaluators have the comparative luxury of time and resources to examine a program or
    policy that managers usually have to judge in situ, as it were. Even for evaluators, there are
    rarely sufficient resources to apply the tools that would yield the highest quality of data. That is
    a limitation that circumscribes what we do, but does not mean that we should stop asking
    whether and how programs work.

    This chapter emphasizes the central role played by professional judgment in the practice of
    professions, including evaluation, and the importance of cultivating sound professional
    judgment. Michael Patton, through his alter ego Halcolm, puts it this way (Patton, 2008, p.
    501):

    Forget “judge not and ye shall not be judged.”
    The evaluator’s mantra: Judge often and well so that you get better at it.

    —Halcolm

    It follows that professional programs, courses in universities, and textbooks should
    underscore for students the importance of developing and continuously improving their
    professional judgment skills, as opposed to focusing only on learning methods, facts, and
    exemplars. Practicing the craft of evaluation necessitates developing knowledge and skills that
    are tacit. These are learned through experience, refined through reflective practice, and applied
    along with the technical and rational knowledge that typically is conveyed in books and in
    classrooms. Practitioners in a profession

    begin to recognize that practice is much more messy than they were led to believe [in
    school], and worse, they see this as their own fault—they cannot have studied sufficiently
    well during their initial training.… This is not true. The fault, if there is one, lies in the
    lack of support they receive in understanding and coping with the inevitably messy world
    of practice. (Fish & Coles, 1998, p. 13)

    Fish and Coles continue thus:

    Learning to practice in a profession is an open capacity, cannot be mastered and goes on
    being refined forever. Arguably there is a major onus on those who teach courses of
    preparation for professional practice to demonstrate this and to reveal in their practice its
    implications. (p. 43)

    The ubiquity of different kinds of judgment in evaluation practice suggests that as a nascent
    profession we need to do at least three things. First, we need to fully acknowledge the
    importance of professional judgment and the role it plays in the diverse ways we practice
    evaluation. Second, we need to understand how our professional judgments are made—the
    factors that condition our own judgments. Reflective practice is critical to reaping the potential
    from experience. Third, we need to work toward self-consciously improving the ways we
    incorporate, into the education and training of evaluators, opportunities for current and future
    practitioners to improve their professional judgments. Embracing professional judgment is an
    important step toward more mature and self-reflective evaluation practice.

    Ethical evaluation practice is a part of cultivating sound judgment. Although national and
    international evaluation associations have developed principles and guidelines that include
    ethical practice, these guidelines are general and are not enforceable. Individual evaluators
    need to learn, through their reflective practice, how to navigate the ethical tradeoffs in
    situations, understanding that appropriate ethical practice will weigh the risks and benefits for
    those involved.

    DISCUSSION QUESTIONS

    1. Take a position for or against the following proposition and develop a strong one-page argument that supports your position. This is the proposition: “Be it resolved that experiments, where program and control groups are randomly assigned, are the Gold Standard in evaluating the effectiveness of programs.”

    2. What do evaluators and program managers have in common? What differences can you think of as well?

    3. What is tacit knowledge? How does it differ from public knowledge?

    4. In this chapter, we said that learning to ride a bicycle is partly tacit. For those who want to challenge this statement, try to describe learning how to ride a bicycle so that a person who has never before ridden a bicycle could get on one and ride it right away.

    5. What is mindfulness, and how can it be used to develop sound professional judgment?

    6. Why is teamwork an asset for persons who want to develop sound professional judgment?

    7. What do you think would be required to make evaluation more professional, that is, have the characteristics of a profession?

    APPENDIX

    Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator

    Fiona Barnes did not feel well as the deputy commissioner’s office door closed behind her.
    She walked back to her office wondering why bad news seems to come on Friday afternoons.

    Sitting at her desk, she went over the events of the past several days and the decision that lay
    ahead of her. This was clearly the most difficult situation that she had encountered since her
    promotion to the position of director of evaluation in the Department of Human Services.

    Fiona’s predicament had begun the day before, when the new commissioner, Fran Atkin,
    had called a meeting with Fiona and the deputy commissioner. The governor was in a difficult
    position: In his recent election campaign, he had made potentially conflicting campaign
    promises. He had promised to reduce taxes and had also promised to maintain existing health
    and social programs, while balancing the state budget.

    The week before, a loud and lengthy meeting of the commissioners in the state government
    had resulted in a course of action intended to resolve the issue of conflicting election promises.
    Fran Atkin had been persuaded by the governor that she should meet with the senior staff in
    her department, and after the meeting, a major evaluation of the department’s programs would
    be announced. The evaluation would provide the governor with some post-election breathing
    space. But the evaluation results were predetermined—they would be used to justify program
    cuts. In sum, a “compassionate” but substantial reduction in the department’s social programs
    would be made to ensure the department’s contribution to a balanced budget.

    As the new commissioner, Fran Atkin relied on her deputy commissioner, Elinor Ames.
    Elinor had been one of several deputies to continue on under the new administration and had
    been heavily committed to developing and implementing key programs in the department,
    under the previous administration. Her success in doing that had been a principal reason why
    she had been promoted to deputy commissioner.

    On Wednesday, the day before the meeting with Fiona, Fran Atkin had met with Elinor
    Ames to explain the decision reached by the governor, downplaying the contentiousness of the
    discussion. Fran had acknowledged some discomfort with her position, but she believed her
    department now had a mandate. Proceeding with it was in the public’s interest.

    Elinor was upset with the governor’s decision. She had fought hard over the years to build
    the programs in question. Now she was being told to dismantle her legacy—programs she
    believed in that made up a considerable part of her budget and person-year allocations.

    In her meeting with Fiona on Friday afternoon, Elinor had filled Fiona in on the political
    rationale for the decision to cut human service programs. She also made clear what Fiona had
    suspected when they had met with the commissioner earlier that week—the outcomes of the
    evaluation were predetermined: They would show that key programs where substantial
    resources were tied up were not effective and would be used to justify cuts to the department’s
    programs.

    Fiona was upset with the commissioner’s intended use of her branch. Elinor, watching
    Fiona’s reactions closely, had expressed some regret over the situation. After some hesitation,
    she suggested that she and Fiona could work on the evaluation together, “to ensure that it
    meets our needs and is done according to our standards.” After pausing once more, Elinor
    added, “Of course, Fiona, if you do not feel that the branch has the capabilities needed to
    undertake this project, we can contract it out. I know some good people in this area.”

    Fiona was shown to the door and asked to think about it over the weekend.

    Fiona Barnes took pride in her growing reputation as a competent and serious director of a
    good evaluation shop. Her people did good work that was viewed as being honest, and they
    prided themselves on being able to handle any work that came their way. Elinor Ames had
    appointed Fiona to the job, and now this.

    Your Task

    Analyze this case and offer a resolution to Fiona’s dilemma. Should Fiona undertake the
    evaluation project? Should she agree to have the work contracted out? Why?

    In responding to this case, consider the issues on two levels: (1) look at the issues taking
    into account Fiona’s personal situation and the “benefits and costs” of the options available to
    her and (2) look at the issues from an organizational standpoint, again weighing the “benefits
    and the costs.” Ultimately, you will have to decide how to weigh the benefits and costs from
    both Fiona’s and the department’s standpoints.

    REFERENCES

    Abercrombie, M. L. J. (1960). The anatomy of judgment: An investigation into the processes of
    perception and reasoning. New York: Basic Books.

    Altschuld, J. (1999). The certification of evaluators: Highlights from a report submitted to the
    Board of Directors of the American Evaluation Association. American Journal of
    Evaluation, 20(3), 481–493.

    American Evaluation Association. (1995). Guiding principles for evaluators. New Directions
    for Program Evaluation, 66, 19–26.

    American Evaluation Association. (2004). Guiding principles for evaluators. Retrieved from
    http://www.eval.org/Publications/GuidingPrinciples.asp

    Ayton, P. (1998). How bad is human judgment? In G. Wright & P. Goodwin (Eds.),
    Forecasting with judgement (pp. 237–267). Chichester, West Sussex, UK: John Wiley.

    Bamberger, M., Rugh, J., & Mabry, L. (2012). Real world evaluation: Working under budget,
    time, data, and political constraints (2nd ed.). Thousand Oaks, CA: Sage.

    Basilevsky, A., & Hum, D. (1984). Experimental social programs and analytic methods: An
    evaluation of the U.S. income maintenance projects. Orlando, FL: Academic Press.

    Berk, R. A., & Rossi, P. H. (1999). Thinking about program evaluation (2nd ed.). Thousand
    Oaks, CA: Sage.

    Bickman, L. (1997). Evaluating evaluation: Where do we go from here? Evaluation Practice,
    18(1), 1–16.

    Brandon, P., Smith, N., & Hwalek, M. (2011). Aspects of successful evaluation practice at an
    established private evaluation firm. American Journal of Evaluation, 32(2), 295–307.

    Campbell Collaboration. (2010). About us. Retrieved from
    http://www.campbellcollaboration.org/about_us/index.php

    Campbell, D. T. (1991). Methods for the experimenting society. Evaluation Practice, 12(3),
    223–260.

    Canadian Evaluation Society. (2012a). CES guidelines for ethical conduct. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=5&ss=4&_lang=en

    Canadian Evaluation Society. (2012b). Program evaluation standards. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=6&ss=10&_lang=EN

    Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of
    Canada, & Social Sciences and Humanities Research Council of Canada. (2010). Tri-council
    policy statement: Ethical conduct for research involving humans, December 2010.
    Retrieved from http://www.pre.ethics.gc.ca/pdf/eng/tcps2/TCPS_2_FINAL_Web

    Chen, H. T., Donaldson, S. I., & Mark, M. M. (2011). Validity frameworks for outcome
    evaluation. New Directions for Evaluation, 2011(130), 5–16.

    Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for
    field settings. Chicago, IL: Rand McNally.

    Cook, T. D., Scriven, M., Coryn, C. L., & Evergreen, S. D. (2010). Contemporary thinking
    about causation in evaluation: A dialogue with Tom Cook and Michael Scriven. American
    Journal of Evaluation, 31(1), 105–117.

    Cooksy, L. J. (2008). Challenges and opportunities in experiential learning. American Journal
    of Evaluation, 29(3), 340–342.

    Cronbach, L. J. (1980). Toward reform of program evaluation (1st ed.). San Francisco, CA:
    Jossey-Bass.

    Cronbach, L. J. (1982). Designing evaluations of educational and social programs (1st ed.).
    San Francisco, CA: Jossey-Bass.

    Epstein, R. M. (1999). Mindful practice. Journal of the American Medical Association, 282(9),
    833–839.

    Epstein, R. M. (2003). Mindful practice in action (I): Technical competence, evidence-based
    medicine, and relationship-centered care. Families, Systems & Health, 21(1), 1–9.

    Epstein, R. M., Siegel, D. J., & Silberman, J. (2008). Self-monitoring in clinical practice: A
    challenge for medical educators. Journal of Continuing Education in the Health
    Professions, 28(1), 5–13.

    Eraut, M. (1994). Developing professional knowledge and competence. Washington, DC:
    Falmer Press.

    Fish, D., & Coles, C. (1998). Developing professional judgement in health care: Learning
    through the critical appreciation of practice. Boston, MA: Butterworth-Heinemann.

    Ford, R., Gyarmati, D., Foley, K., Tattrie, D., & Jimenez, L. (2003). Can work incentives pay
    for themselves? Final report on the Self-Sufficiency Project for welfare applicants. Ottawa,
    Ontario, Canada: Social Research and Demonstration Corporation.

    Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

    Ghere, G., King, J. A., Stevahn, L., & Minnema, J. (2006). A professional development unit
    for reflecting on program evaluator competencies. American Journal of Evaluation, 27(1),
    108–123.

    Gibbins, M., & Mason, A. K. (1988). Professional judgment in financial reporting. Toronto,
    Ontario, Canada: Canadian Institute of Chartered Accountants.

    Gustafson, P. (2003). How random must random assignment be in random assignment
    experiments? Ottawa, Ontario, Canada: Social Research and Demonstration Corporation.

    Henry, G. T., & Mark, M. M. (2003). Toward an agenda for research on evaluation. New
    Directions for Evaluation, 97, 69–80.

    Higgins, J., & Green, S. (Eds.). (2011). Cochrane handbook for systematic reviews of
    interventions: Version 5.0.2 (updated March 2011). The Cochrane Collaboration 2011.
    Retrieved from www.cochrane-handbook.org

    House, E. R., & Howe, K. R. (1999). Values in evaluation and social research. Thousand
    Oaks, CA: Sage.

    Human Resources Development Canada. (1998). Quasi-experimental evaluation (Publication
    No. SP-AH053E-01–98). Ottawa, Ontario, Canada: Evaluation and Data Development
    Branch.

    Hurteau, M., Houle, S., & Mongiat, S. (2009). How legitimate and justified are judgments in
    program evaluation? Evaluation, 15(3), 307–319.

    Jewiss, J., & Clark-Keefe, K. (2007). On a personal note: Practical pedagogical activities to
    foster the development of “reflective practitioners.” American Journal of Evaluation, 28(3), 334–347.

    Katz, J. (1988). Why doctors don’t disclose uncertainty. In J. Dowie & A. S. Elstein (Eds.),
    Professional judgment: A reader in clinical decision making (pp. 544–565). Cambridge,
    MA: Cambridge University Press.

    Kelling, G. L. (1974a). The Kansas City preventive patrol experiment: A summary report.
    Washington, DC: Police Foundation.

    Kelling, G. L. (1974b). The Kansas City preventive patrol experiment: A technical report.
    Washington, DC: Police Foundation.

    King, J. A., Stevahn, L., Ghere, G., & Minnema, J. (2001). Toward a taxonomy of essential
    evaluator competencies. American Journal of Evaluation, 22(2), 229–247.

    Kitchener, K. S. (1984). Intuition, critical evaluation and ethical principles: The foundation for
    ethical decisions in counseling psychology. The Counseling Psychologist, 12(3), 43–55.

    Krasner, M. S., Epstein, R. M., Beckman, H., Suchman, A. L., Chapman, B., Mooney, C. J., &
    Quill, T. E. (2009). Association of an educational program in mindful communication with
    burnout, empathy, and attitudes among primary care physicians. Journal of the American
    Medical Association, 302(12), 1284–1293.

    Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago
    Press.

    Kundin, D. M. (2010). A conceptual framework for how evaluators make everyday practice
    decisions. American Journal of Evaluation, 31(3), 347–362.

    Larson, R. C. (1982). Critiquing critiques: Another word on the Kansas City preventive patrol
    experiment. Evaluation Review, 6(2), 285–293.

    Levin, H. M., & McEwan, P. J. (Eds.). (2001). Cost-effectiveness analysis: Methods and
    applications (2nd ed.). Thousand Oaks, CA: Sage.

    Mabry, L. (1997). Ethical landmines in program evaluation. In R. E. Stake (Chair), Grounds
    for turning down a handsome evaluation contract. Symposium conducted at the meeting of
    the AERA, Chicago, IL.

    Mark, M. M., Henry, G. T., & Julnes, G. (2000). Evaluation: An integrated framework for
    understanding, guiding, and improving policies and programs (1st ed.). San Francisco,
    CA: Jossey-Bass.

    Mason, J. (2002). Qualitative researching (2nd ed.). Thousand Oaks, CA: Sage.

    Mayne, J. (2008). Building an evaluative culture for effective evaluation and results
    management. Retrieved from http://www.cgiar-ilac.org/files/publications/briefs/ILAC_Brief20_Evaluative_Culture

    Modarresi, S., Newman, D. L., & Abolafia, M. Y. (2001). Academic evaluators versus
    practitioners: Alternative experiences of professionalism. Evaluation and Program
    Planning, 24(1), 1–11.

    Morris, M. (1998). Ethical challenges. American Journal of Evaluation, 19(3), 381–382.

    Morris, M. (Ed.). (2008). Evaluation ethics for best practice: Cases and commentaries. New
    York: Guilford Press.

    Morris, M. (2011). The good, the bad, and the evaluator: 25 years of AJE ethics. American
    Journal of Evaluation, 32(1), 134–151.

    Mowen, J. C. (1993). Judgment calls: High-stakes decisions in a risky world. New York:
    Simon & Schuster.

    Newman, D. L., & Brown, R. D. (1996). Applied ethics for program evaluation. Thousand
    Oaks, CA: Sage.

    No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425.

    Office of Management and Budget. (2004). What constitutes strong evidence of a program’s
    effectiveness? Retrieved from http://www.whitehouse.gov/omb/part/2004_program_eval

    Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand
    Oaks, CA: Sage.

    Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

    Pawson, R., & Tilley, N. (1997). Realistic evaluation. Thousand Oaks, CA: Sage.

    Picciotto, R. (2011). The logic of evaluation professionalism. Evaluation, 17(2), 165–180.

    Polanyi, M. (1958). Personal knowledge: Towards a post-critical philosophy. Chicago, IL:
    University of Chicago Press.

    Polanyi, M., & Grene, M. G. (1969). Knowing and being: Essays. Chicago, IL: University of
    Chicago Press.

    Rossi, P. H., Lipsey, M. W., & Freeman, H. E. (2004). Evaluation: A systematic approach.
    Thousand Oaks, CA: Sage.

    Sanders, J. R. (1994). Publisher description for the program evaluation standards: How to
    assess evaluations of educational programs. Retrieved from
    http://catdir.loc.gov/catdir/enhancements/fy0655/94001178-d.html

    Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching
    and learning in the professions (1st ed.). San Francisco, CA: Jossey-Bass.

    Schön, D. A. (1988). From technical rationality to reflection-in-action. In J. Dowie & A. S.
    Elstein (Eds.), Professional judgment: A reader in clinical decision making (pp. 60–77).
    New York: Cambridge University Press.

    Schwandt, T. A. (2000). Three epistemological stances for qualitative enquiry. In N. K. Denzin
    & Y. S. Lincoln (Eds.), Handbook of qualitative research (2nd ed., pp. 189–213).
    Thousand Oaks, CA: Sage.

    Schwandt, T. A. (2007). Expanding the conversation on evaluation ethics. Evaluation and
    Program Planning, 30(4), 400–403.

    Schwandt, T. A. (2008). The relevance of practical knowledge traditions to evaluation practice.
    In N. L. Smith & P. R. Brandon (Eds.), Fundamental issues in evaluation (pp. 29–40). New
    York: Guilford Press.

    Schwandt, T. A., & Dahler-Larsen, P. (2006). When evaluation meets the “rough ground” in
    communities. Evaluation, 12(4), 496–505.

    Schweigert, F. J. (2007). The priority of justice: A framework approach to ethics in program
    evaluation. Evaluation and Program Planning, 30(4), 394–399.

    Scriven, M. (1994). The final synthesis. Evaluation Practice, 15(3), 367–382.
    Scriven, M. (2004). Causation. Unpublished manuscript, University of Auckland, Auckland,
    New Zealand.

    Scriven, M. (2008). A summative evaluation of RCT methodology & an alternative approach
    to causal research. Journal of Multidisciplinary Evaluation, 5(9), 11–24.

    Seiber, J. (2009). Planning ethically responsible research. In L. Bickman & D. Rog (Eds.), The
    Sage handbook of applied social research methods (2nd ed., pp. 106–142). Thousand
    Oaks, CA: Sage.

    Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
    designs for generalized causal inference. Boston, MA: Houghton Mifflin.

    Simons, H. (2006). Ethics in evaluation. In I. Shaw, J. Greene, & M. M. Mark (Eds.), The Sage
    handbook of evaluation (pp. 243–265). Thousand Oaks, CA: Sage.

    Skolits, G. J., Morrow, J. A., & Burr, E. M. (2009). Reconceptualizing evaluator roles.
    American Journal of Evaluation, 30(3), 275–295.

    Smith, M. L. (1994). Qualitative plus/versus quantitative: The last word. New Directions for
    Program Evaluation, 61, 37–44.

    Smith, N. L. (1998). Professional reasons for declining an evaluation contract. American
    Journal of Evaluation, 19(2), 177–190.

    Smith, N. L. (2007). Empowerment evaluation as evaluation ideology. American Journal of
    Evaluation, 28(2), 169–178.

    CHAPTER 11

    PROGRAM EVALUATION AND PROGRAM
    MANAGEMENT

    Joining Theory and Practice

    Introduction
    Can Management and Evaluation Be Joined? An Overview of the Issues
    Evaluators and Managers as Partners in Evaluation

    Building an Evaluative Culture in Organizations: An Expanded Role for Evaluators

    Creating Ongoing Streams of Evaluative Knowledge

    Obstacles to Building and Sustaining an Evaluative Culture

    Manager Involvement in Evaluations: Limits and Opportunities

    Intended Evaluation Uses and Managerial Involvement

    Evaluating for Accountability

    Evaluating for Program Improvement

    Manager Bias in Evaluations: Limits to Manager Involvement

    Striving for Objectivity in Program Evaluations

    Can Program Evaluators Be Objective?

    Looking for a Defensible Definition of Objectivity

    A Natural Science Definition of Objectivity

    Implications for Evaluation Practice

    Criteria for High-Quality Evaluations: The Varying Views of Evaluation Associations
    Summary
    Discussion Questions
    References

    INTRODUCTION

    Chapter 11 explores the relationship between program managers and evaluators, and how that
    relationship is influenced by evaluation purposes and organizational contexts. We begin by
    reviewing Wildavsky’s (1979) seminal work on this relationship. Because Wildavsky was
    skeptical that organizations could be self-evaluating, we then look at organizational cultures
    that support evaluation. Given that many evaluators do their work as participants in the
    organizations in which they do evaluations, we describe the ways in which internal evaluations
    can occur in such organizations. An evaluative culture is a special case where evaluative
    thinking and practices have been suffused throughout the organization, and we discuss the
    prospects for realizing such cultures in contemporary public sector organizations. We then turn
    to the limitations and opportunities for how managers can be involved in evaluations and how
    the differences between formative and summative evaluations offer incentives that can bias
    manager involvement in evaluations of their own programs.

    The last part of Chapter 11 looks at the question of whether program evaluations can be
    objective. We discuss what it would take for evaluations to be objective and whether it is
    possible to claim that evaluations are objective. Finally, based on the guidelines and principles
    offered by evaluation associations, we offer some general guidance for evaluators in
    positioning themselves as practitioners able to make claims for doing high-quality evaluations.

    Program evaluation is intended to be a flexible and situation-specific means of answering
    program questions, testing hypotheses, and understanding program processes and outcomes.
    Evaluations can focus on a broad range of issues, from needs to program resources to
    program outcomes. They generally are intended to yield information that reduces the level of
    uncertainty about the issues that prompted the evaluation.

    As we learned in Chapter 1, program evaluations can be formative; that is, they can aim at
    producing findings, conclusions, and recommendations that are intended to improve the
    program. Formative evaluations are typically done with a view to offering program and
    organizational managers information that they can use to improve the efficiency and/or the
    effectiveness of an existing program. Generally, questions about the continuation of support for
    the program itself are not part of formative evaluation agendas.

    Program evaluations can also be summative—that is, intended to render judgments on the
    value of the program. Summative evaluations are more directly linked to accountability
    requirements that are often built into the program management cycle, which was introduced in
    Chapter 1. Summative evaluations can focus on issues that are similar to those included in
    formative evaluations (e.g., program effectiveness), but the intention is to produce information
    that can be used to make decisions about the program’s future, such as whether to reallocate
    resources elsewhere or whether to terminate the program. Typically, summative program
    evaluations entail some kind of external reporting that may include government central
    agencies as a key stakeholder. In Canada, for example, most program evaluations conducted by
    federal departments and agencies are made public, and Treasury Board, as the principal central
    agency responsible for expenditure management across the government, is a recipient of the
    evaluations.

    The purposes of an evaluation affect the relationships between evaluators, managers, and
    other stakeholders. Generally, managers are more likely to view formative evaluations as
    “friendly” evaluations and, hence, are more likely to be willing to cooperate with the
    evaluators. They have an incentive to do so because the evaluation is intended to assist them
    without raising questions that could result in major changes, including reductions to or even
    the elimination of a program.

    Summative evaluations are generally viewed quite differently. Program managers face
    different incentives in providing information or even participating in such an evaluation.
    Notwithstanding the efforts by some organizations to build evaluative cultures (Mayne, 2008;
    Mayne & Rist, 2006) wherein managers are encouraged to treat mistakes and perhaps even
    program-related failures as opportunities to learn, the future of their programs may be at stake.

    From an evaluator’s standpoint, then, the experience of conducting a formative evaluation
    can be quite different from conducting a summative evaluation. The type of evaluation can also
    affect the evaluator’s relationship with the program manager(s). Typically, program evaluators
    depend on program managers to provide key information and to arrange access to people, data
    sources, and other sources of evaluation information (Chelimsky, 2008). Securing and
    sustaining cooperation is affected by the purposes of the evaluation—managerial reluctance or
    strategies to “put the best foot forward” might well be expected where the stakes include the
    future of the program itself. As Norris (2005) says, “Faced with high-stakes targets and the
    paraphernalia of the testing and performance measurement that goes with them, practitioner
    and organizations sometimes choose to dissemble” (p. 585).

    CAN MANAGEMENT AND EVALUATION BE JOINED? AN
    OVERVIEW OF THE ISSUES

    How does program evaluation, as a part of the performance management cycle, relate to
    program management? Are program evaluation and program management compatible roles in
    public and nonprofit organizations?

    Wildavsky (1979), in his seminal book Speaking Truth to Power, introduced his discussion
    of management and evaluation this way:

    Why don’t organizations evaluate their own activities? Why don’t they seem to manifest
    rudimentary self-awareness? How long can people work in organizations without
    discovering their objectives or determining how well they are carried out? I started out
    thinking that it was bad for organizations not to evaluate, and I ended up wondering why
    they ever do it. Evaluation and organization, it turns out, are somewhat contradictory. (p.
    212)

    When he questioned joining together management and evaluation, Wildavsky chiefly had
    in mind summative evaluations where the future of programs, and possibly reallocation of
    funding, would be an issue. Historically, the federal government of Canada, for example,
    offered this definition of program evaluation in its first publication on the purposes and scope
    of the then new evaluation function in federal departments and agencies:

    Program evaluation in federal departments and agencies should involve the systematic
    gathering of verifiable information on a program and demonstrable evidence on its results
    and cost-effectiveness. Its purpose should be to periodically produce credible, timely,
    useful and objective findings on programs appropriate for resource allocation, program
    improvement and accountability. (Office of the Comptroller General [OCG] of Canada,
    1981, p. 3)

    Central agencies still maintain this chiefly summative focus on evaluations. In its statement
    of the purposes of program evaluation, the Treasury Board of Canada Secretariat (2009), the
    central agency responsible for the government-wide evaluation function, offers a view of
    evaluation that is substantially the same as that offered nearly three decades earlier. In its
    “Policy on Evaluation,” the principal rationale for evaluation is that “evaluation provides
    Canadians, Parliamentarians, Ministers, central agencies and deputy heads an evidence-based,
    neutral assessment of the value for money, i.e. relevance and performance, of federal
    government programs” (p. 3). The main thrust of the policy is clearly a summative view of
    evaluation that focuses on “resource allocation and reallocation” and “providing objective
    information to help Ministers understand how new spending proposals fit with existing
    programs, identifying synergies and avoid wasteful duplication” (p. 3).

    If evaluations are to be used to reallocate resources as well as to improve programs,
    organizations must have the capacity to participate in and respond to evaluations that have both
    formative and summative facets. This suggests an image of organizations that are amenable to
    rethinking existing commitments—managers would need to balance attachment to the stability
    of their programs with attachment to the evidence-based evaluation process. The
    rational/technical view of organizations (de Lancer Julnes & Holzer, 2001), which we
    discussed in Chapter 9, suggests that within such organizations, decision making would be
    based on evidence, managers and workers would behave in ways that do not undermine a
    results-focused culture, and summative evaluations would be welcomed as a part of regular
    management processes.

    Wildavsky’s (1979) view of organizations as settings where “speaking truth to power” is a
    challenge is similar to the political/cultural image of organizations offered by de Lancer Julnes
    and Holzer (2001). Wildavsky views the respective roles of evaluators and managers as painted
    in contrasting colors. Evaluators are described as people who question assumptions, who are
    skeptical, who are detached, who view organizations/programs as means and not ends in
    themselves, whose currency is evidence, and who ultimately focus on the social needs that the
    program serves rather than on organizational needs.

    By contrast, in Wildavsky’s view, organizational/program managers can be characterized
    as people who are committed to their programs, who are advocates for what they do and what
    their programs do, and who do not want to see their commitments curtailed or their resources
    diminished.

    How, then, even for formative evaluation capacity, do organizations resolve the question of
    who has the power and authority to make decisions, who constructs evaluation information,
    and who controls its interpretation and distribution? In one scenario, evaluators could be a
    central part of program and policy design, implementation, and assessment of results. They
    may suggest that new programs or policies should be implemented as experiments or quasi-
    experiments (perhaps as pilot programs), with clear objectives, well-constructed comparisons,
    baseline measurements, and sufficient control over the implementation process, to ensure the
    internal and construct validities of the evaluation process. This view of trying out new
    programs was the essence of Donald Campbell’s image of the experimenting society (Watson,
    1986).
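
    To make the experimental option concrete, the sketch below shows what random assignment for a hypothetical pilot program might look like in Python. The applicant list, the baseline scores, and the 50/50 split are illustrative assumptions; the snippet does not describe Campbell's work or any study cited in this chapter.

    import random
    import statistics

    def randomly_assign(units, seed=7):
        """Randomly split pilot-program applicants into program and control groups.
        The 50/50 split and fixed seed are illustrative choices."""
        rng = random.Random(seed)
        shuffled = list(units)
        rng.shuffle(shuffled)
        midpoint = len(shuffled) // 2
        return shuffled[:midpoint], shuffled[midpoint:]

    # Hypothetical applicants with hypothetical baseline scores (not from any cited study).
    baseline = {f"applicant_{i:02d}": random.Random(i).uniform(40, 60) for i in range(1, 21)}
    program_group, control_group = randomly_assign(list(baseline))

    # With random assignment, baseline means should be roughly similar across groups,
    # which is what makes later comparisons of outcomes interpretable.
    print("Program baseline mean:", round(statistics.mean(baseline[a] for a in program_group), 1))
    print("Control baseline mean:", round(statistics.mean(baseline[a] for a in control_group), 1))

    As the surrounding discussion notes, the difficulty is rarely the mechanics of assignment; it is the political and ethical cost of withholding the program from the control group.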

    Managers, however, may prefer to implement programs to more immediately meet
    organizational and client needs. Objectives may, in that case, be stated in ways that facilitate
    flexible interpretations of what was important to convey, depending on the audience. Managers
    would want program objectives to be able to withstand the scrutiny of stakeholders with
    different values and expectations. As we might anticipate, experimentation can create political
    problems: What does the organization tell prospective clients who want the program but cannot
    get access to it because they are members of a “control group”? What do executives tell the
    elected officials when client groups question either the lack of flexibility in the service (to
    maintain construct validity of the evaluation) or its lack of availability (to increase internal
    validity of the evaluation)?

    Where the evaluation function is internal, it may be much more challenging to experiment
    with a program before its launch. An example of the dilemmas and controversies involved in
    designing and implementing a randomized controlled trial in a setting where there is an acute
    social need is the New York City Department of Homeless Services’ 2-year experiment to
    evaluate the Homebase program. The Homebase program is intended to provide housing-
    related services to families that are at risk of homelessness or are already homeless. The evaluation began
    in the fall of 2010, and for the ensuing 2 years, those in the control group (200 families) were
    excluded from accessing the bundle of services that constitute the Homebase program (New
    York City Department of Homeless Services, 2010).

    The social dilemmas inherent in this kind of situation raise the question: Where should the
    evaluation function be located in organizations, or even governments? One possible solution is
    to make program evaluation an external function. Thus, evaluators would be a part of an
    agency that is not under the administrative control of the organization’s managers. This
    solution, however, does face challenges as well. In British Columbia, for example, the
    Secretary of Treasury Board at one point outlined a plan for creating a centralized evaluation
    capacity in the government (Wolff, 1979). This approach would have been similar to the way
    external auditors function in governments. Treasury Board analysts housed in that central
    agency would have conducted evaluations of line department programs with a view to
    preparing reports for Treasury Board managers. The plan was never implemented, however, in
    part because the line departments strongly objected to the creation of a central evaluation unit
    that would not be accountable to line department executives. In fact, at that point, some
    departments were developing in-house evaluation units, which were intended to perform
    functions that executives argued would be duplicated by any centralized evaluation unit.

    Centralized evaluation functions have certainly been developed for summative evaluation
    purposes. Under the Bush administration in the United States, the Office of Management and
    Budget (OMB), an executive agency responsible for budget preparation and expenditure
    management, was responsible for assessing all federal programs on a cyclical basis using the
    Program Assessment Rating Tool (PART) process (U.S. OMB, 2002, 2004). From 2002
    through 2009, OMB assessed about 20% of all programs every year. These PART reviews
    were, in effect, summative evaluations that relied in part on existing program evaluation and
    performance measurement information, but offered an independent assessment conducted by
    OMB analysts.

    EVALUATORS AND MANAGERS AS PARTNERS IN
    EVALUATION

    Wildavsky’s (1979) view of self-evaluating organizations was quite pessimistic and reflected a
    view that saw evaluation as a form of research best done by those who had some distance from
    the programs being evaluated. He saw evaluation and management as being quite separate,
    with distinct roles for managers and evaluators. But in the past several decades, there has been
    a broad movement in the field of evaluation to find ways of knitting evaluation and
    management together. Instead of seeing evaluation as an activity that challenges management,
    this contrasting view assumes that evaluators can work with managers to define and execute
    evaluations that combine the best of what both parties bring to that relationship. Utilization-
    focused evaluation (Patton, 2008), for example, is premised on producing evaluations that
    managers and other stakeholders will use—and ensuring use means developing a working
    relationship between evaluators and managers. Managers are expected to be participants in the
    evaluation process. Patton (1997) characterizes the role of the evaluator this way:

    The evaluator facilitates judgment and decision-making by intended users rather than
    acting as a distant, independent judge. Since no evaluation can be value-free, utilization-
    focused evaluation answers the question of whose values will frame the evaluation by
    working with clearly identified, primary intended users who have responsibility to apply
    evaluation findings and implement recommendations. In essence, I shall argue, evaluation
    use is too important to be left to evaluators. (p. 21)

    Utilization-focused evaluation (Patton, 2008) and participatory evaluation (Cousins &
    Whitmore, 1998) are among a growing number of approaches that emphasize the importance
    of evaluators engaging with, and in some respects becoming a part of, the organizations in
    which they do their work. The traditional view of evaluators as experts who conduct arms-
    length “evaluation studies” of programs, and offer their written reports to stakeholders at the
    end of the process, is giving way to the view that evaluators should not stand aside from
    organizations but instead should get involved (Mayne & Rist, 2006).

    Cousins and Whitmore (1998) suggest that the evaluation team and the practitioner team
    both need to be committed to improving the program. The evaluation process—identifying the
    key questions, design of the evaluation, collection of the data, and reporting of the results—can
    be shared between the evaluators and the practitioners (see also King, Cousins, & Whitmore,
    2007).

    Love (1991) elaborated an approach that is premised on the assumption that evaluators can
    be a part of organizations (i.e., paid employees who report to organizational executives) and
    can contribute to improving the efficiency and effectiveness of programs. For Love, “internal
    evaluation is the process of using staff members who have the responsibility for evaluating
    programs or problems of direct relevance to an organization’s managers” (p. 2).

    Internal evaluation units are common and are the norm in some governments. In the federal
    government of Canada, for example, each department or agency typically has its own
    evaluation unit, which reports to the administrative head of that organization. These units are
    expected to work with departmental executives and managers to identify evaluation priorities
    and undertake program evaluations. Although external consultants are often hired to conduct
    parts of such projects, they are managed by internal evaluators.

    Love (1991) outlines six stages in the development of internal evaluation capacity,
    beginning with ad hoc program evaluations and ending with strategically focused cost–benefit
    analyses:

    • Ad hoc evaluations focused on single programs
    • Regular evaluations that describe program processes and results
    • Program goal setting, measurement of program outcomes, program monitoring, adjustment
    • Evaluations of program effectiveness, improving organizational performance
    • Evaluations of technical efficiency and cost-effectiveness
    • Strategic evaluations including cost–benefit analyses

    These six stages can be seen as a gradual transformation of the intentions of evaluations
    from formative to summative purposes. Love (1991) highlights the importance of an internal
    working environment where organizational members are encouraged to participate in
    evaluations, and where trust of evaluators and their commitment to the organization is part of
    the culture. What Love is suggesting in his approach is that it is possible to transform an
    organizational culture so that it embraces evaluation as a strategic asset. We will consider the
    prospects for building evaluative cultures in the next section of this chapter.

    Building an Evaluative Culture in Organizations: An Expanded Role for Evaluators

    Mayne (2008) and Patton (2011) are among the advocates for a broader role for evaluation
    and evaluators in organizations. Like Love (1991), their view is that it is possible to build
    organizational capacity to perform evaluation that ultimately transforms the organization.
    Mayne (2008) has outlined the key features of an evaluative culture. We have summarized his
    main points in Table 11.1.

    For Mayne (2008) and Mayne and Rist (2006), the roles of evaluators are broader than
    doing evaluation studies/projects—they need to encompass knowledge management for the
    organization. Evaluators need to be prepared to engage with executives and program managers,
    offer them advice and assistance, take a lead role in training and other kinds of events that
    showcase and mainstream evaluation, and generally play a supportive role in building an
    organizational culture that values and relies on timely, reliable, valid, and relevant information
    on programs and policies. In Wildavsky’s (1979) words, an evaluative culture is one wherein
    both managers and evaluators feel supported in “speaking truth to power.”

    Table 11.1 Characteristics of an Evaluative Culture in Organizations

    An organization that has a strong evaluative culture:

    • Engages in self-reflection and self-examination by
    ◦ Seeing evidence on what it is achieving, using both monitoring and evaluation approaches
    ◦ Using evidence of results to challenge and support what it is doing
    ◦ Valuing candor, challenge, and genuine dialogue both horizontally and vertically within the organization
    • Engages in evidence-based learning by
    ◦ Allocating time and resources for learning events
    ◦ Acknowledging and learning from mistakes and poor performance
    ◦ Encouraging and modeling knowledge sharing and fostering the view that knowledge is a resource and not a political weapon
    • Encourages experimentation and change by
    ◦ Supporting program and policy implementation in ways that facilitate evaluation and learning
    ◦ Supporting deliberate risk taking
    ◦ Seeking out new ways of doing business

    Source: Adapted from Mayne (2008, p. 1).

    Organizations with evaluative cultures can also be seen as learning organizations. Morgan
    (2006), following on Senge (1990), suggests that learning organizations develop capacities to

    • Scan and anticipate change in the wider environment to detect significant variations …
    • Develop an ability to question, challenge, and change operating norms and assumptions
    • Allow an appropriate strategic direction and pattern of organization to emerge. (Morgan, 2006, p. 87)

    Key to establishing a learning organization is what Morgan (2006) calls double-loop
    learning—that is, learning that critically assesses existing organizational goals and priorities in
    light of evidence and includes options for adopting new goals and objectives. Organizations
    must get outside their established structures and procedures and instead focus on processes to
    create new information, which in turn can be used to challenge the status quo and make
    changes.

    Garvin (1993) has suggested five “building blocks” for creating learning organizations,
    which are similar to key characteristics of organizations that have evaluative cultures: (1)
    systematic problem solving using evidence, (2) experimentation and evaluation of outcomes
    before broader implementation, (3) learning from past performance, (4) learning from others,
    and (5) treating knowledge as a resource that should be widely communicated.

    Creating Ongoing Streams of Evaluative Knowledge

    Streams of evaluative knowledge comprise both program evaluations and performance
    measurement results (Rist & Stame, 2006). In Chapter 9, we outlined 12 steps that are
important in building and sustaining performance measurement systems in organizations. In that chapter, we also discussed the importance of real-time performance measurement and of making results available to managers. By itself, building a performance measurement system to meet
    periodic external accountability expectations will not ensure that performance information will
    be used internally by organizational managers. The same point can apply to program
    evaluation. Key to a working evaluative culture would be the usefulness of ongoing evaluative
    information to managers, and the responsiveness of evaluators to managerial priorities.

    Patton (1994, 2011) has introduced developmental evaluation as an alternative to
    formative and summative program evaluations. Developmental evaluations view organizations
    as co-evolving in complex environments. Organizational objectives (and hence program
    objectives) and/or the organizational environment may be in flux. Conventional evaluation
    approaches that assume a relatively static program structure in which it is possible to build
    logic models, for example, may have limited application in co-evolving settings. Patton
    suggests that evaluators should take on the role of organizational development specialists,
    working with managers and other stakeholders as team members to offer evaluative
    information in real time so that programs and policies can take advantage of a range of periodic
    and dynamic evaluative information.

    Obstacles to Building and Sustaining an Evaluative Culture

    What are the prospects for building evaluative cultures? Recall that in Chapter 10, we
    suggested that adversarial political cultures can inhibit developing and sustaining performance
    measurement and reporting systems—one effect of making performance results high stakes

    where there are significant internal consequences to reporting performance failures is to
    discourage managers from using externally reported performance results for internal
    management purposes. In effect, managers, when confronted by situations where public
    performance results need to be sanitized or at least carefully presented to reduce political risks,
    tend to decouple those measures from internal performance management uses, preferring
    instead to develop and use other measures that remain internal to the organization.

    Mayne (2008), Mayne and Rist (2006), Patton (2011), and other proponents of evaluative
    cultures are offering us a normative view of what “ought” to occur in organizations. But many
    public sector and nonprofit organizations have to navigate environments or governments that
are adversarial, engendering negative consequences for managers (and their political masters) if
    programs or policies are not “successful,” or if candid information about the weaknesses in
    performance becomes public. What we must keep in mind, much as we did in Chapter 10 when
    we were assessing the prospects for performance measurement and public reporting systems to
    be used for both accountability and performance improvement, is that the environments in
    which public and nonprofit organizations are embedded play an important role in the ways
    organizational cultures evolve and co-adapt.

    To build and sustain an evaluative culture, Mayne (2008) suggests, among other things,
    that

    managers need adequate autonomy to manage for results—Managers seeking to achieve
    outcomes need to be able to adjust their operations as they learn what is working and what
    is not. Managing only for planned outputs does not foster a culture of inquiry about what
    are the impacts of delivering those outputs. (p. 2)

    Refocusing organizational managers on outcomes instead of inputs and offering them
    incentives to perform to those (desired) outcomes has been linked to New Public Management
    ideals of loosening the process constraints on organizations so that managers would have more
    autonomy to improve efficiency and effectiveness (Hood, 1995). But as Moynihan (2008) and
    Gill (2011) point out, what has tended to happen in settings where political cultures are
    adversarial is that performance expectations (objectives, targets, and measures) have been
    layered on top of existing process controls instead of replacing them. In effect, from a
    managerial perspective, there are more controls in place now that performance measurement
    and reporting are part of the picture and less “freedom to manage.”

    What effect does this have on building evaluative cultures? The main issue is the impact on
    the willingness to take risks. Where organizational environments are substantially risk-averse,
    that will condition and limit the prospects for developing an organizational culture that
    encourages risk taking. In short, building and sustaining evaluative cultures requires not only
    supportive organizational leadership but also a political and organizational environment that
    permits reporting evaluative results that are able to acknowledge below-par performance, when
    it occurs.

    MANAGER INVOLVEMENT IN EVALUATIONS: LIMITS AND
    OPPORTUNITIES

    Increasingly, program managers are expected to play a role in evaluating their own programs.
    In many situations, particularly for managers in nonprofit organizations, resources to conduct
    evaluations are scarce. But expectations that programs will be evaluated (and that information

    will be provided that can be used by funders to make decisions about the program’s future) are
    growing. Designing and implementing performance measurement systems also presumes a key
    role for managers.

    In Chapter 10, we discussed the ways in which setting up performance measures to make
    summative judgments about programs can produce unintended consequences—managers will
    respond to the incentives that are implied by the consequences of reporting performance results
    and will shape their behavior accordingly. The “naming and shaming” system of England’s
    health care providers from 2000 to 2005 resulted in substantial problems with the validity of
    the performance data (Bevan & Hamblin, 2009).

    Involving managers, indeed giving them a central role in evaluations that are intended to
    meet external accountability requirements, is different from involving them or even giving
    them the lead in formative evaluations. Because the field of evaluation is so broad and diverse,
    we see a range of views on how much and in what ways managers should be involved in
    evaluations (including performance measurement systems).

    Intended Evaluation Uses and Managerial Involvement

    Most contemporary evaluation approaches emphasize the importance of the ultimate uses
    of evaluations. In fact, there is a growing literature that examines and categorizes different
    kinds of uses (Leviton, 2003; Mark & Henry, 2004). Patton (2008), in his book Utilization-
    Focused Evaluation, points out that the evaluation field has evolved toward making uses of
    evaluations a key criterion. The Program Evaluation Standards (Yarbrough, Shulha, Hopson,
    & Caruthers, 2011), developed by the Joint Committee on Standards for Educational
    Evaluation, make utility one of the five standards for evaluation quality. The other four are
    feasibility, propriety, accuracy, and accountability.

    Many evaluation approaches support involving program managers in the process of
    evaluating programs. Participatory evaluation approaches, for example, emphasize the
    importance of having practitioners involved in evaluations, principally to increase the
    likelihood that the evaluations will be used (Cousins & Whitmore, 1998; Smits & Champagne,
    2008).

    Some evaluation approaches (empowerment evaluation is an example) emphasize
    evaluation use but go beyond practitioner involvement to making social justice–related
    outcomes an important goal of the evaluation process. Empowerment evaluation is intended in
    part to make evaluation part of the normal planning and management of programs and to
    ultimately put managers and staff in charge of their own destinies. “Too often,” argue
    Fetterman, Kaftarian, and Wandersman (1996),

    external evaluation is an exercise in dependency rather than an empowering experience: in
    these instances the process ends when the evaluator departs, leaving participants without
    the knowledge or expertise to continue for themselves. In contrast, an evaluation
    conducted by program participants is designed to be ongoing and internalized in the
    system, creating the opportunity for capacity building. (p. 9)

Initially, Fetterman seemed to view empowerment evaluation as a formative process. He argued that the
    assessment of a program’s worth is not an end point in itself but part of an ongoing process of
    program improvement. Fetterman (2001) acknowledged, however, that

    the value or strength of empowerment evaluation is directly linked to the purpose of the
    evaluation.… Empowerment evaluation makes a significant contribution to internal

    accountability, but has serious limitations in the area of external accountability … An
    external audit or assessment would be more appropriate if the purpose of the evaluation
    was external accountability. (p. 145)

    In a more recent rebuttal of criticism of empowerment evaluation, Fetterman and
    Wandersman (2007) suggest that their approach is capable of producing unbiased evaluations
    and, by implication, evaluations that are defensible as summative products. In response to
    criticism by Cousins (2005), they suggest,

    contrary to Cousins’ (2005) position that “collaborative evaluation approaches … [have]
    … an inherent tendency toward self-serving bias” (p. 206), we have found many
    empowerment evaluations to be highly critical of their own operations, in part because
    they are tired of seeing the same problems and because they want their programs to work.
    Similarly, empowerment evaluators may be highly critical of programs that they favor
    because they want them to be effective and accomplish their intended goals. It may appear
    counterintuitive, but in practice we have found appropriately designed empowerment
    evaluations to be more critical and penetrating than many external evaluations. (Fetterman
    & Wandersman, 2007, p. 184)

    Below, we expand on managerial involvement in evaluation for accountability and
    evaluation for program improvement.

    Evaluating for Accountability

    Public accountability has become nearly a universal expectation in both the public and the
    nonprofit sectors internationally. There are many countries where some regime of public
    accountability exists at both the national and the subnational levels. Evaluating for
    accountability is typically summative, and often the key stakeholders are outside the
    organizations in which the programs being evaluated are located. Stakeholders can include
    central agencies, funders, elected officials, and others, including interest groups and citizens.

    Summative evaluations can be aimed at meeting accountability requirements, but they do
    not have to be. It is possible to have an evaluation that looks at the merit or worth of a program
    (Lincoln & Guba, 1980) but is intended for stakeholders within an organization. A volunteer
    nonprofit board, for example, may be the principal client for a summative evaluation of a
    program, and although the decisions flowing from such an evaluation could affect the future of
    the program, the evaluation could be seen as internal to the organization.

    A good example of an organization that conducts high-stakes accountability evaluations is
    the Government Accountability Office (GAO) in the United States. Although a part of the
    Congress, the GAO straddles the boundary between the executive and the legislative branches
of the U.S. federal government. Eleanor Chelimsky (2008), drawing on her experience at the GAO, candidly describes the “clash of cultures” between evaluation and politics and makes a strong case for the importance of evaluator independence in summative evaluations for accountability. She points to the American separation-of-powers structure as both prompting
    a demand for evaluation and, at the same time, threatening evaluator independence:

    Because our government’s need for evaluation arises from its checks-and-balances
    structure—which, as you know, features separation of powers, legislative oversight, and
    accountability to the people as protectors for individual liberty—evaluators working
    within that structure must deal, not exceptionally but routinely and regularly, with

    political infringements on their independence that result directly from that structure. (p.
    400)

    For Chelimsky (2008), evaluator independence is an essential asset for the GAO in its work
    with the Congress. At the same time, the GAO relies on government agencies to contribute to
    its work. It needs to secure the cooperation of the agencies in which the programs being
    evaluated are located. It needs the data that are housed in federal departments and agencies, to
    be able to construct key lines of evidence for evaluations. What Chelimsky has observed over
    time is a growing trend toward limiting access to agency data:

    Between 1980 and 1994—that is, across the Carter, Reagan, Bush, and Clinton
    presidencies—we found that secrecy and classification of information were becoming
    prevalent in an increasing number of agencies. Yet it would be hard to find a more critical
    issue for evaluation than this one. (p. 407)

    Chelimsky’s (2008) view is that this issue, if anything, became more critical under the
    Bush administrations (2001–2008). In effect, agency and managerial involvement in GAO
    evaluations has become a significant political issue in the American government.

    The GAO model of independent evaluations is exceptional—most governments do not
    have a substantial institutional capacity to conduct independent evaluations. Instead, a more
    typical model would be the one in the Canadian federal government, wherein each department
    and agency has at least some evaluation capacity built into the organizational structure but
    evaluation unit heads report to the administrative head of the agency. This model is similar to
    the one advocated by Love (1991) in his description of internal evaluation. Unlike audit, where
    there are typically both internal and external auditors to examine administrative processes and
    even performance, evaluation continues to be an internal function.

    In the Canadian example of the federal evaluation function, housing evaluation capacity in
    departments and agencies makes sense from a formative standpoint; evaluators report to the
    heads of the agencies, and their work would, in principle, be useful for making program-related
    changes. But the overall thrust of the 2009 Federal Evaluation Policy is summative; that is, the
    emphasis in the policy is on evaluations providing information to senior elected and appointed
    officials and being used to fulfill accountability expectations. Evaluators who work in the
    Canadian federal government are expected to wear two hats: They are members of the
    organizations in which they do their evaluation work, but at the same time, they are expected to
    meet the policy requirements set forth by Treasury Board. Like their counterparts in the GAO,
    they need to work with managers to be able to do their work, but unlike the GAO, they do not
    have an institutional base that is independent of the programs they are expected to evaluate.

    Evaluating for Program Improvement

    Most evaluation approaches emphasize the importance of evaluating to improve programs.
    In Chapter 10, we saw that when public sector performance measurement systems are intended
    to be used for both public accountability and performance improvement purposes, one use can
    crowd out the other use. Specifically, requiring performance results to be publicly reported (to
    fulfill accountability expectations) can affect the ways that information is viewed and used
within organizations. Combining evaluation for program improvement with evaluation for accountability can produce similar effects. If organizational managers are invited to be part of an evaluation whose results will become public and may have significant consequences for their programs or organizations, suggestions that the evaluation is also intended to improve the program will be viewed with some skepticism.

    The political culture in which the organization is embedded will affect perceptions of risk,
    willingness to be candid, and perhaps even willingness to provide information for the
    evaluation. Chelimsky (2008) points out that organizationally based information is critical to
    constructing credible program evaluations. Making program evaluation high stakes, that is,
    making evaluation results central to deciding the future of programs or even organizations, will
    weaken the connections between evaluators and evaluands (the programs and managers being
    evaluated), and affect the likelihood of successful future evaluation engagements.

    Manager Bias in Evaluations: Limits to Manager Involvement

    We began with Wildavsky’s (1979) view that managers and evaluators have quite different
and, in some respects, conflicting roles. The field of evaluation as a whole has moved toward a position that makes room for manager involvement in evaluations, which raises the question of what limits, if any, there should be on how managers participate.

    At one end of a continuum of manager involvement, Fetterman and Wandersman (2007)
    suggest that empowerment evaluation as a participatory approach facilitates managers and
    other organizational members taking the lead in conducting both formative and summative
    evaluations of their own programs. This view has been challenged by those who advocate for a
    central role for program evaluators as judges of the merit and worth of programs (Scriven,
    2005). Stufflebeam (1994) challenged advocates of empowerment evaluation around the issue
    of whether managers and other stakeholders (not the evaluator[s]) should make the decisions
    about the evaluation process and evaluation findings. His view is that ceding that amount of
    control invites “corrupt or incompetent evaluation activity” (p. 324):

    Many administrators caught in political conflicts over programs or needing to improve
    their public relations image likely would pay handsomely for such friendly, non-
    threatening, empowering evaluation service. Unfortunately, there are many persons who
    call themselves evaluators who would be glad to sell such services. Unhealthy alliances of
    this type can only delude those who engage in such pseudo evaluation practices, deceive
    those whom they are supposed to serve, and discredit the evaluation field as a legitimate
    field of professional practice. (p. 325)

Stufflebeam’s view is a strong critique of empowerment evaluation and, by implication, of other evaluative approaches that cede the central position that evaluation professionals hold in conducting both formative and summative evaluations. Even so, the roles that evaluators and managers play often differ. The views put forward by advocates for
    empowerment evaluation (Fetterman & Wandersman, 2007) suggest assumptions about what
    motivates program managers that are similar to Le Grand’s (2010) suggestion that historically,
    public servants in Britain were assumed to be interested in “doing the right thing” in their
    work. In other words, managers would not be self-serving but instead would be motivated by a
    desire to serve the public. Le Grand (2010) called such public servants “knights.” His own
    view is that this assumption is naïve and needs to be tempered by considering the incentives
    that shape behaviors.

    The nature of organizational politics and the interactions between organizations and their
environments usually mean that the managerial interest in preserving and enhancing programs is challenged by the role that evaluators play in judging the merit and worth of those programs.

    Expecting managers to evaluate their own programs can result in biased program
    evaluations. Indeed, a culture can be built up around the evaluation function such that
    evaluators are expected to be advocates for programs. Under such conditions, departments and
    agencies would use their evaluation capacity to defend their programs, structuring evaluations
    and presenting results so that programs are seen to be above criticism. In the language used in
    Chapter 10 to describe situations where performance measurement systems produced
unintended results, gaming of the program evaluation function can occur.

    Evaluations produced by organizations under such conditions will tend to be viewed
    outside the organization with skepticism. Funders, or analysts who are employed by the
    funders, will work hard to expose weaknesses in the methodologies used and cast doubt on the
    information in the evaluation reports. In effect, adversarial relationships can develop, which
    serve to “expose” weaknesses in evaluations, but are generally not conducive to building self-
    evaluating or learning organizations. As well, such controversies can undermine a sense that
    the organization is accountable.

    The reality is that expecting program managers to evaluate their own programs, particularly
    where evaluation results are likely to be used in funding decisions, is likely to produce
    evaluations that reflect the natural incentives and risk aversion inherent in such situations.
    They are not necessarily credible even to the managers themselves. Program evaluation, as an
    organizational function, becomes distorted and contributes to a view that evaluations are
    biased.

    Parenthetically, Nathan (2000), who has worked with several top American policy research
    centers, points out that internal evaluations are not the only ones that may reflect incentives
    that bias evaluation results:

    Even when outside organizations conduct evaluations, the politics of policy research can
    be hard going. To stay in business, a research organization (public or private) has to
    generate a steady flow of income. This requires a delicate balance in order to have a
    critical mass of support for the work one wants to do and at the same time maintain a high
    level of scientific integrity. (p. 203)

    Nevertheless, such incentives are likely to be more prevalent and stronger with internal
    evaluations.

    Should managers participate in evaluations of their own programs? Generally, scholars and
    practitioners who have addressed this question have favored managerial involvement. Love
    (1991) envisions (internal) evaluators working closely with program managers to produce
    evaluations on issues that are of direct relevance to the managers. Patton (2008) stresses that
    among the fundamental premises of utilization-focused evaluation, the first is commitment to
    working with the intended users to ensure that the evaluation actually gets used.

    STRIVING FOR OBJECTIVITY IN PROGRAM EVALUATIONS

    Chelimsky (2008), in her description of the challenges to independence that are endemic in the
    work that the GAO does, makes a case for the importance of evaluations being objective:

    The strongest defense for an evaluation that’s in political trouble is its technical
    credibility, which, for me, has three components. First, the evaluation must be technically
    competent, defensible, and transparent enough to be understood, at least for the most part.
    Second, it must be objective: That is, in Matthew Arnold’s terms (as cited in Evans,

    2006), it needs to have “a reverence for the truth.” And third, it must not only be but also
    seem objective and competent: That is, the reverence for truth and the methodological
    quality need to be evident to the reader of the evaluation report. So, by technical
    credibility, I mean methodological competence and objectivity in the evaluation, and the
    perception by others that both of these characteristics are present. (p. 411)

    Clearly, Chelimsky sees the value in claiming that high-stakes GAO evaluations are
    objective. “Objective” is also a desired attribute of the information produced in federal
    evaluations in Canada: “Evaluation … informs government decisions on resource allocation
    and reallocation by … providing objective information to help Ministers understand how new
    spending proposals fit” (Treasury Board of Canada Secretariat, 2009, sec. 3.2).

Evaluation is fundamentally about linking theory and practice. Notwithstanding the practitioner views cited above (that objectivity is desirable), academics in the field have not tended to emphasize “objectivity” as a criterion for good-quality evaluations (Conley-Tyler,
    2005; Patton, 2008). Stufflebeam (1994), one exception, emphasizes the importance of what he
    calls “objectivist evaluation” (p. 326) in professional evaluation practice. His definition of
    objectivist evaluation picks up some of the themes articulated by Chelimsky (2008) above. For
    Stufflebeam (1994),

    objectivist evaluations are based on the theory that moral good is objective and
    independent of personal or merely human feelings. They are firmly grounded in ethical
    principles, strictly control bias or prejudice in seeking determinations of merit and worth,
    … obtain and validate findings from multiple sources, set forth and justify conclusions
about the evaluand’s merit and/or worth, report findings honestly and fairly to all right-to-know audiences, and subject the evaluation process and findings to independent
    assessments against the standards of the evaluation field. Fundamentally, objectivist
    evaluations are intended to lead to conclusions that are correct—not correct or incorrect
    relative to a person’s position, standing or point of view. (p. 326)

    Scriven has also advocated for good evaluations to be objective. For Scriven (1997),
    objectivity is defined as “with basis and without bias” (p. 480), and an important part of being
    able to claim that an evaluation is objective is to maintain an appropriate distance between the
    evaluator and what is being evaluated (the evaluand). There is a crucial difference, for Scriven,
between being an evaluator and being an evaluation consultant. Evaluators rely on validity as their stock-in-trade, and objectivity is a central part of being able to claim that one’s work is valid. Evaluation consultants work with their clients and stakeholders, but according to Scriven, in the end they cannot offer analysis, conclusions, or recommendations that are untainted by those interactions and the biases they entail.

    In addition to Scriven’s view that objectivity is a key part of evaluation practice, other
    related professions have asserted, and continue to assert, that professional practice is, or at least
    ought to be, objective. In the 2003 edition of the Government Auditing Standards (GAO,
    2003), government auditors are enjoined to perform their work this way:

    Professional judgment requires auditors to exercise professional skepticism, which is an
    attitude that includes a questioning mind and a critical assessment of evidence. Auditors
    use the knowledge, skills, and experience called for by their profession to diligently
    perform, in good faith and with integrity, the gathering of evidence and the objective
    evaluation of the sufficiency, competency, and the relevancy of evidence. (p. 51)

    Should evaluators claim that their work is also objective? Objectivity has a certain cachet,
    and as a practitioner, it would be appealing to be able to assert to prospective clients that one’s
    work is objective. Indeed, in situations where evaluators are competing with auditors for
    clients, claiming objectivity could be an important factor in convincing clients to use the
    services of an evaluator.

    Can Program Evaluators Be Objective?

    If giving managers a (substantial) stake in evaluations compromises evaluator and
    evaluation objectivity, then it is important to unpack what is entailed by claims that evaluations
    or audits are objective. Is Scriven’s definition of objectivity defensible? Is objectivity a
    meaningful criterion for high-quality program evaluations? Could we defend a claim to a
    prospective client that our work would be objective?

    Scriven (1997) suggests a metaphor to understand the work of an evaluator: When we do
    program evaluations, we can think of ourselves as expert witnesses. We are, in effect, called to
    “testify” about a program, we offer our expert opinions, and the “court” (our client) can decide
    what to do with our contributions.

    Scriven (1997) takes the courtroom metaphor further when he asserts that in much the same
    way that witnesses are sworn to tell “the truth, the whole truth, and nothing but the truth” (p.
    496), evaluators can rely on a common-sense notion of the truth as they do their work. If such
    an oath “works” in courts (Scriven believes it does), then despite the philosophical questions
    that can be raised by a claim that something is true, we can and should continue to rely on a
    common-sense notion of what is true and what is not.

    Scriven’s main point is that program evaluators should be prepared to offer objective
    evaluations and that to do so, it is essential that we recognize the difference between
    conducting ourselves in ways that promote our objectivity and ways that do not. Even those
    who assert that there cannot be any truths in our work are, according to Scriven, uttering a self-
    contradictory assertion: They wish to claim the truth of a statement that there are no truths.

    Although Scriven’s argument has a common-sense appeal, it is important to examine it
more closely. There are two main issues with the approach he takes.

    First, Scriven’s metaphor of evaluators as expert witnesses does have some limitations. In
    courts of law, expert witnesses are routinely challenged by their counterparts and by opposing
    lawyers. Unlike Scriven’s evaluators, who do their work, offer their report, and then absent
    themselves to avoid possible compromises of their objectivity, expert witnesses in courts
    undergo a high level of scrutiny. Even where expert witnesses have offered their version of the
    truth, it is often not clear whether that is their view or the views of a party to a legal dispute.
    Expert witnesses can sometimes be “purchased.”

    Second, witnesses speaking in court can be severely penalized if it is discovered that they
    have lied under oath. For program evaluators, it is far less likely that sanctions will be brought
    to bear even if it could be demonstrated that an evaluator did not speak “the truth.”
    Undoubtedly, an evaluator’s place in the profession can be affected when the word gets around
    that he or she has been “bought” by a client, but the reality is that in the practice of program
    evaluation, clients can and do shop for evaluators who are likely to “do the job right.” “Doing
    the job right” can mean that evaluators are paid to not speak “the truth, the whole truth, and
    nothing but the truth.”

    Looking for a Defensible Definition of Objectivity

    Are there other definitions of objectivity that are useful in terms of assisting our practice of
    program evaluation? The Federal Government of Canada’s OCG (Office of the Comptroller
    General) was among the government jurisdictions that historically advocated the importance of
    objectivity in evaluations. In one statement, objectivity was defined this way:

    Objectivity is of paramount importance in evaluative work. Evaluations are often
    challenged by someone: a program manager, a client, senior management, a central
    agency or a minister. Objectivity means that the evidence and conclusions can be verified
    and confirmed by people other than the original authors. Simply stated, the conclusions
    must follow from the evidence. Evaluation information and data should be collected,
    analyzed and presented so that if others conducted the same evaluation and used the same
    basic assumptions, they would reach similar conclusions. (Treasury Board of Canada
    Secretariat, 1990, p. 28)

    This definition of objectivity emphasizes the reliability of evaluation findings and
    conclusions, and is similar to the way auditors define high-quality work in their profession.
    This implies, at least in principle, that the work of one evaluator or one evaluation team could
    be repeated, with the same results, by a second evaluation of the same program.

    A Natural Science Definition of Objectivity

    The OCG criterion of repeatability is similar in part to the way scientists do their work.
    Findings and conclusions, to be accepted by the discipline, must be replicable.

    There is, however, an important difference between program evaluation practice and the
    practice of scientific disciplines. In the sciences, the methodologies and procedures that are
    used to conduct research and report the results are intended to facilitate replication. Methods
    are scrutinized by one’s peers, and if the way the work has been conducted and reported passes
    this test, it is then “turned over” to the community of researchers, where it is subjected to
    independent efforts to replicate the results. In other words, meaningfully claiming objectivity
    requires both the use of replicable methodologies and actual replications of programs and
    policies. In practical terms, satisfying both of these criteria is rare.

    If a particular set of findings cannot be replicated by independent researchers, the
    community of research peers eventually discards the results as an artifact of the setting or the
    scientist’s biases. Transparent methodologies are necessary but not sufficient to establish
    objectivity of scientific results. The initial reports of cold fusion reactions (Fleischmann &
    Pons, 1989), for example, prompted additional attempts to replicate the reported findings, to no
avail. Fleischmann and Pons’s research methods proved to be faulty, and cold fusion did not
    pass the test of replicability.

    A more contemporary controversy that also hinges on being able to replicate experimental
    results is the question of whether high-energy neutrinos can travel faster than the speed of
    light. If such a finding were corroborated (reproduced by independent teams of researchers), it
    would undermine a fundamental assumption of Einstein’s relativity theory—that no particle
    can travel faster than the speed of light. The back-and-forth “dialogue” in the high-energy
    physics community is illustrated by a publication that claims that the one set of experimental
    results (apparently replicating the original experiment) were wrong and that Einstein’s theory
    is safe (Antonello et al., 2012). The dialogue between the experimentalists and the
    theoreticians in physics on whether neutrinos actually have been measured traveling faster than

    the speed of light has the potential to change physics as we know it. The stakes are high, and
    therefore, the canons of scientific research must be respected.

    For scientists, then, objectivity has two important elements, both of which are necessary.
Methods and procedures need to be constructed and applied so that both the work done and the findings are open to scrutiny by one’s peers. Although the process of doing a given
    science-based research project does not by itself make the research objective, it is essential that
    this process be transparent. Scrutability of methods facilitates repeating the research. If
    findings can be replicated independently, the community of scholars engaged in similar work
    confers objectivity on the research. Even then, scientific findings are not treated as absolutes.
    Future tests might raise questions, offer refinements, and generally increase knowledge.

    This working definition of objectivity does not imply that objectivity confers “truth” on
    scientific findings. Indeed, the idea that objectivity is about scrutability and replicability of
    methods and repeatability of findings is consistent with Kuhn’s (1962) notion of paradigms.
    Kuhn suggested that communities of scientists who share a “worldview” are able to conduct
    research and interpret the results. Within a paradigm, “normal science” is about solving
    puzzles that are implied by the theoretical structure that undergirds the paradigm. “Truth” is
    agreement, based on research evidence, among those who share a paradigm.

    In program evaluation practice, much of what we call methodology is tailored to particular
    settings. Increasingly, we are taking advantage of mixed qualitative–quantitative methods
    (Creswell, 2009; Hearn, Lawler, & Dowswell, 2003) when we design and conduct evaluations,
    and our own judgment as professionals plays an important role in how evaluations are designed
    and data are gathered, interpreted, and reported. Owen and Rogers (1999) make this point
    when they state,

    no evaluation is totally objective: it is subject to a series of linked decisions [made by the
    evaluator]. Evaluation can be thought of as a point of view rather than a statement of
    absolute truth about a program. Findings must be considered within the context of the
    decisions made by the evaluator in undertaking the translation of issues into data
    collection tools and the subsequent data analysis and interpretation. (p. 306)

    Although the OCG criterion of repeatability (Treasury Board of Canada Secretariat, 1990)
    in principle might be desirable, it is rarely applicable to program evaluation practice. Even in
    the audit community, it is rare to repeat the fieldwork that underlies an audit report. Instead,
    the fieldwork is conducted so that all findings are documented and corroborated by more than
    one line of evidence (or one source of information). In effect, there is an audit trail for the
    evidence and the findings.

    Implications for Evaluation Practice

    Where does this leave us? Scriven’s (1997) criteria for objectivity—with basis and without
bias—have some defensibility limitations inasmuch as they usually depend on the “objectivity”
    of individual evaluators in particular settings. Not even in the natural sciences, where the
    subject matter and methods are far more conducive to Scriven’s definition, do researchers rely
    on one scientist’s assertions about “facts” and “objectivity.” Instead, the scientific community
    demands that the methods and results be stated so that the research results can be corroborated
    or disconfirmed, and it is via that process that “objectivity” is conferred. Objectivity is not an
    attribute of one researcher but instead is predicated on the process in the scientific community
    in which that researcher practices.

    In some professional settings where teams of evaluators work on projects, it may be
    possible to construct internal challenge functions and even share draft reports externally to
    increase the likelihood that the final product will be viewed as defensible and robust. But
    repeating an evaluation to confirm the replicability of the findings is almost never done.

    The realities of the practice of program evaluation weaken claims that we evaluators can be
    objective in the work we do. Evaluation is not a science. Instead, it is a craft that mixes
together methods with professional judgment to produce products that are methodologically defensible and tailored to their contexts, and that almost always have unique characteristics.

    CRITERIA FOR HIGH-QUALITY EVALUATIONS: THE
    VARYING VIEWS OF EVALUATION ASSOCIATIONS

    Many professional associations that represent the interests and views of program evaluators
    have developed codes of ethics or best practice guidelines. A review of several of these
    guideline documents indicates that, with one exception (American Educational Research
    Association [AERA], 2011), there is little specific attention to “objectivity” among the criteria
    suggested for good evaluations (AERA, 2011; American Evaluation Association, 2004;
    Australasian Evaluation Society, 2010; Yarbrough, Shulha, Hopson, & Caruthers, 2011).

    Historically, Scriven (1997), Stufflebeam (1994), and, more recently, Chelimsky (2008)
    have emphasized objectivity as a key commodity of program evaluations, and there are
    government organizations that in their guidelines for assessing evaluation reports do discuss
    the issue of objectivity (see, e.g., Treasury Board of Canada Secretariat, 1990, 2009; U.S.
    OMB, 2004b). Markiewicz (2008) provides a provocative discussion about challenges of
    independence and objectivity in the political context of evaluation, noting,

    the challenges presented by the political and stakeholder context of evaluation do raise the
    longstanding paradigm wars between scientific realists and social constructionists. The
    former group of evaluators tend to uphold concepts of objectivity and independence in
    evaluation, while the latter group of evaluators view themselves as negotiators of different
    realities. (p. 35)

    There is one research and evaluation association that has explicitly included objectivity as a
    criterion for high-quality studies. The AERA (2008, p. 1) defines scientifically based research
    as “the use of rigorous, systematic, and objective methodologies to obtain valid and reliable
    knowledge.” The full set of criteria includes the following:

a. development of a logical, evidence-based chain of reasoning;
b. methods appropriate to the questions posed;
c. observational or experimental designs and instruments that provide reliable and generalizable findings;
d. data and analysis adequate to support the findings;
e. explication of procedures and results clearly and in detail, including specification of the population to which the findings can be generalized;
f. adherence to professional norms of peer review;
g. dissemination of the findings to contribute to scientific knowledge; and
h. access to data for reanalysis, replication, and the opportunity to build on findings.

    Evaluating program effectiveness (assessing cause-and-effect relationships) requires
    “experimental designs using random assignment or quasi-experimental or other designs that
    substantially reduce plausible competing explanations for the obtained results” (AERA, 2008,
    p. 1).

    The AERA has been part of the policy changes in the United States in the field of
    education evaluation that began with the No Child Left Behind Act of 2002 (Duffy, Giordano,
    Farrell, Paneque, & Crump, 2008). Duffy et al. (2008) point out that the phrase “scientifically-
    based research” appeared over 100 times in the legislation. The working definition of that
    phrase is very similar to the AERA definition above. Since the No Child Left Behind Act was
    passed, privileging quantitative, experimental, and quasi-experimental evaluation designs has
    had an impact on the whole evaluation community in the United States (Smith, 2007).

    The key question for us is whether the AERA definition of “scientifically based research”
    offers a credible alternative to other standards or guidelines. The AERA definition highlights
    the objectivity of research methodologies and mentions replication as one possible outcome
    from a study. But when we look at the field of education evaluation (and evaluation more
    broadly), we see that the efficacy of randomized controlled trials is substantially limited by
    contextual variables.

    Lykins (2009), in his assessment of the impacts of U.S. federal policy on education
    research, offers this example of the limits of “scientific research” in education:

    Take for instance the much-studied Tennessee STAR experiment in class-size reduction.
    The results of the randomized trial suggested that class-size reduction caused modest
    gains in the test scores of children in early grades. Boruch, De Moya, and Synder (2002)
    cite this study as evidence that “a single RFT can help to clarify the effect of a particular
    intervention against a backdrop of many nonrandomized trials” (p. 74). In fact, the
    experiment taught, at most, only that class-size reductions were responsible for increased
    test-scores for these particular students. It did not lend warrant to the claim that class-size
    reductions are an effective way for raising achievement as such. This became clear when
    California implemented a state-wide policy of class-size reduction. The California
    program not only failed to increase student achievement, but may have been responsible
    for a substantial increase in the number of poorly qualified teachers in high-poverty
    schools, thus actually harming student performance. (pp. 94–95, italics in original)

    The practical effect of privileging (experimental) methodologies that are aimed at
    examining cause-and-effect relationships is that program evaluations are limited in their
    generalizability. Cronbach (1982) pointed this out and effectively countered the then dominant
    view in evaluation that experimental designs, with their overriding emphasis on internal
    validity, were the gold standard.

    For evaluation associations and for evaluators, there are other quality-related criteria that
    are more relevant. With the exception of the AERA, the evaluation profession as a whole has
    generally not been prepared to emphasize objectivity as a criterion for high-quality evaluations.
    Instead, professional evaluation organizations tend to mention the accuracy and credibility of
    evaluation information (American Evaluation Association, 2004; Canadian Evaluation Society,
    2012; Organisation for Economic Cooperation and Development, 2010; Yarbrough et al.,
    2011), the honesty and integrity of evaluators and the evaluation process (American Evaluation
    Association, 2004; Australasian Evaluation Society, 2010; Canadian Evaluation Society, 2012;

    Yarbrough et al., 2011), the fairness of evaluation assessments (Australasian Evaluation
    Society, 2010; Canadian Evaluation Society, 2012; Yarbrough et al., 2011), and the validity
    and reliability of evaluation information (American Evaluation Association, 2004; Canadian
    Evaluation Society, 2012; Organisation for Economic Cooperation and Development, 2010;
    Yarbrough et al., 2011).

    In addition, professional guidelines emphasize the importance of declaring and avoiding
    conflicts of interest (American Evaluation Association, 2004; Australasian Evaluation Society,
    2010; Canadian Evaluation Society, 2012; Yarbrough et al., 2011) and the importance of
    impartiality in reporting findings and conclusions (Organisation for Economic Cooperation and
    Development, 2010). Evaluator independence is also mentioned as a criterion (Markiewicz,
    2008). Also, guidelines tend to emphasize the importance of competence in conducting
    evaluations, and the importance of upgrading evaluation skills (American Evaluation
    Association, 2004; Australasian Evaluation Society, 2010; Canadian Evaluation Society,
    2012). Collectively, these guidelines cover many of the characteristics of evaluators and
    evaluations that we might associate with objectivity: accuracy, credibility, validity, reliability,
    fairness, honesty, integrity, and competence. Transparency is also a criterion mentioned in
    some guidelines and standards (see, e.g., Organisation for Economic Cooperation and
    Development, 2010; Yarbrough et al., 2011). But—and this is a key point—objectivity is more
    than just having good evaluators or even good evaluations; it is a process that involves
    corroboration of one’s findings by one’s peers. Our profession is so diverse and includes so
    many different epistemological and methodological stances that asserting “objectivity” would
    not be supported by most evaluators.

    But, the evaluation profession does not exist alone in the current world of professionals
    who claim expertise in evaluating programs. The movement to connect evaluation to
    accountability expectations in public sector and nonprofit organizations has created situations
    where evaluation professionals, with their diverse backgrounds and standards, are compared
    with accounting professionals or with management consultants, who generally have a more
    uniform view of their respective professional standards. Because the public sector auditing
community in particular has claimed objectivity as a hallmark of its practice, it is arguable that it has a marketing advantage with prospective clients (see Everett, Green, & Neu, 2005;
    Radcliffe, 1998). Furthermore, with some key central agencies asserting that in assessing the
    quality of evaluations one of the key criteria should be objectivity of the findings (Treasury
    Board of Canada Secretariat, 2009), that criterion confers an advantage on practitioners who
    claim that their process and products are objective. Patton (2008) refers to the politics of
    objectivity, meaning that for some evaluators it is important to be able to declare that their
    work is objective.

    What should evaluators tell prospective clients who, having heard that the auditing
    profession or management consultants (Institute of Management Consultants, 2008) make
    claims about their work being objective, expect the same from a program evaluation? If we tell
    clients that we cannot produce an objective evaluation, there may be a risk of their going
elsewhere for assistance. On the other hand, claims that we can be objective are not supported by the realities of how evaluators actually work.

    Perhaps the best way to respond is to offer criteria that cover much of the same ground as is
    covered if one conducts evaluations with a view to their being “objective.” Criteria like
    accuracy, credibility, honesty, completeness, fairness, impartiality, avoiding conflicts of
    interest, competence in conducting evaluations, and a commitment to staying current in skills
    are all relevant. They would be among the desiderata that scientists and others who can make
    defensible claims about objectivity would include in their own practice. The criteria mentioned

    are also among the principal ones included by auditors and accountants in their own standards
    (GAO, 2003).

    Patton (2008) takes a pragmatic stance in his own assessment of whether to claim that
    evaluations are objective:

    Words such as fairness, neutrality, and impartiality carry less baggage than objectivity and
    subjectivity. To stay out of the argument about objectivity, I talk with intended users
    about balance, fairness, and being explicit about what perspectives, values, and priorities
    have shaped the evaluation, both the design and the findings. (p. 452)

    To sum up, current guidelines and standards that have been developed by professional
    evaluation associations generally do not claim that program evaluations should be objective.
    Correspondingly, as practicing professionals, we should not be making such claims in our
    work. That does not mean that we are without standards, and indeed, we should be striving to
    be honest, accurate, fair, impartial, competent, highly skilled, and credible in the work we do.
    If we are these things, we can justifiably claim that our work meets the same professional
    standards as work done by scholars and practitioners who might claim to be objective.

    SUMMARY

    The relationships between managers and evaluators are affected by the incentives that each
    party faces in particular contexts. If evaluators have been commissioned to conduct a
    summative evaluation, it is more likely that program managers will defend their programs,
    particularly where the stakes are perceived to be high. Expecting managers, under these
    conditions, to participate as neutral parties in an evaluation ignores the potential for conflicts of
    commitments, which can affect the accuracy and completeness of information that managers
    provide about their own programs. This problem parallels the problem that exists in
    performance measurement systems, where public, high-stakes, summative uses of performance
    results will tend to induce gaming of the system by those who are affected by the consequences
    of disseminating performance results.

    Formative evaluations, where it is generally possible to project a “win-win” scenario for
    managers and evaluators, offer incentives for managers to be forthcoming so that they benefit
    from an assessment based on an accurate and complete understanding of their programs.
    Historically, a majority of evaluations have been formative. Although advocates for program
    evaluation and performance measurement imply that evaluations can be used for resource
    allocation/reallocation decisions, it is comparatively rare to have an evaluation that does that.
    There has been a gap between the promise and the performance of evaluation functions in
    governments in that regard (Muller-Clemm & Barnes, 1997).

    Many evaluation approaches encourage or even mandate manager or organizational
    participation in evaluations. Where utilization of evaluation results is a central concern of
    evaluation processes, managerial involvement has been shown to increase uses of evaluation
    findings. Some evaluation approaches—empowerment evaluation is an example of an
    important and relatively new approach—suggest that control of the evaluation process should
    be devolved to those in the organizations and programs being evaluated. This view is contested
in the evaluation field and continues to be debated by evaluation scholars and practitioners.

    Promoting quality standards for evaluations continues to be an important indicator of the
    professionalization of evaluation practice. Although objectivity has been a desired feature of

    “good” evaluations in the past, professional associations have generally opted not to emphasize
    objectivity among the criteria that define high-quality evaluations.

    Evaluators, accountants, and management consultants will continue to be connected with
    efforts by government and nonprofit organizations to be more accountable. In some situations,
    evaluation professionals, accounting professionals, and management consultants will compete
    for work with clients. Because the accounting profession continues to assert that their work is
    objective, evaluators will have to address the issue of how to characterize their own practice,
    so that clients can be assured that the work of evaluators meets standards of rigor, defensibility,
    and ethical practice.

    DISCUSSION QUESTIONS

1. Why are summative evaluations more challenging to do than formative evaluations?
2. How should program managers be involved in evaluations of their own programs?
3. What is a learning organization, and how is the culture of a learning organization supportive of evaluation?
4. What are the advantages and disadvantages of relying on internal evaluators in public sector and nonprofit organizations?
5. What is an evaluative culture in an organization? What roles would evaluators play in building and sustaining such a culture?
6. What would it take for an evaluator to claim that her or his evaluation is objective? Given those requirements, is it possible for any evaluator to say that his or her evaluation is objective? Under what circumstances, if any?
7. Suppose that you are a practicing evaluator and you are discussing a possible contract to do an evaluation for an agency. The agency director is very interested in your proposal but, in the discussions, says that he wants an objective evaluation. If you are willing to tell him that your evaluation will be objective, you have the contract. How would you respond to this situation?
8. Other professions like medicine, law, accounting, and social work have guidelines for professional practice that can be enforced against individual practitioners, if need be. Evaluation has guidelines, but they are not enforceable. What would be the advantages and disadvantages of the evaluation profession having enforceable practice guidelines? Who would do the enforcing?

    REFERENCES

    American Educational Research Association. (2008). Definition of scientifically based
    research. Retrieved from
    http://www.aera.net/Portals/38/docs/About_AERA/KeyPrograms/
    DefinitionofScientificallyBasedResearch

    American Educational Research Association. (2011). Code of ethics: American Educational
    Research Association—approved by the AERA Council February 2011. Retrieved from
    http://www.aera.net/Portals/38/docs/
    About_AERA/CodeOfEthics(1)

    American Evaluation Association. (2004). Guiding principles for evaluators. Retrieved from
    http://www.eval.org/Publications/GuidingPrinciples.asp

    Antonello, M., Aprili, P., Baibussinov, B., Baldo Ceolin, M., Benetti, P., Calligarich, E., …
    Zmuda, J. (2012). A search for the analogue to Cherenkov radiation by high energy
    neutrinos at superluminal speeds in ICARUS. Physics Letters B, 711(3–4), 270–275.

    Australasian Evaluation Society. (2010). AES guidelines for the ethical conduct of evaluations.
    Retrieved from http://www.aes.asn.au/

    Bevan, G., & Hamblin, R. (2009). Hitting and missing targets by ambulance services for
    emergency calls: Effects of different systems of performance measurement within the UK.
    Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 161–190.

    Canadian Evaluation Society. (2012). Program evaluation standards. Retrieved from
    http://www.evaluationcanada.ca/site.cgi?s=6&
    ss=10&_lang=EN

    Chelimsky, E. (2008). A clash of cultures: Improving the “fit” between evaluative
    independence and the political requirements of a democratic society. American Journal of
    Evaluation, 29(4), 400–415.

    Conley-Tyler, M. (2005). A fundamental choice: Internal or external evaluation? Evaluation
    Journal of Australasia, 5(1&2), 3–11.

    Cousins, J. B. (2005). Will the real empowerment evaluation please stand up? A critical friend
    perspective. In D. Fetterman & A. Wandersman (Eds.), Empowerment evaluation
    principles in practice (pp. 183–208). New York, NY: Guilford Press.

    Cousins, J. B., & Whitmore, E. (1998). Framing participatory evaluation. New Directions for
    Evaluation, 80, 5–23.

    Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods
    approaches. Thousand Oaks, CA: Sage.

    Cronbach, L. J. (1982). Designing evaluations of educational and social programs (1st ed.).
    San Francisco, CA: Jossey-Bass.

    de Lancer Julnes, P., & Holzer, M. (2001). Promoting the utilization of performance measures
    in public organizations: An empirical study of factors affecting adoption and
    implementation. Public Administration Review, 61(6), 693–708.

    Duffy, M., Giordano, V. A., Farrell, J. B., Paneque, O. M., & Crump, G. B. (2008). No Child
    Left Behind: Values and research issues in high-stakes assessments. Counseling and
    Values, 53(1), 53–66.

    Everett, J., Green, D., & Neu, D. (2005). Independence, objectivity and the Canadian CA
    profession. Critical Perspectives on Accounting, 16(4), 415–440.

Fetterman, D. (2001). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.

    Fetterman, D., Kaftarian, S. J., & Wandersman, A. (1996). Empowerment evaluation: Knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.

    Fetterman, D., & Wandersman, A. (2007). Empowerment evaluation: Yesterday, today, and tomorrow. American Journal of Evaluation, 28(2), 179–198.

    Fleischmann, M., & Pons, S. (1989). Electrochemically induced nuclear fusion of deuterium. Journal of Electroanalytical Chemistry, 261(2A), 301–308.

    Garvin, D. A. (1993). Building a learning organization. Harvard Business Review, 71(4), 78–90.

    Gill, D. (Ed.). (2011). The iron cage recreated: The performance management of state organisations in New Zealand. Wellington, NZ: Institute of Policy Studies.

    Government Accountability Office. (2003, August). Government auditing standards: 2003
    revision (GAO-03–673G). Washington, DC: Author.

    Hearn, J., Lawler, J., & Dowswell, G. (2003). Qualitative evaluations, combined methods and
    key challenges: General lessons from the qualitative evaluation of community intervention
    in stroke rehabilitation. Evaluation, 9(1), 30–54.

    Hood, C. (1995). The “new public management” in the 1980s: Variations on a theme.
    Accounting, Organizations and Society, 20(2–3), 93–109.

    Institute of Management Consultants. (2008). IMC code of ethics & member’s pledge.
    Retrieved from http://www.imc.org.au/Become-a-Member/Membership/
    IMC-CODE-OF-ETHICS-MEMBERS-PLEDGE.asp

    King, J. A., Cousins, J. B., & Whitmore, E. (2007). Making sense of participatory evaluation:
    Framing participatory evaluation. New Directions for Evaluation, 114, 83–105.

    Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago
    Press.

    Le Grand, J. (2010). Knights and knaves return: Public service motivation and the delivery of
    public services. International Public Management Journal, 13(1), 56–71.

    Leviton, L. C. (2003). Evaluation use: Advances, challenges and applications. American
    Journal of Evaluation, 24(4), 525–535.

    Lincoln, Y. S., & Guba, E. G. (1980). The distinction between merit and worth in evaluation.
    Educational Evaluation and Policy Analysis, 2(4), 61–71.

    Love, A. J. (1991). Internal evaluation: Building organizations from within. Newbury Park,
    CA: Sage.

    Lykins, C. (2009). Scientific research in education: An analysis of federal policy (Doctoral
    dissertation). Nashville, TN: Graduate School of Vanderbilt University. Retrieved from
    http://etd.library.vanderbilt.edu/available/etd-
    07242009-114615/unrestricted/lykins

    Mark, M. M., & Henry, G. T. (2004). The mechanisms and outcomes of evaluation influence.
    Evaluation, 10(1), 35–57.

    Markiewicz, A. (2008). The political context of evaluation: What does this mean for
    independence and objectivity? Evaluation Journal of Australasia, 8(2), 35–41.

    Mayne, J. (2008). Building an evaluative culture for effective evaluation and results
    management. Retrieved from http://www.cgiar-ilac.org/files/publications/briefs/
    ILAC_Brief20_Evaluative_Culture

    Mayne, J., & Rist, R. C. (2006). Studies are not enough: The necessary transformation of
    evaluation. Canadian Journal of Program Evaluation, 21(3), 93–120.

Morgan, G. (2006). Images of organization (Updated ed.). Thousand Oaks, CA: Sage.

    Moynihan, D. P. (2008). The dynamics of performance management: Constructing information and reform. Washington, DC: Georgetown University Press.

    Muller-Clemm, W. J., & Barnes, M. P. (1997). A historical perspective on federal program evaluation in Canada. Canadian Journal of Program Evaluation, 12(1), 47–70.

    Nathan, R. P. (2000). Social science in government: The role of policy researchers (Updated ed.). Albany, NY: Rockefeller Institute Press.

    New York City Department of Homeless Services. (2010). City council hearing general welfare committee “Oversight: DHS’s Homebase Study.” Retrieved from http://nycppf.org/html/dhs/downloads/pdf/abt_testimony_120910

    Norris, N. (2005). The politics of evaluation and the methodological imagination. American
    Journal of Evaluation, 26(4), 584–586.

    Office of the Comptroller General of Canada. (1981). Guide on the program evaluation
    function. Ottawa, Ontario, Canada: Treasury Board of Canada Secretariat.

    Organisation for Economic Cooperation and Development. (2010). Evaluation in development
    agencies: Better aid. Paris, France: Author.

    Owen, J. M., & Rogers, P. J. (1999). Program evaluation: Forms and approaches
    (International ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

    Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks, CA: Sage.

    Patton, M. Q. (2008). Utilization-focused evaluation (4th ed.). Thousand Oaks, CA: Sage.

    Patton, M. Q. (2011). Developmental evaluation: Applying complexity to enhance innovation and use. New York: Guilford Press.

    Radcliffe, V. S. (1998). Efficiency audit: An assembly of rationalities and programmes. Accounting, Organizations and Society, 23(4), 377–410.

    Rist, R. C., & Stame, N. (Eds.). (2006). From studies to streams: Managing evaluative systems (Vol. 12). New Brunswick, NJ: Transaction.

    Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W. R. Shadish (Eds.), Evaluation for the 21st century: A handbook (pp. 477–500). Thousand Oaks, CA: Sage.

    Scriven, M. (2005). Review of the book: Empowerment evaluation principles in practice.
    American Journal of Evaluation, 26(3), 415–417.

    Senge, P. M. (1990). The fifth discipline: The art and practice of the learning organization (1st
    ed.). New York: Doubleday/Currency.

    Smith, N. L. (2007). Empowerment evaluation as evaluation ideology. American Journal of
    Evaluation, 28(2), 169–178.

    Smits, P., & Champagne, F. (2008). An assessment of the theoretical underpinnings of
    practical participatory evaluation. American Journal of Evaluation, 29(4), 427–442.

    Stufflebeam, D. L. (1994). Empowerment evaluation, objectivist evaluation, and evaluation
    standards: Where the future of evaluation should not go and where it needs to go.
    Evaluation Practice, 15(3), 321–338.

    Treasury Board of Canada Secretariat. (1990). Program evaluation methods: Measurement and
    attribution of program results (3rd ed.). Ottawa, Ontario, Canada: Deputy Comptroller
    General Branch, Government Review and Quality Services.

    Treasury Board of Canada Secretariat. (2009). Policy on evaluation. Retrieved from
    http://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=15024

    U.S. Office of Management and Budget. (2002). Program performance assessments for the FY
    2004 budget: Memorandum for heads of executive departments and agencies from Mitchell
    E. Daniels Jr. Retrieved from http://www.whitehouse.gov/sites/default/files/omb/
    assets/omb/memoranda/m02-10

    U.S. Office of Management and Budget. (2004). What constitutes strong evidence of a
    program’s effectiveness? Retrieved from http://www.whitehouse.gov/omb/part/2004_
    program_eval

    Watson, K. F. (1986). Programs, experiments, and other evaluations: An interview with
    Donald Campbell. Canadian Journal of Program Evaluation, 1(1), 83–86.

    Wildavsky, A. B. (1979). Speaking truth to power: The art and craft of policy analysis.
    Boston, MA: Little, Brown.

    Wolff, E. (1979). Proposed approach to program evaluation in the Government of British
    Columbia. Victoria, British Columbia, Canada: Treasury Board.

    Yarbrough, D., Shulha, L., Hopson, R., & Caruthers, F. (2011). Joint committee on standards
    for educational evaluation: A guide for evaluators and evaluation users (3rd ed.).
    Thousand Oaks, CA: Sage.

    CHAPTER 12

    THE NATURE AND PRACTICE OF
    PROFESSIONAL JUDGMENT IN PROGRAM
    EVALUATION

    Introduction
    The Nature of the Evaluation Enterprise

    What Is Good Evaluation Practice?

    Methodological Considerations

    Problems With Experimentation as a Criterion for Good Methodologies

    The Importance of Causality: The Core of the Evaluation Enterprise

    Alternative Perspectives on the Evaluation Enterprise

    Reconciling Evaluation Theory With the Diversity of Practice

    Working in the Swamp: The Real World of Evaluation Practice

    Common Ground Between Program Evaluators and Program Managers

    Situating Professional Judgment in Program Evaluation Practice

    Acquiring Knowledge and Skills for Evaluation Practice

    Professional Knowledge as Applied Theory

    Professional Knowledge as Practical Know-How

    Balancing Theoretical and Practical Knowledge in Professional Practice

    Understanding Professional Judgment

    A Modeling of the Professional Judgment Process

    The Decision Environment

Values, Beliefs, and Expectations

    Acquiring Professional Knowledge

    Improving Professional Judgment in Evaluation Through Reflective Practice

    Guidelines for the Practitioner

    The Range of Professional Judgment Skills

Ways of Improving Sound Professional Judgment Through Education and Training-Related Activities

    Teamwork and Professional Judgment

    Evaluation as a Craft: Implications for Learning to Become an Evaluation Practitioner

    Ethics for Evaluation Practice

    The Development of Ethics for Evaluation Practice

    Ethical Evaluation Practice

    Cultural Competence in Evaluation Practice

    The Prospects for an Evaluation Profession

    Summary
    Discussion Questions
    Appendix

    Appendix A: Fiona’s Choice: An Ethical Dilemma for a Program Evaluator
    Your Task
    References

    Good judgment is based on experience. Unfortunately, experience is based on poor
    judgment.

    —Anonymous

    INTRODUCTION

Chapter 12 begins by reflecting on what constitutes good evaluation methodology, reminding the
    reader that in the evaluation field there continues to be considerable disagreement about how
    we should design evaluations to assess program effectiveness and, in so doing, examine causes
    and effects. We then look at the diversity of evaluation practice, and how evaluators actually
    do their work. Developing the capacity to exercise sound professional judgment is key to
    becoming a competent evaluator.

    Much of Chapter 12 is focused on what professional judgment is, how to cultivate sound
    professional judgment, and how evaluation education and training programs can build in
    opportunities to learn the practice of exercising professional judgment. Key to developing
    one’s own capacity to render sound professional judgments is learning how to be more
    reflective in one’s evaluation practice. We introduce evaluation ethics and connect ethics to
    professional judgment in evaluation practice.

    Throughout this book, we have referred to the importance of professional judgment in the
    practice of evaluation. Our view is that evaluators rely on their professional judgment in all
    evaluation settings. Although most textbooks in the field, as well as most academic programs
    that prepare evaluators for careers as practitioners, do not make the acquisition or practice of
    sound professional judgment an explicit part of evaluator training, this does not change the fact
    that professional judgment is an integral part of our practice.

    To ignore or minimize the importance of professional judgment suggests a scenario that
    has been described by Schön (1987) as follows:

    In the varied topography of professional practice, there is the high, hard ground
    overlooking a swamp. On the high ground, manageable problems lend themselves to
    solutions through the application of research-based theory and technique. In the swampy
    lowland, messy, confusing problems defy technical solutions.… The practitioner must
    choose. Shall he remain on the high ground where he can solve relatively unimportant
    problems according to prevailing standards of rigor, or shall he descend to the swamp of
    important problems and non-rigorous inquiry? (p. 3)

    THE NATURE OF THE EVALUATION ENTERPRISE

    Evaluation can be viewed as a structured process that creates and synthesizes information that
    is intended to reduce the level of uncertainty for stakeholders about a given program or policy.
    It is intended to answer questions (see the list of evaluation questions discussed in Chapter 1)
    or test hypotheses, the results of which are then incorporated into the information bases used
    by those who have a stake in the program or policy.

    What Is Good Evaluation Practice?
    Methodological Considerations

    Views of evaluation research and practice, and in particular about what they ought to be,
    vary widely. At one end of the spectrum, advocates of a highly structured (typically
quantitative) approach to evaluations tend to emphasize the use of research designs that ensure
    sufficient internal and statistical conclusion validity that the key causal relationships between
    the program and outcomes can be isolated and tested. According to this view, experimental
    designs are the benchmark of sound evaluation designs, and departures from this ideal are
    associated with problems that either require specifically designed (and usually complex)
    methodologies to resolve limitations, or are simply not resolvable—at least to a point where
    plausible threats to internal validity are controlled.
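
    To make the logic of that benchmark concrete, here is a minimal sketch, not drawn from the text, of how an analyst might estimate a program effect when participants have been randomly assigned to program and control groups. The group sizes, outcome distributions, and the use of a two-sample t-test are illustrative assumptions rather than a prescribed method.

```python
# Minimal illustrative sketch: random assignment plus a difference-in-means test.
# All numbers are hypothetical; random assignment is what supports treating the
# difference in group means as an estimate of the program effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

program = rng.normal(loc=72, scale=10, size=120)  # outcomes, randomly assigned program group
control = rng.normal(loc=68, scale=10, size=120)  # outcomes, randomly assigned control group

effect = program.mean() - control.mean()              # simple difference in means
t_stat, p_value = stats.ttest_ind(program, control)   # two-sample t-test (statistical conclusion validity)

print(f"Estimated program effect: {effect:.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```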

    In the United States, evaluation policy for several major federal departments under the
    Bush administration (2001–2009) emphasized the importance of experimental research designs
    as the “gold standard” for program evaluations. As well, the Office of Management and Budget
    (OMB) reflected this view as it promulgated the use of the Program Assessment Rating Tool
    (PART) process between 2002 and 2009. In its 2004 guidance, the OMB states the following
    under the heading “What Constitutes Strong Evidence of a Program’s Effectiveness?” (OMB,
    2004):

    The revised PART guidance this year underscores the need for agencies to think about the
    most appropriate type of evaluation to demonstrate the effectiveness of their programs. As
    such, the guidance points to the randomized controlled trial (RCT) as an example of the
    best type of evaluation to demonstrate actual program impact. Yet, RCTs are not suitable
    for every program and generally can be employed only under very specific circumstances.
    (p. 1)

    The No Child Left Behind Act (2002) had, as a central principle, the idea that a key
    criterion for the availability of federal funds for school projects should be that a reform

    has been found, through scientifically based research to significantly improve the
    academic achievement of students participating in such program as compared to students
    in schools who have not participated in such program; or … has been found to have strong
    evidence that such program will significantly improve the academic achievement of
    participating children. (Sec. 1606(a)11(A & B))

    In Canada, a major federal department (Human Resources and Skills Development Canada)
    that funds evaluations of social service programs has specified in guidelines that randomized
    experiments are ideal for evaluations, but at minimum, evaluation designs must be based on
    quasi-experimental research designs that include comparison groups that permit before-and-
    after assessments of program effects between the program and the control groups (Human
    Resources Development Canada, 1998).
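
    As an illustration of the before-and-after, program-versus-comparison-group logic these guidelines call for, the following sketch computes a simple difference-in-differences estimate. The group means are invented for illustration; a real quasi-experimental analysis would also have to address the non-equivalence of the two groups.

```python
# Minimal sketch with hypothetical group means: a before-and-after comparison
# between a program group and a non-equivalent comparison group. The
# difference-in-differences estimate nets out change common to both groups.
program_before, program_after = 54.0, 63.0        # mean outcome, program group
comparison_before, comparison_after = 55.0, 58.0  # mean outcome, comparison group

program_change = program_after - program_before            # 9.0
comparison_change = comparison_after - comparison_before   # 3.0
did_estimate = program_change - comparison_change          # 6.0

print(f"Difference-in-differences estimate of the program effect: {did_estimate:.1f}")
```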

    Problems With Experimentation as a Criterion for Good Methodologies

    In the United States, privileging experimental and quasi-experimental designs for
    evaluations in the federal government has had a significant impact on the evaluation
    community. Although the “paradigm wars” were thought to have ended or at least been set
    aside in the 1990s (Patton, 1997), they were resurrected as U.S. government policies
    emphasizing the importance of “scientifically based research” were implemented. The merits
    of randomized controlled trials (RCTs) as the benchmark for high-quality evaluation designs
    have again been debated in conferences, evaluation journals, and Internet listserv discussions.

    Continued disagreements among evaluators about the best or most appropriate ways of
    assessing program effectiveness will affect the likelihood that evaluation will emerge as a
profession. Historically, advocates for experimental approaches have argued that their position
    is superior in part because sound, internally valid research designs obviate the need for the
    evaluator to “fill in the blanks” with information that is not gleaned
    directly from the (usually) quantitative comparisons built into the designs. The results of a
    good experimental design are said to be more valid and credible, and therefore more defensible
    as a basis for supporting decisions about a program or policy. Experimentation has also been
    linked to fostering learning cultures where new public policies are assessed incrementally and
    rationally. Donald Campbell (1991) was among the first to advocate for an “experimenting
    society.”

    Although experimental evaluations continue to be important (Ford, Gyarmati, Foley,
    Tattrie, & Jimenez, 2003; Gustafson, 2003) and are central to both the Cochrane Collaboration
    (Higgins & Green, 2011) in health-related fields and the Campbell Collaboration (2010) in
    social program fields as an essential basis for supporting the systematic reviews that are the
    mainstay of these collaborations, the view that experiments are the “gold standard” does not
    dominate the whole evaluation field. The experiences with large-scale evaluations of social
    programs in the 1970s, when enthusiasm for experimental research designs was at its highest,
    suggested that implementing large-scale RCTs was problematical (Pawson & Tilley, 1997).

    Social experiments tended to be complex and were often controversial as evaluations. The
    Kansas City Preventive Patrol Experiment (Kelling, 1974a, 1974b) was an example of a major
    evaluation that relied on an experimental design that was intended to resolve a key policy
    question: whether the level of routine preventive patrol (assigned randomly to samples of
    police patrol beats in Kansas City, Missouri) made any differences to the actual and perceived
    levels of crime and safety in the selected patrol districts of Kansas City. Because the patrol
    levels were kept “secret” to conceal them from the citizens (and presumably potential law
    breakers), the experimental results encountered a substantial external validity problem—even
    if the findings supported the hypothesis that the level of routine preventive patrol had no
    significant impacts on perceived levels of safety and crime, or on actual levels of crime, how
    could any other police department announce that it was going to reduce preventive patrol
without jeopardizing citizens’ (and politicians’) confidence? Even in the experiment itself, there
    was evidence that the police officers who responded to calls for service in the reduced patrol
    beats did so with more visibility (lights and sirens) than normal—suggesting that they wanted
    to establish their visibility in the low-patrol beats (Kelling, 1974a; Larson, 1982). In other
    words, there was a construct validity threat that was not adequately controlled—compensatory
    rivalry was operating, whereby the patrol officers in the low-patrol beats acted to “beef up” the
    perceived level of patrol in their beats.

    The Importance of Causality: The Core of the Evaluation Enterprise

    Picciotto (2011) points to the centrality of program effectiveness as a core issue for
    evaluation as a discipline/profession:

    What distinguishes evaluation from neighboring disciplines is its unique role in bridging
    social science theory and policy practice. By focusing on whether a policy, a programme
    or project is working or not (and unearthing the reasons why by attributing outcomes)
    evaluation acts as a transmission belt between the academy and the policy-making. (p.
    175)

    Advocates for experimental research designs point out that since most evaluations include,
    as a core issue, whether the program was effective, experimental designs are the best and least
    ambiguous way to answer these causal questions.

    Michael Scriven (2008) has taken an active role, since the changes in U.S. policies have
    privileged RCTs, in challenging the primacy of experimental designs as the methodological
    backbone of evaluations; in the introduction to a paper published in 2008, he asserts, “The
    causal wars are still raging, and the amount of collateral damage is increasing” (p. 11). In a
    series of publications (Cook, Scriven, Coryn, & Evergreen, 2010; Scriven, 2008), he has
    argued that it is possible to generate valid causal knowledge in many other ways, and argues
    for a pluralism of methods that are situationally appropriate in the evaluation of programs. In
    one of his presentations, a key point he makes is that human beings are “hardwired” to look for
    causal relationships in the world around them. In an evolutionary sense, we have a built-in
    capacity to observe causal connections. In Scriven’s (2004) words,

    Our experience of the world and our part in it, is not only well understood by us but
    pervasively, essentially, perpetually a causal experience. A thousand times a day we
    observe causation, directly and validly, accurately and sometimes precisely. We see
    people riding bicycles, driving trucks, carrying boxes up stairs, turning the pages of
    books, picking goods off shelves, calling names, and so on. So, the basic kind of causal
    data, vast quantities of highly reliable and checkable causal data, comes from observation,
    not from elaborate experiments. Experiments, especially RCTs, are a marvelously
    ingenious extension of our observational skills, enabling us to infer to causal conclusions
    where observation alone cannot take us. But it is surely to reverse reality to suppose that
    they are the primary or only source of reliable causal claims: they are, rather, the realm of
    flight for such claims, where the causal claims of our everyday lives are the ground traffic
    of them. (pp. 6–7)

    Alternative Perspectives on the Evaluation Enterprise

    At the other end of the spectrum are approaches that eschew positivistic or post-positivistic
    approaches to evaluation and advocate methodologies that are rooted in anthropology or
    subfields of sociology or other disciplines. Advocates of these interpretivist (generally
    qualitative) approaches have pointed out that the positivist view of evaluation is itself based on
    a set of beliefs about observing and measuring patterns of human interactions. We introduced
    different approaches to qualitative evaluation in Chapter 5 and pointed out that in the 1980s a
    different version of “paradigm wars” happened in the evaluation field—that “war” was
    between the so-called quals and the quants.

    Quantitative methods cannot claim to eliminate the need for evaluators to use professional
    judgment. Smith (1994) argues that quantitative methods necessarily involve judgment calls:

    Decisions about what to examine, which questions to explore, which indicators to choose,
    which participants and stakeholders to tap, how to respond to unanticipated problems in
    the field, which contrary data to report, and what to do with marginally significant
    statistical results are judgment calls. As such they are value-laden and hence subjective.…
    Overall the degree of bias that one can control through random assignment or blinded
    assessment is a minute speck in the cosmos of bias. (pp. 38–39)

    Moreover, advocates of qualitative approaches argue that quantitative methods miss the
    meaning of much of human behavior. Understanding intentions is critical to getting at what the
    “data” really mean, and the only way to do that is to embrace methods that treat individuals as
    unique sense-makers who need to be understood on their own terms (Schwandt, 2000).

    Kundin (2010) advocates the use of qualitative, naturalistic approaches to understand how
    evaluators use their “knowledge, experience and judgment to make decisions in their everyday
    work” (p. 350). Her view is that the methodology-focused logic of inquiry that is often
    embedded in textbooks does not reflect actual practice:

    Although this logic is widely discussed in the evaluation literature, some believe it is
    rarely relied upon in practice. Instead researchers … suggest that evaluators use their
    intuition, judgment, and experience to understand the evaluand, and by doing so, they
    understand its merits through an integrated act of perceiving and valuing. (p. 352)

    More recently, evaluators who have focused on getting evaluations used have tended to
    take a pragmatic stance in their approaches, mixing qualitative and quantitative
    methodologies in ways that are intended to be situationally appropriate (Patton, 2008). They
    recognize the value of being able to use structured designs where they are feasible and
    appropriate, but also recognize the value of employing a wide range of complementary (mixed
    methods) approaches in a given situation to create information that is credible, and hence more
    likely to be used. We discussed mixed-methods designs in Chapter 5.

    Reconciling Evaluation Theory With the Diversity of Practice

    The practice of program evaluation is even more diverse than the range of normative
    approaches and perspectives that populate the textbook and coursework landscape.
    Experimental evaluations continue to be done and are still viewed by many practitioners as
    exemplars (Chen, Donaldson, & Mark, 2011; Henry & Mark, 2003). Substantial investments in
    time and resources are typically required, and this limits the number and scope of evaluations
    that are able to randomly assign units of analysis (usually people) to program and control
    conditions.

Conducting experimental evaluations entails creating a structure that may produce
    statistical conclusion and internal validity yet fail to inform decisions about implementing
    the program or policy in non-experimental settings (the Kansas City Preventive Patrol
    Experiment is an example of that) (Kelling, 1974a; Larson, 1982). Typically, experiments are
    time limited, and as a result participants can adjust their behaviors to their expectations of how
    long the experiment will last, as well as to what their incentives are as it is occurring. Cronbach
    (1982) was eloquent in his criticisms of the (then) emphasis on internal validity as the central
    criterion for judging the quality of research designs for evaluations of policies and programs.
    He argued that the uniqueness of experimental settings undermines the extent to which well-
    constructed (internally valid) experiments can be generalized to other units of analysis,
    treatments, observing operations, and settings (UTOS). Cronbach argued for the primacy of
    external validity of evaluations to make them more relevant to policy settings. Shadish, Cook,
    and Campbell’s (2002) book can be seen in part as an effort to address Cronbach’s criticisms
    of the original Cook and Campbell (1979) book on experimental and quasi-experimental
    research designs.

    The existence of controversies over the construction, execution, and interpretation of many
    of the large-scale social experiments that were conducted in the 1970s to evaluate programs
and policies suggests that very few methodologies are unassailable—even experimental
    research designs (Basilevsky & Hum, 1984). The craft of evaluation research, even research
    that is based on randomly assigned treatment and control conditions, is such that its
    practitioners do not agree about what exemplary practice is, even in a given situation.

    In the practice of evaluation, it is rare to have the resources and the control over the
    program setting needed to conduct even a quasi-experimental evaluation. Instead, typical
    evaluation settings are limited by significant resource constraints and the expectation that the
    evaluation process will somehow fit into the existing administrative process that has
    implemented a policy or a program. The widespread interest in performance measurement as
    an evaluation approach tends to be associated with an assumption that existing managerial and
information technology resources will be sufficient to implement performance measurement
    systems, and produce information for formative and summative evaluative purposes. We have
    pointed out the limitations of substituting performance measurement for program evaluation in
    Chapters 1 and 8 of this textbook.

    Working in the Swamp: The Real World of Evaluation Practice

    Typical program evaluation methodologies rely on multiple, independent data sources to
    “strengthen” research designs that are case study or implicit designs (diagrammed in Chapter 3
    as XO designs, where X is the program and O is the set of observations on the outcomes that
    are expected to be affected by the program). The program has been implemented at some time
    in the past, and now, the evaluator is expected to assess program effectiveness. There is no
    pretest and no control group; there are insufficient resources to construct these comparisons,
    and in most situations, comparison groups would not exist. Although multiple data sources
    permit triangulation of findings, that does not change the fact that the basic research design is
    the same; it is simply repeated for each data source (which is a strength since measurement
    errors would likely be independent) but is still subject to all the weaknesses of that design. In
    sum, typical program evaluations are conducted after the program is implemented, in settings
    where the evaluation team has to rely on evidence about the program group alone (i.e., there is
    no control group). In most evaluation settings, these designs rely on mixed qualitative and
    quantitative lines of evidence.
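
    A small simulation can illustrate what triangulation across independent lines of evidence does, and does not, buy in an XO design. Everything here is hypothetical: the point is only that averaging independent, noisy measurements of the same outcome reduces measurement error while leaving the single-group design, and its internal validity threats, unchanged.

```python
# Minimal sketch (hypothetical values): three independent lines of evidence in an
# X-O design, each measuring the same post-program outcome with its own error.
import numpy as np

rng = np.random.default_rng(7)
true_outcome = 10.0

# e.g., a client survey, a file review, and key-informant interviews
lines_of_evidence = true_outcome + rng.normal(0.0, 2.0, size=3)

triangulated = lines_of_evidence.mean()
print("Individual estimates:", np.round(lines_of_evidence, 2))
print(f"Triangulated estimate: {triangulated:.2f}")
# Triangulation tightens the measurement, but no counterfactual has been added:
# the design is still X-O, so causal claims still rest on professional judgment.
```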

    In such situations, some evaluators would advocate not using the evaluation results to make
    any causal inferences about the program. In other words, it is argued that such evaluations
    ought not to be used to try to address the question: “Did the program make a difference, and if
    so, what difference(s) did it make?” Instead the evaluation should be limited to addressing the
    question of whether intended outcomes were actually achieved, regardless of whether the
    program “produced” those outcomes. That is essentially what performance measurement
    systems do.

But many evaluations are commissioned because stakeholders need to know whether the program
    worked and why. Even formative evaluations often include questions about the effectiveness of
    the program (Cronbach, 1980; Weiss, 1998). Answering the “why” question entails looking at
    causes and effects.

    In situations where a client wants to know if and why the program was effective, and there
    is clearly insufficient time, money, and control to construct an evaluation design that meets
    criteria that are textbook-appropriate for answering those questions using an experimental
design, evaluators have a choice. They can advise their client that answering whether the
    program or policy worked—and why—is perhaps not feasible, or they can proceed,
    understanding that their work may not be as defensible as some textbooks would advocate.

    Usually, some variation of the work proceeds. Although comparisons between program and
    no-program groups are not possible, comparisons among program recipients, comparisons over
    time for program recipients who have participated in the program, and comparisons among the
    perspectives of other stakeholders are all possible. We maintain that the way to answer causal
    questions without research designs that can categorically rule out rival hypotheses is to

    acknowledge that in addressing issues like program effectiveness (which we take to be the
    central question in most evaluations), we cannot offer definitive findings or conclusions.
    Instead, our findings, conclusions, and our recommendations, supported by the evidence at
    hand and by our professional judgment, will reduce the uncertainty associated with the
    question.

    In this textbook we have argued that in all evaluations, regardless of how sophisticated they
    are, evaluators use one form or another of professional judgment. The difference between the
    most sophisticated experimentally designed evaluation and an evaluation based on a case
    study/implicit design is the amount and the kinds of professional judgments that are
    entailed—not that the former is appropriate for assessing program effectiveness and the latter
    is not. Unlike some who have commented on the role of professional judgment in program
    evaluations and see making judgments as a particular phase in the evaluation process (Skolits,
    Morrow, & Burr, 2009), we see professional judgment being exercised throughout the entire
    evaluation process.

    Where a research design is (necessarily) weak, we introduce to a greater extent our own
    experience and our own assessments, which in turn are conditioned by our values, beliefs, and
    expectations. These become part of the basis on which we interpret the evidence at hand and
    are also a part of the conclusions and the recommendations. This professional judgment
    component in every evaluation means that we should be aware of what it is, and learn how to
    cultivate sound professional judgment.

    Common Ground Between Program Evaluators and Program Managers

    The view that all evaluations incorporate professional judgments to a greater or lesser
    extent means that evaluators have a lot in common with program managers. Managers often
    conduct assessments of the consequences of their decisions—informal evaluations, if you will.
These assessments are not usually based on a systematic gathering of information, but instead
    often rely on a manager’s own observations, values, beliefs, expectations, and
    experiences—their professional judgment. That these assessments are done “on the fly” and
    are often based on information that is gathered using research designs that do not warrant
    causal conclusions does not vitiate their being the basis for good management practice. Good
    managers become skilled at being able to recognize patterns in the complexity of their
    environments. Inferences from observed or sensed patterns (Mark, Henry, & Julnes, 2000) to
    causal linkages are informed by their experience and judgment, are tested by observing and
    often participating in the consequences of a decision, and in turn add to the fund of knowledge
    and experience that contributes to their capacity to make sound professional judgments.

    Situating Professional Judgment in Program Evaluation Practice

    Scriven (1994) emphasizes the centrality of judgment (judgments of merit and worth) in
    the synthesis of evaluation findings/lines of evidence to render an overall assessment of a
    program. For Scriven, the process of building toward and then rendering a holistic evaluation
    judgment is a central task for evaluators, a view reflected by others (Skolits et al., 2009).
    Judgments can be improved by constructing rules or decision processes that make explicit how
    evidence will be assessed and weighted. Scriven (1994) suggests that, generally, judgments
    supported by decision criteria are superior to those that are intuitive.

    Although evaluations typically use several different lines of evidence to assess a program’s
    effectiveness and, in the process, have different measures of effectiveness, methodologies such

    as cost–utility analysis exist for weighting and amalgamating findings that combine multiple
    measures of program effectiveness (Levin & McEwan, 2001). However, they are data
    intensive, and apart from the health sector, they are not widely used. The more typical situation
    is described by House and Howe (1999). They point out that the findings from various lines of
    evidence in an evaluation may well contain conflicting information about the worth of a
    program. In this situation, evaluators use their professional judgment to produce an overall
    conclusion. The process of rendering such a judgment engages the evaluator’s own knowledge,
    values, beliefs, and expectations. House and Howe (1999) describe this process:

    The evaluator is able to take relevant multiple criteria and interests and combine them into
    all-things-considered judgments in which everything is consolidated and related.… Like a
    referee in a ball game, the evaluator must follow certain sets of rules, procedures, and
    considerations—not just anything goes. Although judgment is involved, it is judgment
    exercised within the constraints of the setting and accepted practice. Two different
    evaluators might make different determinations, as might two referees, but acceptable
    interpretations are limited. In the sense that there is room for the evaluator to employ
    judgment, the deliberative process is individual. In the sense that the situation is
    constrained, the judgment is professional. (p. 29)
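
    Returning to the point above about weighting and amalgamating multiple measures of effectiveness, the following sketch shows one simple way an evaluation team might make its weighting explicit before synthesizing lines of evidence. The criteria, scores, and weights are invented for illustration and this is not a cost–utility analysis in the formal sense.

```python
# Minimal sketch (hypothetical criteria, scores, and weights): an explicit,
# transparent weighting of several lines of evidence into one overall rating.
findings = {
    "client outcomes":          {"score": 70, "weight": 0.5},
    "implementation fidelity":  {"score": 85, "weight": 0.3},
    "stakeholder satisfaction": {"score": 60, "weight": 0.2},
}

assert abs(sum(f["weight"] for f in findings.values()) - 1.0) < 1e-9  # weights sum to 1

overall = sum(f["score"] * f["weight"] for f in findings.values())
print(f"Weighted overall assessment: {overall:.1f} / 100")  # 72.5
```

    Making the weights explicit does not remove judgment from the synthesis; it moves the judgment into the choice of criteria and weights, where it can be examined and debated.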

    There are also many situations where evaluators must make judgments in the absence of
    clear methodological constraints or rules to follow. House and Howe (1999) go on to point out
    that

    for evaluators, personal responsibility is a cost of doing business, just as it is for
    physicians, who must make dozens of clinical judgments each day and hope for the best.
    The rules and procedures of no profession are explicit enough to prevent this. (p. 30)

    Although House and Howe point out that evaluators must make judgments, the process by
    which judgments are made is nevertheless not well understood. Hurteau, Houle, and Mongiat
    (2009), in a meta-analysis of 50 evaluation studies, examined the ways that judgments are
    evidenced and found that in only 20 of those studies had the evaluator(s) made a judgment
based on the findings. In addition, in none of those 20 studies did the evaluators describe the
    process that they used to render the judgment(s).

    Program evaluators are currently engaged in debates around the issue of professionalizing
    evaluation. One element of that debate is whether and how our knowledge and our practice can
    be codified so that evaluation is viewed as a coherent body of knowledge and skills, and
practitioners are seen to have a consistent set of competencies (King, Stevahn, Ghere, &
    Minnema, 2001; Stevahn, King, Ghere, & Minnema, 2005b). This debate has focused in part
    on what is needed to be an effective evaluator—core competencies that provide a framework
    for assessing the adequacy of evaluation training as well as the adequacy of evaluator practice.

    In a study of 31 evaluation professionals in the United States, practitioners were asked to
    rate the importance of 49 evaluator competencies (King et al., 2001) and then try to come to a
    consensus about the ratings, given feedback on how their peers had rated each item. The 49
    items were grouped into four broad clusters of competencies: (1) systematic inquiry (most
    items were about methodological knowledge and skills), (2) competent evaluation practice
    (most items focused on organizational and project management skills), (3) general skills for
    evaluation practice (most items were on communication, teamwork, and negotiation skills),
    and (4) evaluation professionalism (most items focused on self-development and training,
    ethics and standards, and involvement in the evaluation profession).

    Among the 49 competencies, one was “making judgments” and referred to an overall
    evaluative judgment, as opposed to recommendations, at the end of an evaluation (King et al.,
    2001, p. 233). It was rated the second lowest on average among all the competencies. This
    finding suggests that judgment, comparatively, is not perceived to be that important (although
    the item average was still 74.68 out of 100 possible points). King et al. (2001) suggested that
    “some evaluators agreed with Michael Scriven that to evaluate is to judge; others did not” (p.
    245). The “reflects on practice” item, however, was given an average rating of 93.23—a
    ranking of 17 among the 49 items. Schön (1987) makes reflection on one’s practice the key
    element in being able to develop sound professional judgment. For both of these items, there
    was substantial variation among the practitioners about their ratings, with individual ratings
    ranging from 100 (highest possible score) to 20. The discrepancy between the low overall
    score for “making judgments” and the higher score for “reflects on practice” may be related to
    the difference between making a judgment, as an action, and reflecting on practice, as a
    personal quality.

    We see professional judgment being a part of the whole process of working with clients,
    framing evaluation questions, designing and conducting evaluation research, analyzing and
    interpreting the information, and communicating the findings, conclusions, and
    recommendations to stakeholders. If you go back to the outline of a program evaluation
    process offered in Chapter 1, or the outline of the design and implementation of a performance
    measurement system offered in Chapter 9, you will see professional judgment is a part of all
    the steps in both processes. Furthermore, we see different kinds of professional judgment being
    more or less important at different stages in evaluation processes. We will come back to the
    relationships between evaluation competencies and professional judgment later in this chapter.

    ACQUIRING KNOWLEDGE AND SKILLS FOR EVALUATION
    PRACTICE

    The idea that evaluation is a profession, or aspires to be a profession, is an important part of
    contemporary discussions of the scope and direction of the enterprise (Altschuld, 1999).
    Modarresi, Newman, and Abolafia (2001) quote Leonard Bickman (1997), who was president
    of the American Evaluation Association (AEA) in 1997, in asserting that “we need to move
    ahead with professionalizing evaluation or else we will just drift into oblivion” (p. 1). Bickman
    and others in the evaluation field were aware that other related professions continue to carve
    out territory, sometimes at the expense of evaluators. Picciotto (2011) points out, however, that
    “heated doctrinal disputes within the membership of the AEA have blocked progress [towards
    professionalization] in the USA” (p. 165).

    What does it mean to be a professional? What distinguishes a profession from other
    occupations? Eraut (1994) suggests that professions are characterized by the following: a core
    body of knowledge that is shared through the training and education of those in the profession;
    some kind of government-sanctioned license to practice; a code of ethics and standards of
    practice; and self-regulation (and sanctions for wrongdoings) through some kind of
    professional association to which members of the practice community must belong.

    Professional Knowledge as Applied Theory

    The core body of knowledge that is shared among members of a profession can be
    characterized as knowledge that is codified, publicly available (taught for and learned by

    aspiring members of the profession), and supported by validated theory (Eraut, 1994). One
    view of professional practice is that competent members of a profession apply this validated
    theoretical knowledge in their work. Competent practitioners are persons who have the
    credentials of the profession (including evidence that they have requisite knowledge) and have
    established a reputation for being able to translate theoretical knowledge into sound practice.

    Professional Knowledge as Practical Know-How

    An alternative view of professional knowledge is that it is the application of practical
    know-how to particular situations. The competent practitioner uses his or her experiential and
    intuitive knowledge to assess a situation and offer a diagnosis (in the health field) or a decision
    in other professions (Eraut, 1994). Although theoretical knowledge is a part of what competent
    practitioners rely on in their work, practice is seen as more than applying theoretical
    knowledge. It includes a substantial component that is learned through practice itself. Although
    some of this knowledge can be codified and shared (Schön, 1987; Tripp, 1993), part of it is
    tacit, that is, known to individual practitioners, but not shareable in the same ways that we
    share the knowledge in textbooks, lectures, or other publicly accessible learning and teaching
    modalities (Schwandt, 2008).

    Polanyi (1958) described tacit knowledge as the capacity we have as human beings to
    integrate “facts” (data and perceptions) into patterns. He defined tacit knowledge in terms of
    the process of discovering theory: “This act of integration, which we can identify both in the
    visual perception of objects and in the discovery of scientific theories, is the tacit power we
    have been looking for. I shall call it tacit knowing” (Polanyi & Grene, 1969, p. 140).

    For Polanyi, tacit knowledge cannot be communicated directly. It has to be learned through
    one’s own experiences—it is by definition personal knowledge. Knowing how to ride a
    bicycle, for example, is in part tacit. We can describe to others how the physics and the
mechanics of getting onto a bicycle and riding it work, but the experience of getting onto the
    bicycle, pedaling, and getting it to stay up is quite different from being told how to do so.

    One implication of acknowledging that what we know is in part personal is that we cannot
    teach everything that is needed to learn a skill. The learner can be guided with textbooks, good
    examples, and even demonstrations, but that knowledge (Polanyi calls it impersonal
    knowledge) must be combined with the learner’s own capacity to tacitly know—to experience
    the realization (or a series of them) that he or she understands how to use the skill.

    Clearly, from this point of view, practice is an essential part of learning. One’s own
    experience is essential for fully integrating impersonal knowledge into working knowledge.
    But because the skill that has been learned is in part tacit, when the learner tries to
    communicate it, he or she will discover that, at some point, the best advice is to suggest that
    the new learner try it and “learn by doing.” This is a key part of craftsmanship.

    Balancing Theoretical and Practical Knowledge in Professional Practice

    The difference between the applied theory and the practical know-how views of
    professional knowledge (Fish & Coles, 1998; Schwandt, 2008) has been characterized as the
    difference between knowing that (publicly accessible, propositional knowledge and skills) and
    knowing how (practical, intuitive, experientially grounded knowledge that involves wisdom, or
    what Aristotle called praxis) (Eraut, 1994).

    These two views of professional knowledge highlight different views of what professional
    practice is and indeed ought to be. The first view can be illustrated with an example. In the

    field of medicine, the technical/rational view of professional knowledge and professional
    practice continues to support efforts to construct and use expert systems—software systems
    that can offer a diagnosis based on a logic model that links combinations of symptoms in a
    probabilistic tree to possible diagnoses (Fish & Coles, 1998). By inputting the symptoms that
    are either observed or reported by the patient, the expert system (embodying the public
    knowledge that is presumably available to competent practitioners) can treat the diagnosis as a
    problem to solve. Clinical decision making employs algorithms that produce a probabilistic
    assessment of the likelihood that symptom, drug, and other technical information will support
one or another alternative diagnosis.
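
    As a toy illustration of the expert-system idea described above, the sketch below scores two invented diagnoses against a set of observed symptoms using made-up prior and conditional probabilities (a naive Bayes-style calculation). It is a caricature for teaching purposes, not a clinical tool and not the actual systems referred to in the text.

```python
# Minimal sketch (all probabilities invented): mapping a combination of symptoms
# to a probabilistic ranking of diagnoses, in the spirit of an expert system.
priors = {"flu": 0.05, "common_cold": 0.20}   # P(diagnosis)
likelihoods = {                               # P(symptom present | diagnosis)
    "flu":         {"fever": 0.90, "cough": 0.80, "fatigue": 0.85},
    "common_cold": {"fever": 0.10, "cough": 0.70, "fatigue": 0.40},
}

def rank_diagnoses(observed_symptoms):
    scores = {}
    for diagnosis, prior in priors.items():
        score = prior
        for symptom in observed_symptoms:
            score *= likelihoods[diagnosis].get(symptom, 0.01)
        scores[diagnosis] = score
    total = sum(scores.values())
    return {d: round(s / total, 3) for d, s in scores.items()}  # normalized probabilities

print(rank_diagnoses(["fever", "cough"]))  # e.g., {'flu': 0.72, 'common_cold': 0.28}
```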

    The second view of professional knowledge as practical know-how embraces a view of
    professional practice as craftsmanship and artistry. Although it acknowledges the importance
    of experience in becoming a competent practitioner, it also complicates our efforts to
    understand the nature of professional practice. If practitioners know things that they cannot
    share and their knowledge is an essential part of sound practice, how do professions find ways
    of ensuring that their members are competent?

    Schwandt (2008) recognizes the importance of balancing applied theory and practical
    knowledge in evaluation. His concern is with the tendency, particularly in performance
    management systems where practice is circumscribed by a focus on outputs and outcomes, to
    force “good practice” to conform to some set of performance measures and performance
    results:

    The fundamental distinction between instrumental reason as the hallmark of technical
    knowledge and judgment as the defining characteristic of practical knowledge is
    instinctively recognizable to many practitioners … Yet the idea that “good” practice
    depends in a significant way on the experiential, existential, knowledge we speak of as
    perceptivity, insightfulness, and deliberative judgment is always in danger of being
    overrun by (or at least regarded as inferior to) an ideal of “good” practice grounded in
    notions of objectivity, control, predictability, generalizability beyond specific
    circumstances, and unambiguous criteria for establishing accountability and success. This
    danger seems to be particularly acute of late, as notions of auditable performance, output
    measurement, and quality assurance have come to dominate the ways in which human
    services are defined and evaluated. (p. 37)

    The idea of balance is further explored in the section below, where we discuss various
    aspects of professional judgment.

    UNDERSTANDING PROFESSIONAL JUDGMENT

    What are the different kinds of professional judgment? How does professional judgment
    impact the range of decisions that evaluators make? Can we construct a model of how
    professional judgment affects evaluation-related decisions?

    Fish and Coles (1998) have constructed a typology of four kinds of professional judgment
    in the health care field. We believe that these can be generalized to the evaluation field. Each
    builds on the previous one; the extent and kinds of judgment differ across the four kinds. At
    one end of the continuum, practitioners apply technical judgments that are about specific
    issues involving routine tasks. Typical questions include the following: What do I do now?
    How do I apply my existing knowledge and skills to do this routine task? In an evaluation, an

    example of this kind of judgment would be how to select a random sample from a population
    of case files in a social service agency.
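
    A technical judgment of this kind is mostly about executing a routine procedure correctly. As a minimal sketch, with hypothetical file identifiers and sample size, drawing such a random sample might look like this:

```python
# Minimal sketch (hypothetical IDs and sample size): a simple random sample of
# case files from an agency's caseload, with a fixed seed so it can be reproduced.
import random

random.seed(2024)
case_file_ids = [f"CF-{i:05d}" for i in range(1, 1201)]  # 1,200 case files on record
sample = random.sample(case_file_ids, k=100)             # simple random sample of 100 files

print(sample[:5])
```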

    The next level is procedural judgment, which focuses on procedural questions and
    involves the practitioner comparing the skills/tools that he or she has available to accomplish a
    task. Practitioners ask questions like “What are my choices to do this task?” “From among the
    tools/knowledge/skills available to me, which combination works best for this task?” An
    example from an evaluation would be deciding how to contact clients in a social service
    agency—whether to use a survey (and if so, mailing, telephone, interview format, or some
    combination) or use focus groups (and if so, how many, where, how many participants in each,
    how to gather them).

    The third level of professional judgment is reflective. It again assumes that the task or the
    problem is a given, but now the practitioner is asking the following questions: How do I tackle
    this task? Given what I know, what are the ways that I could proceed? Are the tools that are
    easily within reach adequate, or instead, should I be trying some new combination or perhaps
    developing some new ways of dealing with this task or problem? A defining characteristic of
    this third level of professional judgment is that the practitioner is reflecting on his or her
    practice and seeking ways to enhance his or her practical knowledge and skills and perhaps
    innovate to address a given situation.

    An example from a needs assessment for child sexual abuse prevention programs in an
    urban school district serves to illustrate reflective judgment on the part of the evaluator in
    deciding on the research methodology. Classes from an elementary school are invited to attend
    a play acted by school children of the same ages as the audience. The play is called “No More
    Secrets” and is about an adult–child relationship that involves touching and other activities. At
    one point in the play, the “adult” tells the “child” that their touching games will be their secret.
    The play is introduced by a professional counselor, and after the play, children are invited to
    write questions on cards that the counselor will answer. The children are told that if they have
    questions about their own relationships with adults, these questions will be answered
    confidentially by the counselor. The evaluator, having obtained written permissions from the
    parents, contacts the counselor, who, without revealing the identities of any of the children,
    indicates to the evaluator the number of potentially abusive situations among the students who
    attended the play. Knowing the proportion of the school district's students that attended the play,
    the evaluator is able to roughly estimate the incidence of potentially abusive situations in that
    school-age population.
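    As a rough illustration of the extrapolation involved, consider the following sketch, which assumes (hypothetically) that the attending classes are reasonably representative of the district's school-age population; all numbers are invented for illustration:

        # Hypothetical figures supplied by the counselor and the school district
        flagged_questions = 14      # potentially abusive situations identified after the play
        students_attending = 350    # students who attended the play
        district_population = 4200  # school-age students in the district

        # Rate among attendees, extrapolated to the district as a whole
        rate = flagged_questions / students_attending
        estimated_cases = rate * district_population
        print(round(estimated_cases))  # prints 168, a rough district-wide estimate

    The reflective judgment lies less in the arithmetic than in deciding whether the assumptions behind it (representativeness, the candor of the children, confidentiality constraints) make the estimate defensible.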

    The fourth level of professional judgment is deliberative—it explicitly involves a
    practitioner’s own values. Here the practitioner is asking the following question: What ought I
    to be doing in this situation? No longer are the ends or the tasks fixed, but instead the
    professional is taking a broad view that includes the possibility that the task or problem may or
    may not be an appropriate one to pursue. Professionals at this level are asking questions about
    the nature of their practice and connecting what they do as professionals with their broader
    values and moral standards. We discuss evaluation ethics later in this chapter. The case study
    in Appendix A of this chapter is an example of a situation that would involve deliberative
    judgment.

    A Modeling of the Professional Judgment Process

    Since professional judgment spans the evaluation process, it will influence a wide range of
    decisions that evaluators make in their practice. The four types of professional judgment that
    Fish and Coles (1998) describe suggest decisions of increasing complexity from discrete
    technical decisions to global decisions that can affect an evaluator's present and future roles as an evaluation practitioner. Figure 12.1 displays a model of the way that professional judgment
    is involved in evaluator decision making. The model focuses on a single decision—a typical
    evaluation would involve many such decisions of varying complexity. In the model, evaluator
    values, beliefs, and expectations, together with both shareable and practical (tacit) knowledge, combine to create a fund of experience that is tapped for professional judgments. In turn,
    professional judgments influence the decision at hand.

    We will present the model and then discuss it, elaborating on the meanings of the key
    constructs in the model.

    Evaluator decisions have consequences. They may be small—choosing a particular alpha
    (α) level for tests of statistical significance will have an impact on which findings are
    noteworthy, given a criterion that significant findings are worth reporting; or they may be
    large—deciding not to conduct an evaluation in a situation where the desired findings are being
    specified in advance by a key stakeholder could affect the career of an evaluation practitioner.
    These consequences feed back to our knowledge (both our shareable and our practical know-
    how), values, beliefs, and expectations. Evaluators have an opportunity to learn from each
    decision, and one of our challenges as professionals is to increase the likelihood that we take
    advantage of such learning opportunities. We will discuss reflective practice later in this
    chapter.
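    A small sketch can illustrate the first kind of consequence mentioned above: how the choice of an alpha level changes which findings are flagged as noteworthy. The p-values and outcome names here are hypothetical:

        # Hypothetical p-values from a set of outcome comparisons in an evaluation
        p_values = {"employment": 0.03, "earnings": 0.008, "well_being": 0.07}

        for alpha in (0.05, 0.01):
            flagged = [name for name, p in p_values.items() if p < alpha]
            print(f"alpha = {alpha}: reportable findings -> {flagged}")

        # At alpha = 0.05 two findings clear the bar; at alpha = 0.01 only one does,
        # so the same data yield a different story depending on the evaluator's choice.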

    Figure 12.1 The Professional Judgment Process

    The model can be unpacked by discussing key constructs in it. Some constructs have been
    elaborated in this chapter already (shareable knowledge, practical know-how, and professional
    judgment), but it is worthwhile to define each one explicitly in one table. Table 12.1
    summarizes the constructs in Figure 12.1 and offers a short definition of each. Several of the
    constructs will then be discussed further to help us understand what roles they play in the
    process of forming and applying professional judgment.

    Table 12.1 Definitions of Constructs in the Model of the Professional Judgment Process

    Values: Values are statements about what is desirable, what ought to be, in a given situation.

    Beliefs: Beliefs are about what we take to be true—our assumptions about how we know what we know (our epistemologies are examples of beliefs).

    Expectations: Expectations are assumptions that are typically based on what we have learned and what we have come to accept as normal. Expectations can limit what we are able to "see" in particular situations.

    Shareable knowledge: Knowledge that is typically found in textbooks or other such media; knowledge that forms the core of the formal training and education of professionals in a field.

    Practical know-how: Practical know-how is the knowledge that is gained through practice. It complements shareable knowledge and can be tacit—that is, acquired from one's professional practice and not shareable.

    Experience: Experience is an amalgam of our knowledge, values, beliefs, expectations, and practical know-how. For a given decision, we have a "fund" of experience that we can draw from. We can augment that fund with learning, and from the consequences of the decisions we make as professionals.

    Professional judgment: Professional judgment is a process that relies on our experience and ranges from technical judgments to deliberative judgments.

    Decision: In a typical evaluation, evaluators make hundreds of decisions that collectively define the evaluation process. Decisions are choices—a choice made by an evaluator about everything from discrete methodological issues to global values-based decisions that affect the whole evaluation (and perhaps future evaluations).

    Consequences: Each decision has consequences—for the evaluator and for the evaluation process. Consequences can range from discrete to global, commensurate with the scope and implications of the decision.

    Decision environment: The decision environment is the set of factors that influences the decision-making process, including the stock of knowledge that is available to the evaluator. Among the factors that could impact an evaluator decision are professional standards, resources (including time and data), incentives (perceived consequences that induce a particular pattern of behavior), and constraints (legal, institutional, and regulatory requirements that specify the ways that evaluator decisions must fit a decision environment).

    The Decision Environment

    The particular situation or problem at hand influences how a program evaluator’s
    professional judgment will be exercised. Each opportunity for professional judgment will have
    unique characteristics that will demand that it be approached in particular ways. For example, a
    methodological issue will require a different kind of professional judgment from one that
    centers on an ethical issue. Even two cases involving a similar question of methodological
    choice will have facts about each of them that will influence the professional judgment
    process. We would agree with evaluators who argue that methodologies need to be
    situationally appropriate, avoiding a one-size-fits-all approach (Patton, 2008). The extent to which the relevant information about a particular situation is known or understood by the evaluator will affect the professional judgment process.

    The decision environment includes constraints and incentives and costs and benefits, both
    real and perceived, that affect professional judgment. Some examples include the expectations
    of the client, the professional’s lines of accountability, tight deadlines, complex and conflicting
    objectives, and financial constraints. For people working within an organization—for example,
    internal evaluators—the organization also presents a significant set of environmental factors, in
    that its particular culture, goals, and objectives may have an impact on the way the professional
    judgment process operates.

    Relevant professional principles and standards such as the AEA’s (2004) “Guiding
    Principles for Evaluators” also form part of the judgment environment because, to some extent,
    they interact with and condition the free exercise of judgment by professionals and replace
    individual judgment with collective judgment (Gibbins & Mason, 1988). We will come back to
    evaluation standards later in this chapter.

    Values, Beliefs, and Expectations

    Professional judgment is influenced by personal characteristics of the person exercising it.
    It must always be kept in mind that “judgment is a human process, with logical, psychological,
    social, legal, and even political overtones” (Gibbins & Mason, 1988, p. 18). Each of us has a
    unique combination of values, beliefs, and expectations that make us who we are, and each of
    us has internalized a set of professional norms that make us the kind of practitioner that we are.
    These personal factors can lead two professionals to make quite different professional
    judgments about the same situation (Tripp, 1993).

    Among the personal characteristics that can influence one’s professional judgment,
    expectations are among the most important. Expectations have been linked to paradigms: perceptual and theoretical structures that function as frameworks for organizing one's perspectives, even one's beliefs about what is real and what is taken to be factual. Kuhn (1962)
    has suggested that paradigms are formed through our education and training. Eraut (1994) has
    suggested that the process of learning to become a professional is akin to absorbing an
    ideology.

    Our past experiences (including the consequences of previous decisions we have made in
    our practice) predispose us to understand or even expect some things and not others, to
    interpret situations, and consequently to behave in certain ways rather than in others. As
    Abercrombie (1960) argues, “We never come to an act of perception with an entirely blank
    mind, but are always in a state of preparedness or expectancy, because of our past experiences”
    (p. 53). Thus, when we are confronted with a new situation, we perceive and interpret it in
    whatever way makes it most consistent with our existing understanding of the world, with our
    existing paradigms. For the most part, we perform this act unconsciously. We are not aware of
    how our particular worldview influences how we interpret and judge the information we
    receive on a daily basis in the course of our work or how it affects our subsequent behavior.

    How does this relate to our professional judgment? Our expectations lead us to see things
    we are expecting to see, even if they are not actually there, and to not see things we are not
    expecting, even if they are there. Abercrombie (1960) calls our worldview our “schemata” and
    illustrates its power over our judgment process with the following figure (Figure 12.2).

    Figure 12.2 The Three Triangles

    In most cases, when we first read the phrases contained in the triangles, we do not see the
    extra words. As Abercrombie (1960) points out, “it’s as though the phrase ‘Paris in the Spring,’
    if seen often enough, leaves a kind of imprint on the mind’s eye, into which the phrase in the
    triangle must be made to fit” (p. 35). She argues that “if [one’s] schemata are not sufficiently
    ‘living and flexible,’ they hinder instead of help [one] to see” (p. 29). Our tendency is to ignore
    or reject what does not fit our expectations. Thus, similar to the way we assume the phrases in
    the triangles make sense and therefore unconsciously ignore the extra words, our professional
    judgments are based in part on our preconceptions and thus may not be appropriate for the
    situation.

    Expectations can also contribute to improving our judgment by allowing us to
    unconsciously know how best to act in a situation. When the consequences of such a decision
    are judged to be salutary, our expectations are reinforced.

    Acquiring Professional Knowledge

    Our professional training and education are key influences; they affect professional judgment in positive ways, allowing us to understand and address problems in a manner that those without the same education could not, but they also predispose us to interpret situations in particular ways. Indeed, professional education is often one of the most
    pervasive reasons for our acceptance of “tried and true” ways of approaching problems in
    professional practice. As Katz (1988) observes, “Conformity and orthodoxy, playing the game
    according to the tenets of the group to which students wish to belong, are encouraged in … all
    professional education” (p. 552). Thus, somewhat contrary to what would appear to be
    common sense, professional judgment does not necessarily improve in proportion to increases
    in professional training and education. Similarly, professional judgment does not necessarily
    improve with increased professional experience, if such experience does not challenge but only
    reinforces already accepted ideologies. Ayton (1998) makes the point that even experts in a
    profession are not immune to poor professional judgment:

    One view of human judgment is that people—including experts—not only suffer various
    forms of myopia but are somewhat oblivious of the fact.… Experts appear to have very
    little insight into their own judgment.… This oblivion in turn might plausibly be
    responsible for further problems, e.g. overconfidence … attributed, at least in part, to a
    failure to recognize the fallibility of our own judgment. (pp. 238–239)

    On the other hand, Mowen (1993) notes that our experience, if used reflectively and
    analytically to inform our decisions, can be an extremely positive factor contributing to good
    professional judgment. Indeed, he goes so far as to argue that “one cannot become a peerless
    decision maker without that well-worn coat of experience … the bumps and bruises received
    from making decisions and seeing their outcomes, both good or bad, are the hallmark of
    peerless decision makers” (p. 243).

    IMPROVING PROFESSIONAL JUDGMENT IN EVALUATION
    THROUGH REFLECTIVE PRACTICE

    Having reviewed the ways that professional judgment is woven through the fabric of
    evaluation practice and having shown how professional judgment plays a part in our decisions
    as evaluation practitioners, we can turn to discussing ways of self-consciously improving our
    professional judgment. Key to this process is becoming aware of one’s own decision-making
    processes.

    Guidelines for the Practitioner

    Epstein (1999) suggests that a useful stance for professional practice is mindfulness.
    Krasner et al. (2009) define mindfulness this way:

    The term mindfulness refers to a quality of awareness that includes the ability to pay
    attention in a particular way: on purpose, in the present moment, and nonjudgmentally.
    Mindfulness includes the capacity for lowering one’s own reactivity to challenging
    experiences; the ability to notice, observe, and experience bodily sensations, thoughts, and
    feelings even though they may be unpleasant; acting with awareness and attention (not
    being on autopilot); and focusing on experience, not on the labels or judgments applied to
    them. (p. 1285)

    Epstein and others have developed programs to help medical practitioners become more
    mindful (Krasner et al., 2009). In a study involving 70 primary care practitioners in Rochester,
    New York, participants were trained through an 8-week combination of weekly sessions and an
    all-day session to become more self-aware. The training was accompanied by opportunities to
    write brief stories to reflect on their practice and to use appreciative inquiry to identify ways
    that they had been successful in working through challenging practice situations.

    The before-versus-after results suggested that for the doctors “increases in mindfulness
    correlated with reductions in burnout and total mood disturbance. The intervention was also
    associated with increased trait emotional stability (i.e. greater resilience)” (p. 1290).

    Mindfulness is the cultivation of a capacity to observe, in a nonjudgmental way, one’s own
    physical and mental processes during and after tasks. In other words, it is the capacity for self-
    reflection that facilitates bringing to consciousness our values, assumptions, expectations,
    beliefs, and even what is tacit in our practice. Epstein (1999) suggests, “Mindfulness informs
    all types of professionally relevant knowledge, including propositional facts, personal
    experiences, processes, and know-how each of which may be tacit or explicit” (p. 833).

    Although mindfulness can be linked to religious and philosophical traditions, it is a secular
    way of approaching professional practice that offers opportunities to continue to learn and
    improve (Epstein, 2003). A mindful practitioner is one who has cultivated the art of self-
    observation (cultivating the compassionate observer). Epstein characterizes mindful practice
    this way:

    When practicing mindfully, clinicians approach their everyday tasks with critical
    curiosity. They are present in the moment, seemingly undistracted, able to listen before
    expressing an opinion, and able to be calm even if they are doing several things at once.
    These qualities are considered by many to be prerequisite for compassionate care. (p. 2)

    The objective of mindfulness is to see what is, rather than what one wants to see or even
    expects to see. Mindful self-monitoring involves several things: “access to internal and
    external data; lowered reactivity to inner experiences such as thoughts and emotions; active
    and attentive observation of sensations, images, feelings, and thoughts; curiosity; adopting a
    nonjudgmental stance; presence, [that is] acting with awareness …; openness to possibility;
    adopting more than one perspective; [and] ability to describe one’s inner experience” (Epstein,
    Siegel, & Silberman, 2008, p. 10).

    Epstein (1999) suggests that there are at least three ways of nurturing mindfulness: (1)
    mentorships with practitioners who are themselves well regarded in the profession; (2)
    reviewing one’s own work, taking a nonjudgmental stance; and (3) meditation to cultivate a
    capacity to observe one’s self.

    In order to cultivate the capacity to make sound professional judgments, it is essential to become aware of the unconscious values and other personal factors that may be influencing one's professional judgment. Only by coming to realize how much our professional judgments are influenced by these personal factors can we become more self-aware and work toward extending our conscious control over them and their impact on our judgment. As Tripp
    (1993) argues, “Without knowing who we are and why we do things, we cannot develop
    professionally” (p. 54). By increasing our understanding of the way we make professional
    judgments, we improve our ability to reach deliberate, fully thought-out decisions rather than
    simply accepting as correct the first conclusion that intuitively comes to mind.

    But how can we, as individuals, learn what factors are influencing our own personal
    professional judgment? One way is to conduct a systematic questioning of professional
    practice (Fish & Coles, 1998). Professionals should consistently reflect on what they have done
    in the course of their work and then investigate the issues that arise from this review.
    Reflection should involve articulating and defining the underlying principles and rationale
    behind our professional actions and should focus on discovering the “intuitive knowing
    implicit in the action” (Schön, 1988, p. 69).

    Tripp (1993) suggests that this process of reflection can be accomplished by selecting and
    then analyzing critical incidents that have occurred during our professional practice in the past
    (critical incident analysis). A critical incident can be any incident that occurred in the course of our practice that sticks in our mind and hence provides an opportunity to learn. What makes it
    critical is the reflection and analysis that we bring to it. Through the process of critical incident
    analysis, we can gain an increasingly better understanding of the factors that have influenced
    our professional judgments. As Fish and Coles (1998) point out,

    Any professional practitioner setting out to offer and reflect upon an autobiographical
    incident from any aspect of professional practice is, we think, likely to come sooner or
    later to recognize in it the judgments he or she made and be brought to review them. (p.
    254)

    For it is only in retrospect, in analyzing our past decisions, that we can see the complexities
    underlying what at the time may have appeared to be a straightforward, intuitive professional
    judgment. “By uncovering our judgments … and reflecting upon them,” Fish and Coles (1998)
    maintain, “we believe that it is possible to develop our judgments because we understand more
    about them and about how we as individuals come to them” (p. 285).

    Jewiss and Clark-Keefe (2007) connect reflective practice for evaluators to developing cultural competence. The Guiding Principles for Evaluators (AEA, 2004) makes cultural competence, "seeking awareness of their own culturally-based assumptions, their understanding of the worldviews of culturally-different participants and stakeholders in the evaluation," one of the core competencies for evaluators. Jewiss and Clark-Keefe believe that to become more culturally competent, evaluators would benefit from taking a constructivist stance in reflecting on their own practice:

    Constructivism has indeed helped signal evaluators’ responsibility for looking out[ward]:
    for attending to and privileging program participants’ expressions as the lens through
    which to learn about, change, and represent programs. Just as important, constructivism
    conveys evaluators’ responsibilities for looking in[ward]: for working to develop and
    maintain a critically self-reflective stance to examine personal perspectives and to monitor
    bias. (p. 337)

    Self-consciously challenging the routines of our practice, the “high hard ground” that
    Schön refers to in the quote at the outset of this chapter, is an effective way to begin to develop
    a more mindful stance. In our professional practice, each of us will have developed routines for
    addressing situations that occur frequently. As Tripp (1993) points out, although routines

    may originally have been consciously planned and practiced, they will have become
    habitual, and so unconscious, as expertise is gained over time. Indeed, our routines often
    become such well-established habits that we often cannot say why we did one thing rather
    than another, but tend to put it down to some kind of mystery such as “professional
    intuition.” (p. 17)

    Another key way to critically reflect on our professional practice and understand what
    factors influence the formation of our professional judgments is to discuss our practice with
    our colleagues. Colleagues, especially those who are removed from the situation at hand or
    under discussion, can act as “critical friends” and can help in the work of analyzing and
    critiquing our professional judgments with an eye to improving them. With different education,
    training, and experience, our professional peers often have different perspectives from us.
    Consequently, involving colleagues in the process of analyzing and critiquing our professional
    practice allows us to compare with other professionals our ways of interpreting situations and
    choosing alternatives for action. Moreover, the simple act of describing and summarizing an issue so that our colleagues can understand it can provide much insight into the professional judgments we have incorporated.

    The Range of Professional Judgment Skills

    There is considerable interest in the evaluation field in outlining competencies that define
    sound practice (Ghere, King, Stevahn, & Minnema, 2006; King et al., 2001; Stevahn, King,
    Ghere, & Minnema, 2005a). Although there are different versions of what these competencies
    are, there is little emphasis on acquiring professional judgment skills as a distinct competency.
    Efforts to establish whether practitioners themselves see judgment skills as being important
    indicate a broad range of views, reflecting some important differences as to what evaluation
    practice is and ought to be (King et al., 2001).

    If we consider linkages between types of professional judgment and the range of activities
    that comprise evaluation practice, we can see that some kinds of professional judgment are
    more important for some clusters of activities than others. But for many evaluation activities,
    several different kinds of professional judgment can be relevant. Table 12.2 summarizes
    clusters of activities that reflect the design and implementation of a typical program evaluation
    or a performance measurement system. These clusters are based on the outlines for the design and implementation of program evaluations and performance measurement systems included in Chapters 1 and 9, respectively. Although they are not comprehensive, that is, they do not capture the detailed range of activities discussed earlier in this textbook, they illustrate the ubiquity of professional judgment in all areas of our practice.

    Table 12.2 suggests that for most clusters of evaluation activities, several different types of
    professional judgment are in play. The notion that somehow we could practice by exercising
    only technical and procedural professional judgment, or confining our judgment calls to one
    part of the evaluation process, is akin to staying on Schön’s (1987) “high hard ground.”

    Ways of Improving Sound Professional Judgment Through Education and Training-
    Related Activities

    Developing sound professional judgment depends substantially on being able to develop
    and practice the craft of evaluation. Schön (1987) and Tripp (1993), among others, have
    emphasized the importance of practice as a way of cultivating sound professional judgment.
    Although textbook knowledge ("knowing what") is also an essential part of every evaluator's toolkit, a key part of evaluation curricula is providing opportunities to acquire experience.

    Table 12.2 Types of Professional Judgment That Are Relevant to Program Evaluation and
    Performance Measurement

    There are at least six complementary ways that evaluation curricula can be focused to provide opportunities for students to develop their judgment skills. Some activities are discrete, that is, relevant for developing specific skills; these are generally limited to a single course or even part of a course. Others are more generic, offering opportunities to acquire experience that spans entire evaluation processes; these are typically activities that integrate coursework with real work experiences. Table 12.3 summarizes ways that academic
    programs can inculcate professional judgment capacities in their students.

    The types of learning activities in Table 12.3 are typical of many programs that train
    evaluators, but what is important is realizing that each of these kinds of activities contributes
    directly to developing a set of skills that all practitioners need and will use in all their
    professional work. In an important way, identifying these learning activities amounts to
    making explicit what has largely been tacit in our profession.

    Table 12.3 Learning Activities to Increase Professional Judgment Capacity in Novice
    Practitioners

    Course-based activities

    Problem/puzzle solving: Develop a coding frame and test the coding categories for intercoder reliability for a sample of open-ended responses to an actual client survey that the instructor has provided (a brief illustrative sketch of this exercise follows the table).

    Case studies: Make a decision for an evaluator who finds himself or herself caught between the demands of his or her superior (who wants evaluation interpretations changed) and the project team, who see no reason to make any changes.

    Simulations: Using a scenario and role playing, negotiate the terms of reference for an evaluation.

    Course projects: Students are expected to design a practical, implementable evaluation for an actual client organization.

    Program-based activities

    Apprenticeships/internships/work terms: Students work as apprentice evaluators in organizations that design and conduct evaluations, for extended periods of time (at least 4 months).

    Conduct an actual program evaluation: Working with a client organization, develop the terms of reference for a program evaluation, conduct the evaluation, including preparation of the evaluation report, deliver the report to the client, and follow up with appropriate dissemination activities.
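    As a brief illustration of the intercoder reliability exercise in Table 12.3, the sketch below computes simple percent agreement and Cohen's kappa for two raters; the codes and responses are hypothetical, and in a real course exercise the coding frame would come from the client survey itself:

        # Hypothetical codes assigned by two raters to the same ten open-ended responses
        rater_a = ["barrier", "barrier", "benefit", "other", "benefit",
                   "barrier", "other", "benefit", "barrier", "other"]
        rater_b = ["barrier", "benefit", "benefit", "other", "benefit",
                   "barrier", "barrier", "benefit", "barrier", "other"]

        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

        # Chance agreement for Cohen's kappa: sum over codes of the two raters' marginal proportions
        codes = set(rater_a) | set(rater_b)
        expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in codes)

        kappa = (observed - expected) / (1 - expected)
        print(f"Observed agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")  # 0.80 and ~0.70

    The procedural judgment in such an exercise is deciding whether the resulting level of agreement is good enough for the purpose at hand, or whether the coding frame needs to be revised and retested.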

    Teamwork and Professional Judgment

    Evaluators and managers often work in organizational settings where teamwork is
    expected. Successful teamwork requires establishing norms and expectations that encourage
    good communication, sharing of information, and a joint commitment to the task at hand.
    Being able to select team members and foster a work environment wherein people are willing
    to trust each other, and be open and honest about their own views on issues, is conducive to
    generating information that reflects a diversity of perspectives. Even though there will still be individual biases, the views expressed are more likely to be valid than simply the perceptions
    of a dominant individual or coalition in the group. An organizational culture that emulates
    features of learning organizations (Garvin, 1993; Mayne, 2008) will tend to produce
    information that is more valid as input for making decisions and evaluating policies and
    programs.

    Managers and evaluators who have the skills and experience to be able to call on others
    and, in doing so, be reasonably confident that honest views about an issue are being offered,
    have a powerful tool to complement their own knowledge and experience and their systematic
    inquiries. Good professional judgment, therefore, is partly about selecting and rewarding
    people who themselves have demonstrated a capacity to deliver sound professional judgment.

    Evaluation as a Craft: Implications for Learning to Become an Evaluation Practitioner

    Evaluation has both a methodological aspect, where practitioners are applying tools, albeit
    with the knowledge that the tools may not fit the situation exactly, and an aesthetic aspect,
    which entails developing an appreciation for the art of design, the conduct of evaluation-related
    research, and the interpretation of results. As Berk and Rossi (1999) contend, mastering a craft
    involves more than learning the techniques and tools of the profession; it involves developing
    “intelligence, experience, perseverance, and a touch of whimsy” (p. 99), which all form part of
    professional judgment. Traditionally, persons learning a craft apprenticed themselves to more
    senior members of the trade. They learned by doing, with the guidance and experience of the
    master craftsperson.

    We have come to think that evaluation can be taught in classrooms, often in university
    settings or in professional development settings. Although these experiences are useful, they
    are no substitute for learning how evaluations are actually done. Apprenticing to a person or
    persons who are competent senior practitioners is an important part of becoming a practitioner
    of the craft. Some evaluators apprentice themselves in graduate programs, preparing master’s
    or doctoral theses with seasoned practitioners. Others work with practitioners in work
    experience settings (e.g., co-op placements). Still others join a company or organization at a
    junior level and, with time and experience, assume the role of full members of the profession.

    Apprenticeship complements what can be learned in classrooms, from textbooks and other
    such sources. Schön (1987) points out that an ideal way to learn a profession is to participate in
    practical projects wherein students design for actual situations, under the guidance of
    instructors or coaches who are themselves seasoned practitioners. Students then learn by doing
    and also have opportunities, with the guidance of coaches, to critically reflect on their practice.

    In evaluation, an example of such an opportunity might be a course that is designed as a
    hands-on workshop to learn how to design and conduct a program evaluation. Cooksy (2008)
    describes such a course at Portland State University. Students work with an instructor who
    arranges for client organizations, who want evaluations done, to participate in the course.
    Students work in teams, and teams are matched with clients. As the course progresses, each
    team is introduced to the skills that are needed to meet client and instructor expectations for
    that part of the evaluation process. There are tutorials to learn skills that are needed for the
    teams’ work, and opportunities for teams to meet as a class to share their experiences and learn
    from each other and the instructor. Clients are invited into these sessions to participate as
    stakeholders and provide the class and the instructor with relevant and timely feedback. The
    teams are expected to gather relevant lines of evidence, once their evaluation is designed, and
    analyze the evidence. Written reports for the clients are the main deliverables for the teams,
    together with oral presentations of the key results and recommendations in class, with the
    clients in attendance.

    ETHICS FOR EVALUATION PRACTICE

    The Development of Ethics for Evaluation Practice

    In this chapter we have alluded to ethical decision-making as a part of the work evaluators
    do; it is a consideration in how they exercise professional judgment. The evaluation guidelines,
    standards, and principles that have been developed for the evaluation profession all speak, in
    different ways, to ethical practice. Although evaluation practice is not guided by a set of
    professional norms that are enforceable (Rossi, Lipsey, & Freeman, 2004), ethical guidelines
    are an important reference point for evaluators. Increasingly, organizations that involve people
    (e.g., clients or employees) in research are expected to take into account the rights of their
    participants across the stages of the evaluation: as the study objectives are framed, measures
    and data collection are designed and implemented, results are interpreted, and findings are
    disseminated. In universities, human research ethics committees routinely scrutinize research
    plans to ensure that they do not violate the rights of participants. In both the United States and
    Canada, there are national policies or regulations that are intended to protect the rights of
    persons who are participants in research (Canadian Institutes of Health Research, Natural
    Sciences and Engineering Research Council of Canada, & Social Sciences and Humanities
    Research Council of Canada, 2010; U.S. Department of Health and Human Services, 2009).

    The past quarter century has witnessed major developments in the domain of evaluation
    ethics. These include publication of the original and revised versions of the Guiding Principles
    for Evaluators (AEA, 1995, 2004), and the second and third editions of the Program Evaluation
    Standards (Sanders, 1994; Yarbrough, Shulha, Hopson, & Caruthers, 2011). Two examples of
    books devoted to program evaluation ethics (Morris, 2008; Newman & Brown, 1996) as well
    as chapters on ethics in handbooks in the field (Sieber, 2009; Simons, 2006) are additional
    resources. The AEA is active in promoting evaluation ethics with the creation of the Ethical
    Challenges section of the American Journal of Evaluation (Morris, 1998) and the addition of
    an ethics training module to the website of the AEA, as described in Morris’s The Good, the
    Bad, and the Evaluator: 25 Years of AJE Ethics (Morris, 2011).

    Morris (2011) has followed the development of evaluation ethics over the past quarter
    century and notes that there are few empirical studies that focus on evaluation ethics to date.
    Additionally, he argues that “most of what we know (or think we know) about evaluation
    ethics comes from the testimonies and reflections of evaluators”—leaving out the crucial
    perspectives of other stakeholders in the evaluation process (p. 145). Textbooks on the topic of
    evaluation range in the amount of attention that is paid to evaluation ethics—in some
    textbooks, it is the first topic of discussion on which the rest of the chapters rest, as in, for
    example, Qualitative Researching by Jennifer Mason (2002). In others, the topic arises later, or
    in some cases it is left out entirely.

    Newman and Brown (1996) have undertaken an extensive study of evaluation practice to
    establish ethical principles that are important for evaluators in the roles they play. Underlying
    their work are principles that they trace to Kitchener's (1984) discussions of ethical norms. Table 12.4 summarizes ethical principles that are taken in part from Newman and Brown (1996) and from the Tri-Council Policy on the Ethical Conduct for Research Involving Humans (Canadian Institutes of Health Research et al., 2010), and shows how they correspond to
    the AEA’s Guiding Principles for Evaluators (AEA, 2004) and the Canadian Evaluation
    Society (CES) Guidelines for Ethical Conduct (CES, 2012a).

    The ethical principles summarized in Table 12.4 are not absolute and arguably are not
    complete. Each one needs to be weighed in the context of a particular evaluation project and balanced with other ethical considerations. For example, the "keeping promises" principle
    suggests that contracts, once made, are to be honored, and normally that is the case. But
    consider the following example: An evaluator makes an agreement with the executive director
    of a nonprofit agency to conduct an evaluation of a major program that is delivered by the
    agency. The contract specifies that the evaluator will deliver three interim progress reports to
    the executive director, in addition to a final report. As the evaluator begins the work, several agency managers reveal that the executive director has been redirecting money from the project budget to office furniture, equipment, and personal travel expenses—none of these being connected with the program that is being evaluated. In the first interim report, the evaluator brings these concerns to the attention of the executive director, who denies any wrongdoing and makes it clear that the interim reports are not to be shared with anyone else. The evaluator discusses the situation with colleagues in the firm and decides to inform the chair of the agency's board of directors. The evaluator has broken the contract but has called on a broader principle that speaks to the honesty and integrity of the evaluation process.

    Table 12.4 Relationships Between the American Evaluation Association Principles,
    Canadian Evaluation Society Guidelines for Ethical Conduct, and Ethical Principles for
    Evaluators

    In Appendix A, we have included a case that provides you with an opportunity to make a choice for an evaluator who works in a government department. The evaluator is in a difficult situation and has to decide what to do, balancing ethical principles and his or her own well-being as the manager of an evaluation branch in that department. There is no right answer to this case. Instead, it shows how challenging ethical decision making can be and gives you an opportunity to make a choice and build a rationale for it. The case is a good example of what is involved in exercising deliberative judgment.

    Ethical Evaluation Practice

    Ethical behavior is not so much a matter of following principles as of balancing
    competing principles. (Stake & Mabry, 1998, p. 108)

    Ethical practice in program evaluation is situation specific and can be challenging. The
    guidelines and principles discussed earlier are general. Sound ethical evaluation practice is
    circumstantial, much like sound professional judgment. Practice with ethical decision making is essential, and dialogue is a key part of learning how ethical principles apply in practice and how they are experienced subjectively.

    How do we define ethical evaluation? Several definitions of sound ethical practice exist.
    Schwandt (2007, p. 401) refers to a “minimalist view,” in which evaluators develop sensitivity,
    empathy, and respect for others, and a “maximalist view,” which includes specific guidelines
    for ethical practice including “[recording] all changes made in the originally negotiated project
    plans, and the reasons why the changes were made” (AEA, 2004). Stake and Mabry (1998)
    define ethics as “the sum of human aspiration for honor in personal endeavor, respect in
    dealings with one another, and fairness in the collective treatment of others” (p. 99).
    Schweigert (2007) defines stated ethics in program evaluation as “limits or standards to
    prohibit intentional harms and name the minimum acceptable levels of performance” (p. 394).
    Ethical problems in evaluation are often indistinct, pervasive, and difficult to resolve with
    confidence (Stake & Mabry, 1998).

    Although guidelines and professional standards can help guide the evaluator toward more
    ethical decisions, they have been criticized as lacking enforceability and failing to anticipate
    the myriad situations inevitable in practice (Bamberger, Rugh, & Mabry, 2012)—hence the
    call for cultivating sound professional judgment (through reflective practice) in applying the
    principles and guidelines.

    Like other professional judgment decisions, appropriate ethical practice occurs throughout
    the evaluation process. It usually falls to the evaluator to lead by example, ensuring that ethical
    principles are adhered to and are balanced with the goals of the stakeholders. Brandon, Smith,
    and Hwalek (2011), in discussing a successful private evaluation firm, describe the process this
    way:

    Ethical matters are not easily or simply resolved but require working out viable solutions
    that balance professional independence with client service. These are not technical matters
    that can be handed over to well-trained staff or outside contractors, but require the
    constant, vigilant attention of seasoned evaluation leaders. (p. 306)

    In contractual engagements, the evaluator has to make a decision to move forward with a
    contract or, as Smith (1998) describes it, to determine if an evaluation contract may be “bad for
    business” (p. 178). Smith goes on to recommend declining a contract if the desired work is not
    possible at an “acceptable level of quality” (Smith, 1998, p. 178). For internal evaluators,
    turning down an evaluation contract may have career implications. The case study at the end of this chapter explores this dilemma. Smith (1998) cites Mabry (1997) in describing the
    challenge of adhering to ethical principles for the evaluator:

    Evaluation is the most ethically challenging of the approaches to research inquiry because
    it is the most likely to involve hidden agendas, vendettas, and serious professional and
    personal consequences to individuals. Because of this feature, evaluators need to exercise
    extraordinary circumspection before engaging in an evaluation study. (Mabry, 1997, p. 1,
    cited in Smith, 1998, p. 180)

    Cultural Competence in Evaluation Practice

    While issues of cultural sensitivity are addressed in Chapter 5, cultural sensitivity is as
    important for quantitative evaluation as it is for qualitative evaluation. We are including
    cultural competence in this section on ethics, as cultural awareness is an important feature of
    not only development evaluation, where we explicitly work across cultures, but also virtually any evaluation conducted in our increasingly multicultural society. Evaluations in the health,
    education, or social sectors, for example, would commonly require that the evaluator have
    cultural awareness and sensitivity.

    There is a growing recognition of the importance and relevance of cultural awareness in evaluations. Schwandt (2007) notes that "the Guiding
    Principles (as well as most of the ethical guidelines of academic and professional associations
    in North America) have been developed largely against the foreground of a Western
    framework of moral understandings” (p. 400) and are often framed in terms of individual
    behaviors, largely ignoring the normative influences of social practices and institutions. The
    AEA
