EDU 530 Week 1 Discussion

 Module 1 Discussion


Data Use & Teaching

Directions:

To make the best use of data, educators must go beyond the big tests and involve teachers and students in collecting and analyzing data. After studying Module 1: Lecture Materials & Resources, discuss the following:

  • Education Majors:  

    How can educators effectively involve both teachers and students in the data collection and analysis process?  
    What challenges might arise from this approach, and how can they be addressed to ensure data-driven decision-making benefits the learning environment?

  • Instructional Design Majors: 

    As an instructional designer, how would you design a framework that involves both teachers and students in the process of collecting and analyzing data? 
    What tools, strategies, or methods would you recommend to ensure the data collected is meaningful and actionable for improving learning outcomes?

Submission Requirements:

  • Each post should be clear and concise; points will be deducted for errors in grammar, punctuation, and spelling.
  • Your initial post should be at least 200 words, formatted and cited in current APA style. You should research and reference one article, as well as reference the textbook readings.

Module 1: Lecture Materials & Resources

 

Introduction to Assessment & Data Analysis

Read and watch the lecture resources & materials below early in the week to help you respond to the discussion questions and to complete your assignment(s).

(Note: The citations below are provided for your research convenience. You should always cross-reference the current APA guide for correct styling of citations and references in your academic work.)

Read

· Popham, W. J. (2024). Classroom assessment: What teachers need to know (10th ed.). Pearson.

· Chapter 6: Selected-Response Tests

· Chapter 7: Constructed-Response Tests

· Chapter 8: Performance Assessment

· Chapter 9: Portfolio Assessment 

· McDonald, J. P. (2019). Toward more effective data use in teaching. Phi Delta Kappan, 100(6), 50-54.

· As part of your readings in this module, please also review the following:

· Syllabus

· APA and Research Guides

 

Watch

· Teachings in Education. (2016, December 18). Assessment in education: Top 14 examples (4:21) [Video]. YouTube. https://youtu.be/zTkQjH-_97c

Supplemental Materials & Resources

· Popham, W. J. (2010). Everything school leaders need to know about assessment. Corwin Press.
  Print: 9781412979795
  eText: 9781452271514


Chapter 6

Selected-Response Tests

Chief Chapter Outcome

The ability to accurately employ professionally accepted item-writing
guidelines, both general and item-type specific, when constructing
selected-response items or evaluating those items constructed by others

Learning Objectives

6.1 Using the “Five General Item-Writing Precepts” found in this chapter,
identify and explain common characteristics of poorly constructed
selected-response test items.

6.2 Differentiate between the four different varieties of selected-response test
items (binary-choice items, multiple binary-choice items, multiple-choice
items, matching items) and be able to create an example of each.

In this and the following four chapters, you will learn how to construct almost
a dozen different kinds of test items you might wish to use for your own class-
room assessments. As suggested in the preceding chapters, you really need to
choose item types that mesh properly with the inferences you want to make about
students—and to be sure those inferences are directly linked to the educational
decisions you need to make. Just as the child who’s convinced that vanilla is ice
cream’s only flavor won’t benefit from a 36-flavor ice cream emporium, the more
item types you know about, the more appropriate your selection of item types will
be. So, in this chapter and the next four, you’ll be learning about blackberry-ripple
exams and mocha-mango assessment devices.

Realistically, what should you expect after wading through the exposition about
item construction contained in the upcoming text? Unless you’re a remarkably
quick study, you’ll probably finish this material and not be instantly transformed

into a consummately skilled test constructor. It takes more than reading a brief
explanation to turn someone into a capable item developer. But, just as the journey
of a thousand miles begins with a single step, you’ll have initiated a tentative trot
toward the Test Construction Hall of Fame. You’ll have learned the essentials of how
to construct the most common kinds of classroom assessments.

What you’ll need, after having completed your upcoming, fun-filled study of
test construction is tons of practice in churning out classroom assessment devices.
And, if you remain in teaching for a while, such practice opportunities will surely
come your way. Ideally, you’ll be able to get some feedback about the quality of
your classroom assessment procedures from a supervisor or colleague who is
experienced and conversant with educational measurement. If a competent cohort
critiques your recent test construction efforts, you’ll profit by being able to make
needed modifications in how you create your classroom assessment instruments.

Expanding Electronic Options
Once upon a time, when teachers churned out all their own classroom tests, about
the only approach available was reliance on paper in order to present items that,
back then, students responded to using pencils or pens. Oh, if you head back far
enough in history, you might find Egyptian teachers relying on papyrus or, perhaps,
pre-history teachers dishing up pop quizzes on tree bark.

But we have evolved (some more comfortably than others) into a technological
age in which whole classrooms full of students possess laptop computers, elec-
tronic tablets, or super-smart cell phones that can be employed during instruction
and assessment. Accordingly, because the availability of such electronically provided
assessment options depends almost totally on what’s available for use in a particu-
lar district or school, some of the test construction guidelines you will encounter
here may need to be massaged because of electronic limitations in the way a test’s
items can be written. To illustrate, you’ll learn how to create “matching” items for a
classroom assessment. Well, one of the recommendations to teachers who use such
items is that they put everything for a given item on a single page (of paper)—so
that the students need not flip back and forth between pages when selecting their
answers. But what if the electronic devices that a teacher’s students have been
given do not provide sufficient room to follow this all-on-one-page guideline?

Well, in that situation it makes sense for a teacher to arrive at the most reason-
able/sensible solution possible. Thus, the test construction guidelines you’ll encounter
from here on in this book will be couched almost always in terms suitable for paper-
presented tests. If you must create classroom assessments using electronic options
that fail to permit implementation of the guidelines presented, just do the best job you
can in adapting a guideline to the electronic possibilities at hand. Happily, the use
of electronic hardware will typically expand, not truncate, your assessment options.

Before departing from our consideration of emerging digital issues concern-
ing educators, a traditional warning is surely warranted. Although the mission
and the mechanics of educational testing are generally understood by many

educators, the arrival of brand-new digitally based assessment procedures is apt
to baffle today’s educators who fail to keep up with evolving versions of modern
educational measurement. New ways of testing students and more efficient ways
of doing so suggest that today’s teachers simply must keep up with innovative test-
ing procedures heretofore unimagined.

This advice was confirmed in an April 12, 2022, Education Week interview with
Sal Khan, founder of the nonprofit Khan Academy, which now counts 137 million
users in 190 nations. Khan was asked how best to employ new technology and soft-
ware tools to close the learning gaps that emerged during the COVID-19 pandemic.
He remarked, “I think that’s going to be especially important because traditional
testing regimes have been broken. And it’s unclear what they’re going back to.”

Because of pandemic-induced limitations on large crowds, such as those rou-
tinely seen over the years when students were obliged to complete high-stakes
examinations, several firms have been exploring the virtues of providing custom-
ers with digitalization software that can track a test-taker’s eye movements and
even sobbing during difficult exams. Although the bulk of these development
efforts have been aimed at college-level students, it is almost certain that experi-
mental electronic proctoring systems will soon be aimed at lower grade levels.

A May 27, 2022, New York Times report by Kashmir Hill makes clear that the
COVID-19 pandemic, because of its contagion perils, created “a boom time for
companies that remotely monitor test-takers.” Suddenly, “millions of people were
forced to take bar exams, tests, and quizzes alone at home on their laptops.” Given
the huge number of potential customers in the nation’s K–12 schools, is not a shift
of digital proctoring companies into this enormous market a flat-out certainty?

Ten (Divided by Two) Overall Item-Writing Precepts
As you can discern from its title, this text is going to describe how to construct
selected-response sorts of test items. You’ll learn how to create four different
varieties of selected-response test items—namely, binary-choice items, multiple
binary-choice items, multiple-choice items, and matching items. All four of these
selected-response kinds of items can be used effectively by teachers to derive
defensible inferences about students’ cognitive status—that is, the knowledge and
skills that teachers typically try to promote in their students.

But no matter whether you’re developing selected-response or
constructed-response test items, there are several general guidelines that, if
adhered to, will lead to better assessment procedures. Because many ancient sets
of precepts have been articulated in a fairly stern “Thou shall not” fashion, and
have proved successful in shaping many folks’ behavior through the decades,
we will now dish out five general item-writing commandments structured along
the same lines. Following these precepts might not get you into heaven, but it
will make your assessment schemes slightly more divine. All five item-writing

precepts are presented in a box below. A subsequent discussion of each precept
will help you understand how to adhere to the five item-writing mandates being
discussed. It will probably help if you refer to each of the following item-writing
precepts (guidelines) before reading the discussion of that particular precept.
Surely, no one would be opposed to your doing just a bit of cribbing!

Five General Item-Writing Precepts

1. Thou shalt not provide opaque directions to students regarding how to respond.
2. Thou shalt not employ ambiguous statements in your assessment items.
3. Thou shalt not provide students with unintentional clues regarding appropriate responses.
4. Thou shalt not employ complex syntax in your assessment items.
5. Thou shalt not use vocabulary that is more advanced than required.

Opaque Directions

Our first item-writing precept deals with a topic most teachers haven't thought seriously about—the directions for their classroom tests. Teachers who have been laboring to create a collection of test items typically know the innards of those items very well. Thus, because of the teacher's intimate knowledge not only of the items, but also of how students are supposed to deal with those items, it is often the case that only sketchy directions are provided to students regarding how to respond to a test's items. Yet, of course, unclear test-taking directions can result in confused test-takers. And the responses of confused test-takers don't lead to very accurate inferences about those test-takers.

Flawed test directions are particularly problematic when students are being introduced to assessment formats with which they're not very familiar, such as the performance tests to be described in Chapter 8 or the multiple binary-choice tests to be discussed later in this chapter. It is useful to create directions for students early in the game when you're developing an assessment instrument. When generated as a last-minute afterthought, test directions typically turn out to be tawdry.

Ambiguous Statements

The second item-writing precept deals with ambiguity. In all kinds of classroom assessments, ambiguous writing is to be avoided. If your students aren't really sure about what you mean in the tasks you present to them, the students are apt to misinterpret what you're saying and, as a consequence, come up with incorrect responses, even though they might really know how to respond correctly. For example, sentences in which pronouns are used can fail to make it clear to which individual or individuals a pronoun refers. Suppose that, in a true–false test item, you asked your students to indicate whether the following statement was true or false: "Leaders of developing nations have tended to distrust leaders of developed nations due to their imperialistic tendencies." Because it is unclear whether the pronoun their refers to the "leaders of developing nations" or to the "leaders of developed nations," and because the truth or falsity of the statement depends on the pronoun's referent, students are likely to be confused.

Because you will typically be writing your own assessment items, you will
know what you mean. At least you ought to. However, try to slide yourself, at
least figuratively, into the shoes of your students. Reread your assessment items
from the perspective of the students, and then modify any statements apt to be
even a mite ambiguous for those less well-informed students.

Unintended Clues
The third of our item-writing precepts calls for you to intentionally avoid some-
thing unintentional. (Well, nobody said that following these assessment precepts
was going to be easy!) What this precept is trying to sensitize you to is the
tendency of test-development novices to inadvertently provide clues to students
about appropriate responses. As a consequence, students come up with correct
responses even if they don’t possess the knowledge or skill being assessed.

For example, inexperienced item-writers often tend to make the correct answer to multiple-choice items twice as long as the incorrect answers. Even the most confused students will often opt for the lengthy response; they get so many more words for their choice. As another example of how inexperienced item-writers unintentionally dispense clues, absolute qualifiers such as never and always are sometimes used for the false items in a true–false test. Because even uninformed students know there are few absolutes in this world, they gleefully (and often unthinkingly) indicate such items are false. One of the most blatant examples of giving unintended clues occurs when writers of multiple-choice test items initiate those items with incomplete statements such as "The bird in the story was an . . ." and then offer answer options in which only the correct answer begins with a vowel. For instance, even though you had never read the story referred to in the previous incomplete statement, if you encountered the following four response options, it's almost certain that you'd know the correct answer: A. Falcon, B. Hawk, C. Robin, D. Owl. The article an gives the game away.

Unintended clues are seen more frequently with selected-response items than with constructed-response items, but even in supplying background information to students for complicated constructed-response items, the teacher must be wary of unintentionally pointing truly unknowledgeable students deftly down a trail to the correct response.

Computer-Adaptive Assessment: Pros and Cons

Large-scale assessments, such as statewide accountability tests or nationally standardized achievement tests, are definitely different from the classroom tests that teachers might, during a dreary weekend, whip up for their students. Despite those differences, the assessment tactics used in large-scale tests should not be totally unknown to teachers. After all, parents of a teacher's students might occasionally toss out questions at a teacher about such tests, and what teacher wants to be seen, when it comes to educational testing, as a no-knowledge ninny?

One of the increasingly prevalent variations of standardized achievement testing encountered in our schools is known as computer-adaptive assessment. In some instances, computer-adaptive testing is employed as a state's annual, large-scale accountability assessment. In other instances, commercial vendors offer computer-adaptive tests to cover shorter segments of instruction, such as two or three months. School districts typically purchase such shorter-duration tests in an attempt to assist classroom teachers in adjusting their instructional activities to the progress of their students. In general, these more instructionally oriented tests are known as interim assessments, and we will consider such assessments more deeply later (in Chapter 12).

Given the near certainty that students in many states will be tested via computer-adaptive assessments, a brief description of this distinctive assessment approach is in order.

Not all assessments involving computers, however, are computer-adaptive. Computer-based assessments rely on computers to deliver test items to students. Moreover, students respond to these computer-transmitted items by using a computer. In many instances, immediate scoring of students' responses is possible. This form of computer-abetted assessment is becoming more and more popular as (1) schools acquire enough computers to make the approach practicable and (2) states and school districts secure sufficient "bandwidth" (whatever that is!) to transmit tests and receive students' responses electronically. Computer-based assessments, as you can see, rely on computers only as delivery and retrieval mechanisms. Computer-adaptive assessment is something quite different.

Here's a shorthand version of how computer-adaptive assessment is usually described. Notice, incidentally, the key term adaptive in its name. That word is your key to understanding how this approach to educational assessment is supposed to function. As a student takes this kind of test, the student is given items of known difficulty levels. Then, based on the student's responses to those initial items, an all-knowing computer supplies new items that are tailored in difficulty level on the basis of the student's previous answers. For instance, if a student is correctly answering the early items doled out by the computer, then the next items popping up on the screen will be more difficult ones. Conversely, if the student stumbles on the initially presented items, the computer will then cheerfully provide easier items to the student, and so on. In short, the computer's program constantly adapts to the student's responses by providing items better matched to the student's assessment-determined level of achievement.
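The adjustment loop just described can be sketched in a few lines of Python. The sketch below is purely illustrative and rests on made-up assumptions (a toy item pool keyed by difficulty, a fixed step size, and a crude running estimate); it is not the algorithm used by any actual testing program, which would typically rely on item response theory to select items and estimate ability.

```python
# Illustrative sketch only: a toy item-adaptive loop, not any vendor's
# operational algorithm.

def run_adaptive_test(item_pool, answer_item, num_items=10,
                      start_difficulty=0.0, step=0.5):
    """Administer items whose difficulty tracks the student's responses.

    item_pool   -- dict mapping item_id -> difficulty (higher = harder); hypothetical
    answer_item -- callback(item_id) -> True if the student answers correctly
    """
    administered = set()
    target = start_difficulty      # difficulty aimed for on the next item
    estimate = start_difficulty    # crude running ability estimate

    for _ in range(num_items):
        candidates = [i for i in item_pool if i not in administered]
        if not candidates:
            break
        # Pick the unused item whose difficulty is closest to the current target.
        item = min(candidates, key=lambda i: abs(item_pool[i] - target))
        administered.add(item)

        correct = answer_item(item)
        # Adapt: aim harder after a correct answer, easier after a miss.
        delta = step if correct else -step
        target += delta
        estimate += delta / 2

    return estimate

# Example: 20 items spread from -2.0 to +2.0, and a simulated student who
# answers anything easier than 0.7 correctly.
pool = {f"item{i}": -2.0 + i * (4.0 / 19) for i in range(20)}
print(run_adaptive_test(pool, lambda item_id: pool[item_id] < 0.7))
```

A cluster-adaptive variant, described later in this feature, would apply the same sort of adjustment only after each cluster of roughly a half-dozen items rather than after every single item.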

Using this adaptive approach, a student’s
overall achievement regarding whatever the test
is measuring can be determined with fewer items
than would typically be required. This is because,
in a typical test, many of the test’s items are likely
to be too difficult or too easy for a given student.
Accordingly, one of the advertised payoffs of
computer-adaptive testing is that it saves testing
time—that is, it saves those precious instructional
minutes often snatched away from teachers because
of externally imposed assessment obligations. And
there it is, the promotional slogan for computer-
adaptive testing: More Accurate Measurement in
Less Time! What clear-thinking teacher does not
get just a little misty eyed when contemplating the
powerful payoffs of computer-massaged testing?

But there are also limitations of computer-
adaptive testing that you need to recognize. The
first of these limitations stems from the necessity
for all the items in such assessments to be
measuring a single variable, such as students’
“mathematical mastery.” Because many items are
needed to make computer-adaptive assessment
purr properly, and because the diverse difficulties of
these items must all be linked to what’s sometimes
referred to as “a unidimensional trait” (for instance,
a child's overall reading prowess), computer-
adaptive assessment precludes the possibility of
providing student-specific diagnostic data. Too
few items dealing with a particular subskill or a
body of enabling knowledge can be administered
during a student’s abbreviated testing time. In other
words, whereas computer-adaptive assessment
can supply teachers with an efficiently garnered
general fix on a student’s achievement of what’s

often a broadly conceptualized curricular aim, it
often won’t reveal a student’s specific strengths
and weaknesses. Thus, from an instructional
perspective, computer-adaptive assessment
usually falls short of supplying the instructionally
meaningful results most teachers need.

Second, as students wend their way merrily
through a computer-adaptive test, depending on
how they respond to certain items, different students
frequently receive different items. The adjustments in
the items that are dished up to a student depend on
the student’s computer-determined status regarding
whatever unidimensional trait is being measured.
Naturally, because of differences in students’ mastery
of this “big bopper” variable, different students receive
different items thereafter. Consequently, a teacher’s
students no longer end up taking the same exam. The
only meaningful way of comparing students’ overall test
performances, therefore, is by employing a scale that
represents the unidimensional trait (such as a student’s
reading capabilities) being measured. We will consider
such scale scores later, in Chapter 13. Yet, even before
we do so, it should be apparent that when a teacher
tries to explain to a parent why a parent’s child tackled
a test with a unique collection of items, the necessary
explanation is a challenging one—and sometimes an
almost impossible one, when the teacher’s explanations
hinge on items unseen by the parent’s child.

Finally, when most educators hear about
computer-adaptive assessment and get a general
sense of how it operates, they often assume that it
does its digital magic in much the same way—from
setting to setting. In other words, educators think
the program governing the operation of computer-
adaptive testing in State X is essentially identical to the
way the computer-adaptive testing program operates
in State Y. Not so! Teachers need to be aware that
the oft-touted virtues of computer-adaptive testing
are dependent on the degree of adaptivity embodied
in the program that’s being employed to analyze the
results. Many educators, once they grasp the central
thrust of computer-adaptive testing, believe that
after a student’s response to each of a test’s items,
an adjustment is made in the upcoming test items.
This, of course, would represent an optimal degree
of adaptivity. But item-adaptive adjustments are not
often employed in the real world because of such
practical considerations as the costs involved.

Accordingly, most of today's computer-adaptive tests employ what is called a cluster-adaptive approach. One or more clusters of items, perhaps a half-dozen or so items per cluster, are used to make an adjustment in the upcoming items for a student. Such adjustments are based on the student's performance on the cluster of items. Clearly, the more clusters of items that are employed, the more tailored to a student's status will be the subsequent items. Thus, you should not assume that the computer-adaptive test being employed in your locale is based on an item-adaptive approach when, in fact, it might be doing its adaptive magic based on only a single set of cluster-adaptive items. Computer adaptivity may be present, but it is a far cry from the mistaken conception of sustained adaptivity that many educators currently possess.

To sum up, then, computer-adaptive assessments can, indeed, provide educators with a pair of potent advantages—they deliver more accurate assessment and take less time to do so. However, of necessity, any meaningful diagnostic dividends usually scurry out the window with such computer-adaptive testing, and even the advertised dividends of computer-adaptive testing tumble if cost-conscious versions of computer adaptivity are being employed. If computer-adaptive testing is going on in your locale, find out how it does its digital dance so that you can determine what instructional help, if any, this avant-garde version of educational assessment supplies to you. Few classroom teachers will attempt, on their own, to build computer-adaptive tests just for their own students. Classroom teachers are too smart—and have other things going on in their lives. Accordingly, teachers' knowledge about the general way in which computer-adaptive testing functions should position such teachers to better evaluate any district-selected or state-selected computer-adaptive assessments. Does a district-chosen computer-adaptive test that purports to be a boon to teachers' instructional decision making show evidence that it truly does so? Is a state-imposed computer-adaptive test that's intended to evaluate the state's schools accompanied by evidence that it does so accurately? In short, teachers should not assume that merely because an educational test is computer-adaptive, this guarantees that the test is reliable, valid, or fair for its intended uses. If you have any required test being used in your setting without enough evidence of its suitability for its intended purpose, you should consider registering your dismay over this educationally unsound practice. And this is true even for assessments jauntily cloaked in computer-adaptive attire.

Complex Syntax

Complex syntax, although it sounds something like an exotic surcharge on cigarettes and alcohol, is often encountered in the assessment items of neophyte item-writers. Even though some teachers may regard themselves as Steinbecks-in-hiding, an assessment instrument is no setting in which an item-writer should wax eloquent. This fourth item-writing precept directs you to avoid complicated sentence constructions and, instead, to use very simple sentences. Although esteemed writers such as Thomas Hardy and James Joyce are known for their convoluted and clause-laden writing styles, they might have turned out to be mediocre item-writers. Too many clauses, except at Christmastime, mess up test items. (For readers needing a clue regarding the previous sentence's cryptic meaning, think of a plump, red-garbed guy who brings presents.)

Difficult Vocabulary

Our fifth and final item-writing precept is straightforward. It indicates that when writing educational assessment items, you should eschew obfuscative verbiage. In other words—and almost any other words would be preferable—use vocabulary suitable for the students who'll be taking your tests. Assessment time is not the occasion for you to trot out your best collection of polysyllabic terms or to secure a series of thesaurus-induced thrills. The more advanced that the vocabulary level is in your assessment devices, the more likely you'll fail to get a good fix on your students' true status. They will have been laid low by your overblown vocabulary. In the case of the terminology to be used in classroom assessment instruments, simple wins.

In review, you’ve now seen five item-writing precepts that apply to any
kind of classroom assessment device you develop. The negative admonitions in
those precepts certainly apply to tests containing selected-response items. And
that’s what we’ll be looking at in the rest of the chapter. More specifically, you’ll
be encountering a series of item-writing guidelines to follow when construct-
ing particular kinds of selected-response items. For convenience, we can refer
to these guidelines linked to particular categories of items as “item-category
guidelines” or “item-specific guidelines.” These item-category guidelines are
based either on empirical research evidence or on decades of teachers’ experi-
ence in using such items. If you opt to use any of the item types to be described
here in your own classroom tests, try to follow the guidelines for that kind of
item. Your tests will typically turn out to be better than if you hadn’t.

Here, and in several subsequent chapters, you’ll be encountering sets of
item-writing, item-revision, and response-scoring guidelines that you’ll be
encouraged to follow. Are the accompanying guidelines identical to the guide-
lines you’re likely to find in other texts written about classroom assessment? No,
they are not identical, but if you were to line up all of the texts ever written about
classroom testing, you’d find that their recommendations regarding the care and
feeding of items for classroom assessment are fundamentally similar.

In your perusal of selected-response options, when deciding what sorts of
items to employ, this is a particularly good time to make sure that your items
match the level of cognitive behavior that you want your students to display.
As always, a student’s performance on a batch of items can supply you with the
evidence that’s necessary to arrive at an inference regarding the student’s status.
If you are attempting to promote students' higher-order cognitive skills, then
don’t be satisfied with the sorts of selected-response items that do little more
than tap into a student’s memorized information.

Binary-Choice Items
A binary-choice item gives students only two options from which to select. The
most common form of binary-choice item is the true–false item. Educators have
been using true–false tests probably as far back as Socrates. (True or False: Plato
was a type of serving-dish used by Greeks for special meals.) Other variations
of binary-choice items are those in which students must choose between the
pairs yes–no, right–wrong, correct–incorrect, fact–opinion, and so on.

The virtue of binary-choice items is they are typically so terse that students
can respond to many items in a short time. Therefore, it is possible to cover a
large amount of content in a brief assessment session. The greatest weakness of
binary-choice items is that, because there are only two options, students have
a 50–50 chance of guessing the correct answer even if they don’t have the fog-
giest idea of what’s correct. If a large number of binary-choice items are used,
however, this weakness tends to evaporate. After all, although students might
guess their way correctly through a few binary-choice items, they would need
to be extraordinarily lucky to guess their way correctly through 30 such items.
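That guessing argument is easy to quantify. The short calculation below is an illustrative aside rather than something from the chapter; it shows how rapidly the odds of a strong score by blind guessing collapse as the number of binary-choice items grows.

```python
from math import comb

def prob_at_least(num_items: int, num_correct: int, p: float = 0.5) -> float:
    """Probability of getting at least `num_correct` right by pure guessing."""
    return sum(comb(num_items, k) * p**k * (1 - p)**(num_items - k)
               for k in range(num_correct, num_items + 1))

# One binary-choice item: a coin flip.
print(prob_at_least(1, 1))     # 0.5
# Guessing 27 or more of 30 binary-choice items correctly: roughly 4 in a million.
print(prob_at_least(30, 27))   # ~4.2e-06
# Guessing all 30: about 1 chance in a billion.
print(prob_at_least(30, 30))   # ~9.3e-10
```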

Here are five item-category guidelines for writing binary-choice items. A brief
discussion of each guideline is provided in the following paragraphs.

Item-Writing Precepts for Binary-Choice Items

1. Phrase items so that a superficial analysis by the student suggests a wrong answer.
2. Rarely use negative statements, and never use double negatives.
3. Include only one concept in each statement.
4. Employ an approximately equal number of items representing the two categories being tested.
5. Keep item length similar for both categories being tested.

Phrasing Items to Elicit Thoughtfulness
Typically, binary items are quite brief, but brevity need not reflect simplistic choices
for students. In order to get the most payoff from binary-choice items, you’ll want
to phrase items so that students who approach the items superficially will answer
them incorrectly. Thus, if you were creating the items for a true–false test, you would
construct statements for your items that were not blatantly true or blatantly false.
Blatancy in items rarely leads to accurate inferences about students. Beyond blatancy
avoidance, however, you should phrase at least some of the items so that if students
approach them unthinkingly, they’ll choose false for a true statement, and vice versa.
What you’re trying to do is to get students to think about your test items and, thereby,
give you a better idea about how much good thinking the students can do.

Minimizing Negatives
With binary-choice items, many students have a really difficult time responding to
negatively phrased items. For instance, suppose that in a true–false test you were
asked to decide about the truth or falsity of the following statement: “The League of
Nations was not formed immediately after the conclusion of World War II.” What the
item is looking for as a correct answer to this statement is true, because the League

of Nations was in existence prior to World War II. Yet, the existence of the word not
in the item really will confuse some students. They’ll be apt to answer false even if
they know the League of Nations was functioning before World War II commenced.

Because seasoned teachers have churned out their share of true–false items
over the years, most such teachers know all too well how tempting it is to simply
insert a not into an otherwise true statement. But don’t yield to the temptation.
Only rarely succumb to the lure of the nagging negative in binary-choice items.
Items containing double negatives or triple negatives (if you could contrive one)
are obviously to be avoided.

Avoiding Double-Concept Items
The third guideline for binary-choice items directs you to focus on only a
single concept in each item. If you are creating a statement for a right–wrong
test, and you have created an item in which half of the statement is clearly
right and the other half is clearly wrong, you make it mighty difficult for
students to respond correctly. The presence of two concepts in a single item,
even if both are right or both are wrong, tends to confuse students and, as a
consequence, yields test results that are apt to produce inaccurate inferences
about those students.

Balancing Response Categories
If you’re devising a binary-choice test, try to keep somewhere near an equal num-
ber of items representing the two response categories. For example, if it’s a true–
false test, make sure you have similar proportions of true and false statements.
It’s not necessary to have exactly the same number of true and false items. The
numbers of true and false items, however, should be roughly comparable. This
fourth guideline is quite easy to follow if you simply keep it in mind when creat-
ing your binary-choice items.

Maintaining Item-Length Similarity
The fifth guideline is similar to the fourth because it encourages you to structure
your items so there are no give-away clues associated with item length. If your two
response categories are accurate and inaccurate, make sure the length of the accurate
statements is approximately the same as the length of the inaccurate statements.
When creating true–false tests, there is a tendency to toss in qualifying clauses for
the true statements so that those statements, properly qualified, are truly true—but
also long! As a result, there’s a systematic pattern wherein long statements tend
to be true and short statements tend to be false. As soon as students catch on to
this pattern, they can answer items correctly without even referring to an item’s
contents.

In review, we’ve considered five item-writing guidelines for binary-choice
items. If you follow those guidelines and keep your wits about you when creat-
ing binary-choice items, you’ll often find this type of test will prove useful in the
classroom. And that’s true, not false.

Multiple Binary-Choice Items
A multiple binary-choice item is one in which a cluster of items is presented to stu-
dents, requiring a binary response to each of the items in the cluster. Typically, but
not always, the items are related to an initial statement or set of statements. Multiple
binary-choice items are formatted so they look like traditional multiple-choice tests.
In a multiple-choice test, the student must choose one answer from several options,
but in the multiple binary-choice test, the student must make a response for each
statement in the cluster. Figure 6.1 is an example of a multiple binary-choice item.

David Frisbie (1992) reviewed research on such items and concluded that mul-
tiple binary-choice items are (1) highly efficient for gathering student achievement
data, (2) more reliable than other selected-response items, (3) able to measure the
same skills and abilities as multiple-choice items dealing with comparable content,
(4) a bit more difficult for students than multiple-choice tests, and (5) perceived by
students as more difficult but more efficient than multiple-choice items. Frisbie
believes that when teachers construct multiple binary-choice items, they must be
attentive to all of the usual considerations in writing regular binary-choice items.
However, he suggests the following two additional guidelines.

• • • Suppose that a dozen of your students completed a 10-item
multiple-choice test and earned the following numbers of correct scores:

5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10

9. The median for your students’ scores is 7.5. (True)

10. The mode for the set of scores is 8.0. (False)

11. The range of the students’ scores is 5.0. (True)

12. The median is different than the mean. (False)

Figure 6.1 An Illustrative Multiple True–False Item
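As a quick check on the keyed answers in Figure 6.1, the twelve scores can be run through Python's statistics module. This snippet is an illustrative aside, not part of the original figure.

```python
from statistics import mean, median, multimode

scores = [5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10]

print(median(scores))             # 7.5 -> item 9 keyed True
print(multimode(scores))          # [7, 8]: bimodal, so "the mode is 8.0" is keyed False (item 10)
print(max(scores) - min(scores))  # 5   -> item 11 keyed True
print(mean(scores))               # 7.5: equal to the median, so item 12 keyed False
```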

Item-Writing Precepts for Multiple Binary-Choice Items

1. Separate item clusters vividly from one another.
2. Make certain that each item meshes well with the cluster’s stimulus material.

Separating Clusters
Because many students are familiar with traditional multiple-choice items, and because
each of those items is numbered, there is some danger that students may become con-
fused by the absence of numbers where such numbers are ordinarily found. Thus, be

sure to use some kind of formatting system to make it clear that a new cluster is com-
mencing. In the illustrative item seen in Figure 6.1, notice that three dots (• • •) have
been used to signify the beginning of a new cluster of items. You can use asterisks,
lines, boxes, or some similar way of alerting students to the beginning of a new cluster.

Coordinating Items with Their Stem
In multiple-choice items, the first part of the item—the part preceding the response
options—is usually called the stem. For multiple binary-choice items, we refer to
the first part of the item cluster as the cluster’s stem or stimulus material. The
second item-writing guideline for this item type suggests that you make sure all
items in a cluster are, in fact, linked in a meaningful way to the cluster’s stem. If
they’re not, then you might as well use individual binary-choice items rather than
multiple binary-choice items.

There’s another compelling reason why you should consider adding multiple
binary-choice items to your classroom assessment repertoire. Unlike traditional
binary-choice items, for which it’s likely students will need to rely on memorized infor-
mation, that’s rarely the case with multiple binary-choice items. It is rare, that is, if
the stimulus materials contain content that’s not been encountered previously by stu-
dents. In other words, if the stem for a subset of multiple binary-choice items contains
material that’s new to the student, and if each binary-choice item depends directly on
the previously unencountered content, it’s dead certain the student will need to func-
tion above the mere recollection of knowledge, the lowest level of Bloom’s cognitive
taxonomy (Bloom et al., 1956). So, if you make certain that your stimulus material for
multiple binary-choice items contains new content, those items will surely be more
intellectually demanding than the run-of-the-memory true–false item.

In review, we’ve considered an item type that’s not widely used but has some
special virtues. The major advantage of multiple binary-choice items is that students
can respond to two or three such items in the time it takes them to respond to a single
multiple-choice item. Other things being equal, the more items students respond to,
the more reliably we can gauge their abilities.

I must confess to a mild bias toward this oft-overlooked item type because for
over 20 years I used such tests as the final examinations in an introductory educa-
tional measurement course I taught in the UCLA Graduate School of Education.
I’d give my students 20 one- or two-paragraph descriptions of previously unen-
countered educational measurement situations and then follow up each description
with five binary-choice items. In all, then, I ended up with a 100-item final exam
that really seemed to sort out those students who knew their stuff from those who
didn’t. The five items in each cluster were simply statements to which students were
to respond accurate or inaccurate. I could just as appropriately have asked for true
or false responses, but I tried to add a touch of suave to my exams. After all, I was
teaching in a Graduate School of Education. (In retrospect, my decision seems some-
what silly.) Nonetheless, I had pretty good luck with my 100-item final exams. I
used 50-item versions for my midterm exams, and they worked well too. For certain
kinds of purposes, I think you’ll find multiple binary-choice items will prove useful.

Multiple-Choice Items
For a number of decades, the multiple-choice test item has dominated achieve-
ment testing in the United States and many other nations. Multiple-choice items
can be used to measure a student’s possession of knowledge or a student’s abil-
ity to engage in higher levels of thinking. A strength of multiple-choice items
is that they can contain several answers differing in their relative correctness.
Thus, the student can be called on to make subtle distinctions among answer
options, several of which may be somewhat correct. A weakness of multiple-
choice items, as is the case with all selected-response items, is that students need
only recognize a correct answer. Students need not generate a correct answer.
Although a fair amount of criticism has been heaped on multiple-choice items,
particularly in recent years, properly constructed multiple-choice items can tap
a rich variety of student skills and knowledge, and thus they can be useful tools
for classroom assessment.

The first part of a multiple-choice item, as noted earlier, is referred to as the
item’s stem. The potential answer options are described as item alternatives.
Incorrect alternatives are typically referred to as item distractors. Two common
ways of creating multiple-choice items are to use an item stem that is either (1)
a direct question or (2) an incomplete statement. With younger students, the
direct-question approach is preferable. Using either direct-question stems or
incomplete-statement stems, a multiple-choice item can ask students to select
either a correct answer or a best answer.

In Figure 6.2 there are examples of a direct-question item requesting a best-answer response (indicated by an asterisk) and an incomplete-statement item requesting a correct-answer response (also indicated by an asterisk). One frequently cited advantage of using a best-answer rather than a correct-answer to these sorts of items is that when students are faced with a set of best-answer choices, we can see whether they are able to make distinctions among subtle differences in a set of answers—all of which are technically "correct." Moreover, because best-answer items can require the student to do some serious thinking when choosing among a set of options, these sorts of items can clearly elicit higher-order cognitive thinking from students. And, of course, that's usually a good thing to do.

Direct-Question Form (best-answer version)

Which of the following modes of composition would be most effective in explaining to someone how a bill becomes a law in this nation?

  A. Narrative
* B. Expository
  C. Persuasive
  D. Descriptive

Incomplete-Statement Form (correct-answer version)

Mickey Mouse's nephews are named

  A. Huey, Dewey, and Louie.
  B. Mutt and Jeff.
  C. Larry, Moe, and Curly.
* D. Morty and Ferdie.

Figure 6.2 Illustrative Multiple-Choice Items

Let’s turn now to a consideration of item-category guidelines for
multiple-choice items. Because of the widespread use of multiple-choice
items over the past half-century, experience has generated quite a few
suggestions regarding how to create such items. Here, you’ll find five of
the more frequently cited item-specific recommendations for constructing
multiple-choice items.

Item-Writing Precepts for Multiple-Choice Items

1. The stem should consist of a self-contained question or problem.
2. Avoid negatively stated stems.
3. Do not let the length of alternatives supply unintended clues.
4. Randomly assign correct answers to alternative positions.
5. Never use "all-of-the-above" alternatives, but do use "none-of-the-above" alternatives to increase item difficulty.

Stem Stuffing

A properly constructed stem for a multiple-choice item will present a clearly described task to the student so the student can then get to work on figuring out which of the item's options is best (if it's a best-answer item) or is correct (if it's a correct-answer item). A poorly constructed stem for a multiple-choice item will force the student to read one or more of the alternatives in order to figure out what the item is getting at. In general, therefore, it's preferable to load as much of the item's content as possible into the stem. Lengthy stems and terse alternatives are, as a rule, much better than skimpy stems and long alternatives. You might try reading the stems of your multiple-choice items without any of the alternatives to see whether the stems (either direct questions or incomplete statements) make sense all by themselves.

Knocking Negatively Stated Stems

It has been alleged, particularly by overly cautious individuals, that "one robin does not spring make." Without debating the causal relationship between seasonal shifts and feathered flyers, in multiple-choice item writing we could say with confidence that "one negative in an item stem does not confusion unmake." Negatives are strange commodities. A single not, tossed casually into a test item, can make students crazy. Besides, because not is such a tiny word, and might be overlooked by students, a number of students (who didn't see the not) may be trying to ferret out the best alternative for a positively stated stem that, in reality, is negative.

For example, let’s say you wanted to do a bit of probing of your students’
knowledge of U.S. geography and phrased the stem of a multiple-choice item
like this: “Which one of the following cities is located in a state west of the Missis-
sippi River?” If your alternatives were: A. San Diego, B. Pittsburgh, C. Boston, and
D. Atlanta, students would have little difficulty in knowing how to respond. Let’s
say, however, that you decided to add a dollop of difficulty by using the same alter-
natives but tossing in a not. Now your item’s stem might read something like this:
“Which one of the following cities is not located in a state east of the Mississippi
River?” For this version of the item, the student who failed to spot the not (this,
in psychometric circles, might be known as not-spotting) would be in big trouble.

By the way, note that in both stems, the student was asked to identify
which one of the following answers was correct. If you leave the one out, a
student might have interpreted the question to mean that two or more cities
were being sought.

If there is a compelling reason for using a negative in the stem of a
multiple-choice item, be sure to highlight the negative with italics, boldface type,
or underscoring so that students who are not natural not-spotters will have a fair
chance to answer the item correctly.

Attending to Alternative Length
Novice item-writers often fail to realize that the length of a multiple-choice item’s
alternatives can give away what the correct answer is. Let’s say choices A, B, and
C say blah, blah, blah, but choice D says blah, blah, blah, blah, blah, and blah.
The crafty student will be inclined to opt for choice D not simply because one
gets many more blahs for one’s selection, but because the student will figure out
the teacher has given so much attention to choice D that there must be something
special about it.

Thus, when you’re whipping up your alternatives for multiple-choice
items, try either to keep all the alternatives about the same length or, if this isn’t
possible, to have at least two alternatives be of approximately equal length. For
instance, if you were using four-alternative items, you might have two fairly
short alternatives and two fairly long alternatives. What you want to avoid
is having the correct alternative be one length (either short or long) while the
distractors are all another length.

Incidentally, the number of alternatives is really up to you. Most fre-
quently, we see multiple-choice items with four or five alternatives. Because
students can guess correct answers more readily with fewer alternatives, three

alternatives are not seen all that often, except with younger students. Having
more than five alternatives puts a pretty heavy reading load on the student.
Many teachers usually employ four alternatives in their own multiple-choice
tests, but in a few instances, the nature of the test’s content leads them to use
three or five alternatives.

Assigning Correct Answer-Positions
A fourth guideline for writing multiple-choice items is to make sure you scatter
your correct answers among your alternatives, so students don’t “guess their way
to high scores” simply by figuring out your favorite correct answer-spot is, for
instance, choice D or perhaps choice C. Many novice item-writers are reluctant
to put the correct answer in the choice-A position because they believe that gives
away the correct answer too early. Yet the choice-A position deserves its share of
correct answers too. Absence-of-bias should also apply to answer-choice options.

As a rule of thumb, if you have four-alternative items, try to assign
approximately 25 percent of your correct answers to each of the four positions. But
try to avoid always assigning exactly 25 percent of the correct answers to each
position. Students can be remarkably cunning in detecting your own cunning-
ness. It may be necessary to do some last-minute shifting of answer positions in
order to achieve what is closer to a random assignment of correct answers to the
available positions. But always do a last-minute check on your multiple-choice
tests to make sure you haven’t accidentally allowed your correct answers to
appear too often in a particular answer-choice position.
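One low-effort way to approximate this random assignment is to let a short script shuffle each item's alternatives and then tally where the keyed answers land. The data structure and helper below are hypothetical conveniences for illustration; the chapter itself prescribes no particular tool.

```python
import random
from collections import Counter

def shuffle_key_positions(items, seed=None):
    """Shuffle each item's alternatives and report where the keyed answers land.

    `items` is a hypothetical structure used only for this illustration:
    a list of dicts with 'stem', 'correct', and 'distractors' keys, where
    each item offers the correct answer plus up to four distractors.
    """
    rng = random.Random(seed)
    shuffled, positions = [], Counter()
    for item in items:
        options = [item["correct"]] + list(item["distractors"])
        rng.shuffle(options)
        key_letter = "ABCDE"[options.index(item["correct"])]
        positions[key_letter] += 1
        shuffled.append({"stem": item["stem"], "options": options, "key": key_letter})
    return shuffled, positions

# After shuffling, inspect `positions`; if one letter has clearly captured too
# many of the keys, reshuffle (or hand-adjust a few items) before finalizing.
```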

Dealing with “Of-the-Above” Alternatives
Sometimes a beginning item-writer who’s trying to come up with four (or five)
reasonable alternatives will toss in a none-of-the-above or an all-of-the-above
alternative simply as a “filler” alternative. But both of these options must be con-
sidered carefully.

The fifth guideline for this item type says quite clearly that you should
never use the all-of-the-above alternative. Here’s why. Let’s say you’re using a
five-option type of multiple-choice item, and you want to set up the item so that
the fifth option (choice E), “all of the above,” would be correct. This indicates
that the first four answers, choices A through D, must all be correct. The prob-
lem with such an item is that even a student who knows only that two of the
first four alternatives are correct will be able to easily select the all-of-the-above
response because, if any two responses are correct, choice E is the only possible
best answer. Even worse, some students will read only the first alternative, see
that it is correct, mark choice A on their response sheet, and move on to the next
test item without going over the full array of enticing alternatives for the item.
For either of these reasons, you should never use an all-of-the-above alternative
in your multiple-choice items.

What about the none-of-the-above alternative? The guideline indicates
that it may be used when you wish to increase an item’s difficulty. You should
do so only when the presence of the none-of-the-above alternative will help
you make the kind of test-based inference you want to make. To illustrate,
let’s say you want to find out how well your students can perform basic math-
ematical operations such as multiplying and dividing. Moreover, you want
to be confident that your students can really perform the computations “in
their heads,” by using scratch paper, by employing an appropriate app with
their cell phones or digital tablets, or—if permitted—by renting a mathematics
consultant from a local rental agency. Now, if you use only four-alternative
multiple-choice items, there’s a real likelihood that certain students won’t
be able to perform the actual mathematical operations, but may be able to
select, by estimation, the answer that is most reasonable. After all, one of the
four options must be correct. Yet, when you simply add the none-of-the-above
option (as a fourth or fifth alternative), students can’t be sure that the correct
answer is silently sitting there among the item’s alternatives. To determine
whether the correct answer is really one of the alternatives provided for the
item, students will be obliged to perform the required mathematical operation
and come up with the actual answer. In essence, when the none-of-the-above
option is added, the task presented to the student more closely approximates
the task in which the teacher is interested. A student’s chance of guessing the
correct answer to a multiple-choice item is markedly less likely when a none-
of-the-above option shows up in the item.

Here’s a little wrinkle on this guideline you might find useful. Be a bit careful
about using a none-of-the-above option for a best-answer multiple-choice item.
Care is warranted because there are wily students waiting out there who’ll parade
a series of answers for you that they regard as better than the one option you think
is a winner. If you can cope with such carping, there’s no problem. If you have
a low carping-coping threshold, however, avoid none-of-the-above options for
most—if not all—of your best-answer items.

If you opt for the use of multiple-choice items as part of your
classroom-assessment repertoire, you should be prepared to respond to students'
questions about correction-for-guessing scoring. Indeed, even if the teacher does
not use any sort of correction-for-guessing when scoring students’ choices, it is
often the case that—because such scoring approaches are used with certain stan-
dardized tests—some students (or some parents) will raise questions about such
scoring. The formula used on certain standardized tests to minimize guessing
typically subtracts, from the number of correctly answered items, a fraction based
on the number of items answered incorrectly, but not those items for which no
response was given by the student.

For example, suppose that on a 50-item test composed of 4-option
multiple-choice items, a student answered 42 items correctly, but also incorrectly
answered 8 items. That student, instead of receiving a raw score of 42 items correct,
would have 2 points subtracted—resulting in a score of 40 correct. It is assumed


that a student who guessed blindly on 8 four-option items would, by chance alone, be
expected to answer about 2 of them correctly (one correct guess for every four
attempts). If another student also answered 42 items correctly, but did not guess on
any items—preferring instead to leave them blank—then that more cautious stu-
dent would retain a score of 42 correct. Few teachers employ this sort of correction
formula for their own classroom tests, but teachers should have at least a ball-park
notion of what makes such formulae tick. (Please note the ritzy Latin plural of
“formula” just employed. Clearly, this is a classy book that you’re reading.)
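
For readers who like to see the arithmetic spelled out, here is a minimal sketch of the correction logic just described. The function name and parameters are illustrative; the deduction mirrors the chapter's example (one point deducted for every four wrong answers on four-option items), and a comment notes the commonly published variant that divides by the number of options minus one.

    def corrected_score(num_correct, num_wrong, num_options=4, penalty_divisor=None):
        """Correction-for-guessing as illustrated above: deduct a fraction of the
        wrongly answered items; omitted (blank) items carry no penalty.

        The chapter's example deducts one point for every `num_options` wrong
        answers. Many published versions of the formula divide by
        num_options - 1 instead; pass penalty_divisor explicitly to use that variant.
        """
        divisor = penalty_divisor if penalty_divisor is not None else num_options
        return num_correct - num_wrong / divisor

    # The chapter's example: 42 right, 8 wrong, 4-option items -> 42 - 8/4 = 40.
    print(corrected_score(42, 8))     # 40.0
    # The more cautious student who left those items blank: 42 right, 0 wrong -> 42.
    print(corrected_score(42, 0))     # 42.0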

In review, we’ve just sauntered through a cursory look at five guidelines for
the creation of multiple-choice items. There are other, more subtle suggestions
for creating such items, but if you combine these guidelines with the five general
item-writing precepts presented earlier in the chapter, you’ll have a good set of
ground rules for devising decent multiple-choice items. As with all the item types
being described, there’s no substitute for oodles of practice in item writing that’s
followed by collegial or supervisorial reviews of your item-writing efforts. It is
said that Rome wasn’t built in a day. Similarly, in order to become a capable con-
structor of multiple-choice items, you’ll probably need to develop more than one
such test—or even more than two.

Decision Time
Multiple-Guess Test Items?

For all six years that he has taught, Viraj Patel has
worked with fifth-graders. Although Viraj enjoys all
content areas, he takes special pride in his reading
instruction, because he believes his students enter the
sixth grade with dramatically improved comprehension
capabilities. Viraj spends little time trying to promote
his students’ isolated reading skills but, instead,
places great emphasis on having students “construct
their own meanings” from what they have read.

All of Viraj’s reading tests consist exclusively of
four-option multiple-choice items. As Viraj puts it,
“I can put together a heavy-duty multiple-choice
item when I put my mind to it.” Because he has put
his mind to it many times during the past six years,
Viraj is quite satisfied that his reading tests are as
good as his reading instruction.

At the most recent open house night at his
school, however, a group of five parents registered
genuine unhappiness with the exclusive multiple-
choice makeup of Viraj’s reading tests. The parents

had obviously been comparing notes prior to the
open house, and Mrs. Davies (the mother of one of
Viraj’s fifth-graders) acted as their spokesperson. In
brief, Mrs. Davies argued that multiple-choice tests
permitted even weak students to “guess their way
to good scores.” “After all,” Mrs. Davies pointed
out, “do you want to produce good readers or
good guessers?”

Surprised and somewhat shaken by this
incident, Viraj has been rethinking his approach
to reading assessment. He concludes that he can
(1) stick with his tests as they are, (2) add some
short-answer or essay items to his tests, or
(3) replace all his multiple-choice items with
open-ended items. As he considers these
options, he realizes that he has given himself a
multiple-choice decision!

If you were Viraj, what would your
decision be?


Directions: On the line to the left of each military conflict listed in
Column A, write the letter of the U.S. president in Column B who was in
office when that military conflict was concluded. Each name in Column B
may be used no more than once.

Column A                         Column B

____ 1. World War I              A. Bush (the father)
____ 2. World War II             B. Clinton
____ 3. Korea                    C. Eisenhower
____ 4. Vietnam                  D. Johnson
____ 5. First Persian Gulf       E. Nixon
                                 F. Roosevelt
                                 G. Truman
                                 H. Wilson

Figure 6.3 An Illustrative Matching Item

Matching Items
A matching item consists of two parallel lists of words or phrases requiring the stu-
dent to match entries on one list with appropriate entries on the second list. Entries
in the list for which a match is sought are widely referred to as premises. Entries in
the list from which selections are made are referred to as responses. Usually, students
are directed to match entries from the two lists according to a specific kind of asso-
ciation described in the test directions. Figure 6.3 is an example of a matching item.

Notice that in Figure 6.3’s illustrative matching item, both lists are homogeneous.
All of the entries in the column at the left (the premises) are U.S. military conflicts,
and all of the entries in the column at the right (the responses) are names of U.S. presi-
dents. Homogeneity is an important attribute of properly constructed matching items.

An advantage of matching items is that their compact form takes up little
space on a printed page or a computer screen, thus making it easy to consider
a good deal of information efficiently. Matching items (presented on paper
tests) can be easily scored by simply holding a correct-answer template next
to the list of premises where students are to supply their selections from the
list of responses. A disadvantage of matching items is that, like binary-choice
items, they sometimes encourage students’ memorization of low-level factual
information that, in at least some instances, is of debatable utility. The illus-
trative matching item is a case in point. Although it’s relatively easy to create
matching items such as this, is it really important to know which U.S. chief
executive was in office when a military conflict was concluded? That’s the kind
of issue you’ll be facing when you decide what sorts of items to include in your
classroom assessments.


Typically, matching items are used as part of a teacher’s assessment arsenal.
It’s pretty difficult to imagine a major classroom examination consisting exclu-
sively of matching items. Matching items don’t work well when teachers are try-
ing to assess relatively distinctive ideas, because matching items require pools of
related entries to insert into the matching format.

Let’s consider a half-dozen guidelines you should think about when creating
matching items for your classroom assessment instruments. The guidelines are
presented here.

Employing Homogeneous Entries
As noted earlier, each list in a matching item should consist of homogeneous
entries. If you really can’t create a homogeneous set of premises and a homo-
geneous set of responses, you shouldn’t be mucking about with matching
items.

Going for Relative Brevity
From the student’s perspective, it’s much easier to respond to matching items
if the entries in both lists are relatively few in number. About 10 or so premises
should be the upper limit for most matching items. The problem with longer lists
is that students spend so much time trying to isolate the appropriate response
for a given premise that they may forget what they’re attempting to find. Very
lengthy sets of premises or responses are almost certain to cause at least some
students difficulty in responding because they lose track of what’s being sought.
It would be far better to take a lengthy matching item with 24 premises, then split
it into three 8-premise matching items.

In addition, to cut down on the reading requirements of matching items,
be sure to place the list of shorter words or phrases at the right. In other words,
make the briefer entries the responses. In this way, when students are scanning
the response lists for a matching entry, they won’t be obliged to read too many
lengthy phrases or sentences.

Item-Writing Precepts for Matching Items

1. Employ homogeneous lists.
2. Use relatively brief lists, placing the shorter words or phrases at the right.
3. Employ more responses than premises.
4. Order the responses logically.
5. Describe the basis for matching and the number of times responses may be used.
6. Place all premises and responses for an item on a single page (or screen).


Loading Up on Responses
A third guideline for the construction of matching items is to make sure there
are at least a few extra responses. Otherwise, if the numbers of premises and
responses are identical, the student who knows, say, 80 percent of the matches
to the premises may be able to figure out the remaining matches by a process of
elimination. A few extra responses reduce this likelihood substantially. Besides,
these sorts of responses are inexpensive. Be a big spender!

Ordering Responses
So that you’ll not provide unintended clues to students regarding which responses
go with which premises, it’s a good idea in matching items to order the responses

Parent Talk
Assume that you’ve been using a fair number
of multiple-choice items in your classroom
examinations. Benito’s parents, Mr. and
Mrs. Olmedo, have set up a 15-minute conference
with you during a Back-to-Classroom Night to talk
about Benito’s progress.

When they arrive, they soon get around to the
topic in which they are most interested—namely,
your multiple-choice test items. As Mr. Olmedo puts
it, “We want our son to learn to use his mind, not his
memory. Although my wife and I have little experience
with multiple-choice tests because almost all of our
school exams were of an essay nature, we believe
multiple-choice tests measure only what Benito
has memorized. He has a good memory, as you’ve
probably found out, but we want more for him. Why
are you using such low-level test items?”

If I were you, here’s how I’d respond to
Mr. Olmedo’s question:

“I appreciate your coming in to talk about Benito’s
education and, in particular, the way he is being
assessed. And I also realize that recently there has
been a good deal of criticism of multiple-choice test
items in the press. More often than not, such criticism
is altogether appropriate. In far too many cases,
because multiple-choice items are easy to score,
they’re used to assess just about everything. And, all

too often, those kinds of items do indeed ask students
to do little more than display their memorization skills.

“But this doesn’t need to be the case.
Multiple-choice items, if they are carefully
developed, can assess a wide variety of truly
higher-order thinking skills. In our school, all
teachers have taken part in a series of staff-
development workshop sessions in which every
teacher learned how to create challenging
multiple-choice items that require students
to display much more than memory.” (At this
point, you might whip out a few examples of
demanding multiple-choice items used in your
own tests and then go through them, from stem
to alternatives, showing Mr. and Mrs. Olmedo
what you meant. If you don’t have any examples
of such items from your tests, you might think
seriously about the possible legitimacy of
Benito’s parents’ criticism.)

“So, although there’s nothing wrong with
Benito’s acquiring more memorized information, and
a small number of my multiple-choice items actually
do test for such knowledge, the vast majority of
the multiple-choice items that Benito will take in my
class call for him to employ that fine mind of his.

“That’s what you two want. That’s what I want.”

Now, how would you respond to Mr. and
Mrs. Olmedo’s concerns?


in some sort of logical fashion—for example, alphabetical or chronological
sequence. Notice that the names of the U.S. presidents are listed alphabetically in
the illustrative matching item in Figure 6.3.

Describing the Task for Students
The fifth guideline for the construction of matching items suggests that the direc-
tions for an item should always make explicit the basis on which the matches
are to be made and, at the same time, the number of times a response can be
used. The more clearly students understand how they’re supposed to respond,
the more accurately they’ll respond, and the more validly you’ll be able to make
score-based inferences about your students.

Same-Page Formatting
The final guideline for this item type suggests that you make sure all premises
and responses for a matching item are on a single page or a solo electronic screen.
Not only does this eliminate the need for massive (and potentially disruptive)
page turning, or its quieter on-screen counterpart of scrolling back and forth, but it
also decreases the likelihood that students will overlook correct answers merely
because these answer choices were on the “other” page.

Matching items, if employed judiciously, can efficiently assess your stu-
dents’ knowledge. The need to employ homogeneous lists of related content
tends to diminish the applicability of this type of selected-response item.
Nonetheless, if you are dealing with content that can be addressed satisfacto-
rily by such an approach, you’ll find matching items a useful member of your
repertoire of item types.
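
Teachers who keep their items in digital form can turn several of these precepts into quick automated checks. The Python sketch below is only an illustration under that assumption; the class, field names, and ten-premise ceiling are not part of the text, and homogeneity and same-page formatting are deliberately left to human judgment. It uses the Figure 6.3 item as sample data.

    from dataclasses import dataclass, field

    @dataclass
    class MatchingItem:
        directions: str
        premises: list          # e.g., military conflicts (the longer entries, at left)
        responses: list         # e.g., U.S. presidents (the shorter entries, at right)
        answer_key: dict = field(default_factory=dict)  # premise -> correct response

        def precept_warnings(self, max_premises=10):
            """Rough, automatable checks against several of the precepts above;
            homogeneity and same-page formatting still need human judgment."""
            warnings = []
            if len(self.responses) <= len(self.premises):
                warnings.append("Add extra responses so elimination guessing is harder.")
            if len(self.premises) > max_premises:
                warnings.append("Premise list is long; consider splitting the item.")
            if self.responses != sorted(self.responses):
                warnings.append("Order the responses logically (e.g., alphabetically).")
            if "may be used" not in self.directions.lower():
                warnings.append("State how many times each response may be used.")
            return warnings

    item = MatchingItem(
        directions=("Write the letter of the U.S. president who was in office when "
                    "each conflict was concluded. Each name may be used no more than once."),
        premises=["World War I", "World War II", "Korea", "Vietnam", "First Persian Gulf"],
        responses=["Bush (the father)", "Clinton", "Eisenhower", "Johnson",
                   "Nixon", "Roosevelt", "Truman", "Wilson"],
    )
    print(item.precept_warnings() or "No precept warnings.")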

What Do Classroom Teachers
Really Need to Know About
Selected-Response Tests?
If you engage in any meaningful amount of assessment in your own classroom, it’s
quite likely you’ll find that selected-response items will be useful. Selected-response
items can typically be used to ascertain students’ mastery of larger domains of con-
tent than is the case with constructed-response kinds of test items. Although it’s
often thought that selected-response items must measure only lower-order kinds
of cognitive capabilities, inventive teachers can create selected-response options
to elicit very high levels of cognitive skills from their students.

As for the four types of items treated in this chapter, you really need to under-
stand enough about each kind to help you decide whether one or more of those
item types would be useful for a classroom assessment task you have in mind. If
you do decide to use binary-choice, multiple binary-choice, multiple-choice, or


matching items in your own tests, then you’ll find that the sets of item-writing
guidelines for each item type will come in handy.

In Chapter 12, you will learn how to employ selected-response items as
part of the formative-assessment process. Rather than relying completely on
paper-and-pencil versions of these items, you will see how to employ such items in
a variety of less formal ways. Yet, in honesty, even if you adhere to the five general
item-writing precepts discussed here as well as to the set of guidelines for particu-
lar types of items, you’ll still need practice in constructing selected-response items
such as those considered here. As was suggested several times in the chapter, it is
exceedingly helpful if you can find someone with measurement moxie, or at least
an analytic mind, to review your test-development efforts. It’s difficult in any realm
to improve if you don’t get feedback about the adequacy of your efforts. That’s
surely true with classroom assessment. Try to entice a colleague or supervisor to
look over your selected-response items to see what needs to be strengthened. How-
ever, even without piles of practice, if you adhere to the item-writing guidelines
provided here, your selected-response tests won’t be all that shabby.

The most important thing to learn from this chapter is that there are four
useful selected-response procedures for drawing valid inferences about your stu-
dents’ status. The more assessment options you have at your disposal, the more
appropriately you’ll be able to assess those students’ status with respect to the
variables in which you’re interested.

But What Does This Have to Do with Teaching?
In this chapter, not only have you become acquainted
with varied types of test items, but you have also
learned how to construct them properly. Why, you
might ask, does a classroom teacher need to possess
such a resplendent repertoire of item types? After
all, if a student knows that Abraham Lincoln was
assassinated while attending a play, the student could
satisfactorily display such knowledge in a variety of
ways. A good old true–false item or a nifty multiple-
choice item would surely do the necessary assessing.

Well, that’s quite true. But what you need to
recall, from a teaching perspective, is that students
ought to master the skills and knowledge they’re
being taught so thoroughly that they can display
such mastery in a good many ways, not just one. A
cognitive skill that’s been learned well by a student
will be a cognitive skill that can be displayed in all
sorts of ways, whether via selected-response items
or constructed-response items.

And this is why, when you teach children, you
really need to be promoting their generalizable
mastery of whatever’s being taught. Consistent
with such a push for generalizable mastery, you
will typically want to employ varied assessment
approaches. If you gravitate toward only one or two
item types (that is, if you become “Multiple-Choice
Maria” or “True–False Fredrico”), then your students
will tend to learn things only in a way that meshes
with your favored item type.

Eons ago, when I was preparing to be a teacher,
my teacher education professors advised me to
employ different kinds of items “for variety’s sake.”
Well, classroom teachers aren’t putting on a fashion
show. Variety for its own sake is psychometrically
stupid. But using a variety of assessment
approaches as a deliberate way of promoting and
measuring students’ generalizable mastery of what’s
been taught—this is psychometrically suave.


Chapter Summary
The chapter began with a presentation of five item-
writing precepts that pertain to both constructed-
response and selected-response items. The five
admonitions directed teachers to avoid unclear
directions, ambiguous statements, unintentional
clues, complex syntax, and hypersophisticated
vocabulary.

Consideration was then given to the four
most common kinds of selected-response
test items: binary-choice items, multiple binary-choice items, multiple-choice items, and
matching items. For each of these four item types,
after a brief description of the item type and its
strengths and weaknesses, a set of item-writing
guidelines was presented. These four sets of
guidelines were presented on pages 170, 173, 176,
and 181. Each guideline was briefly discussed.
Readers were encouraged to consider the four
item types when deciding on an answer to the
how-to-assess-it question.



Chapter 7

Constructed-Response Tests

Chief Chapter Outcome

A sufficient understanding of generally approved guidelines for
creating constructed-response items, and scoring students’ responses
to them, so that errors in item-construction and response-scoring can
be identified and remedied

Learning Objectives

7.1 Define and distinguish guidelines for constructed-response items that are
task-understandable to test-takers.

7.2 Identify and employ evaluative criteria—holistically or analytically—when
judging the quality of students’ responses.

You’re going to learn about constructed-response tests in this chapter. To be truth-
ful, you're going to learn about only two kinds of paper-and-pencil constructed-
response items—namely, short-answer items and essay items (including students’
written compositions). Although you probably know that “you can’t tell a
book by its cover,” now you’ve discovered that a chapter’s title doesn’t always
describe its contents accurately either. Fortunately, there are no state or federal
“truth-in-titling” laws.

You might be wondering why your ordinarily honest, never-fib author has
descended to this act of blatant mislabeling. Actually, it’s just to keep your reading
chores more manageable. Earlier, you learned that student-constructed responses
can be obtained from a wide variety of item types. In this chapter, we’ll be look-
ing at two rather traditional forms of constructed-response items, both of them
paper-and-pencil in nature. In the next chapter, the focus will be on performance
tests, such as those that arise when we ask students to make oral presentations


or supply comprehensive demonstrations of complex skills in class. After that, in
Chapter 9, we’ll be dealing with portfolio assessment and how portfolios are used
for assessment purposes.

Actually, all three chapters could be lumped under the single description of
performance assessment or constructed-response measurement. This is because anytime
you assess your students by asking them to respond in other than a make-a-choice
manner, the students are constructing; that is, they are performing. It’s just that if
all of this performance assessment stuff had been crammed into a single chapter,
you’d have thought you were experiencing a season-long TV mini-series. Yes, it
was an inherent concern for your well-being that led to a flagrant disregard for
accuracy when titling this chapter.

The major payoff of all constructed-response items is that they elicit student
responses more closely approximating the kinds of behavior that students must
display in real life. After students leave school, for example, the demands of daily
living almost never require them to choose responses from four nicely arranged
alternatives. And when was the last time, in normal conversation, you were
obliged to render a flock of true–false judgments about a set of statements that
were presented to you? Yet, you may well be asked to make a brief oral presenta-
tion to your fellow teachers or to a parent group, or you may be asked to write a
brief report for the school newspaper about your students’ field trip to City Hall.
Constructed-response tasks unquestionably coincide more closely with custom-
ary nonacademic tasks than do selected-response tasks.

As a practical matter, if the nature of a selected-response task is sufficiently
close to what might be garnered from a constructed-response item, then you may
wish to consider a selected-response assessment tactic to be a reasonable sur-
rogate for a constructed-response assessment tactic. Selected-response tests are
clearly much more efficient to score. And, because almost all teachers are
busy folks, time-saving procedures are not to be scoffed at. Yet there will be situ-
ations wherein you’ll want to make inferences about your students’ status when
selected-response tests just won’t fill the bill. For instance, if you wish to know
what kind of a cursive writer Florio is, then you’ll have to let Florio write cur-
sively. A true–false test about i dotting and t crossing just doesn’t cut it.

Given the astonishing technological advances we see every other week, what
today is typically a paper-and-pencil test is apt, perhaps by tomorrow or the follow-
ing day, to be some sort of digitized assessment linked to an exotic outer-space satel-
lite. It may become quite commonplace for teachers to develop computer-generated
exams themselves. However, because such a digital-assessment derby does not yet
surround us—and some classroom teachers have no idea about how to construct a
flock of electronically dispensed items—in this chapter the guidelines and illustra-
tions will tend to reflect paper-and-pencil assessment. Happily, most of the guide-
lines you encounter here will apply with equal force to either paper-and-pencil
or electronically dispensed assessments. While digital assessment is becoming
increasingly commonplace, the same test-design principles that one would use
when designing a pen-and-paper test still hold true.


Short-Answer Items
The first kind of constructed-response item we’ll look at is the short-answer item.
These types of items call for students to supply a word, a phrase, or a sentence
in response to either a direct question or an incomplete statement. If an item asks
students to come up with a fairly lengthy response, it is considered an essay item,
not a short-answer item. If the item asks students to supply only a single word,
then it’s a really short-answer item.

Short-answer items are suitable for assessing relatively simple kinds of learn-
ing outcomes such as those focused on students’ acquisition of knowledge. If
crafted carefully, however, short-answer items can measure substantially more
challenging kinds of learning outcomes. The major advantage of short-answer
items is that students need to produce a correct answer, not merely recognize it
from a set of selected-response options. The level of partial knowledge that might
allow a student to respond correctly to a multiple-choice item won’t be sufficient
if the student is required to generate a correct answer to a short-answer item.

The major drawback with short-answer items, as is true with all
constructed-response items, is that students’ responses are difficult to score.
The longer the responses sought, the tougher it is to score them accurately. And
inaccurate scoring, as we saw in Chapter 3, leads to reduced reliability—which,
in turn, reduces the validity of the test-based interpretations we make about
students—which, in turn, reduces the quality of the decisions we base on those
interpretations. Educational measurement is much like the rest of life—it’s simply
loaded with trade-offs. When classroom teachers choose constructed-response
tests, they must be willing to trade some scoring accuracy (the kind of accu-
racy that comes with selected-response tests) for greater congruence between
constructed-response assessment strategies and the kinds of student behaviors
about which inferences are to be made.

Here, you will find five straightforward item-writing guidelines for
short-answer items. Please look them over briefly. Thereafter, each guideline will
be amplified by describing the essence of how it works.

Item-Writing Guidelines for Short-Answer Items

1. Usually employ direct questions rather than incomplete statements, particularly for young students.
2. Structure the item so that a response should be concise.
3. Place blanks in the margin for direct questions or near the end of incomplete statements.
4. For incomplete statements, use only one or, at most, two blanks.
5. Make sure blanks for all items are equal in length.


Using Direct Questions Rather Than
Incomplete Statements
For young children, the direct question is a far more familiar format than the
incomplete statement. Accordingly, such students will be less confused if
direct questions are employed. Another reason why short-answer items should
employ a direct-question format is that the use of direct questions typically
forces the item-writer to phrase the item so that less ambiguity is present. With
incomplete-statement formats, there’s often too much temptation simply to delete
words or phrases from statements the teacher finds in texts. To make sure there
isn’t more than one correct answer to a short-answer item, it is often helpful if
the item-writer first decides on the correct answer and then builds a question or
incomplete statement designed to elicit a unique correct response from knowl-
edgeable students.

Nurturing Concise Responses
Responses to short-answer items, as might be inferred from what they’re offi-
cially called, should be short. Thus, no matter whether you’re eliciting responses
that are words, symbols, phrases, or numbers, try to structure the item so a brief
response is clearly sought. Suppose you conjured up an incomplete statement
item such as this: “An animal that walks on two feet is a __________.” There are
all sorts of answers a student might legitimately make to such a too-general item.
Moreover, some of those responses could be somewhat lengthy. Now note how a
slight restructuring of the item constrains the student: “An animal that walks on
two feet is technically classified as a __________.” By the addition of the phrase
“technically classified as,” the item-writer has restricted the appropriate responses
to only one—namely, “biped.” If your short-answer items are trying to elicit stu-
dents’ phrases or sentences, you may wish to place word limits on each, or at
least to indicate, in the test’s directions, that only a short one-sentence response is
allowable for each item.

Always try to put yourself, mentally, inside the heads of your students and
try to anticipate how they are apt to interpret what sort of response is needed by
an item. What this second guideline suggests is that you massage an item until it
truly lives up to its name—that is, until it becomes a bona fide short-answer item.

Positioning Blanks
If you’re using direct questions in your short-answer items, place the students’
response areas for all items near the right-hand margin of the page, immediately
after the item’s questions. By doing so, you’ll have all of a student’s responses
nicely lined up for scoring. If you’re using incomplete statements, try to place the
blank near the end of the statement, not near its beginning. A blank positioned
too early in a sentence tends to perplex the students. For instance, notice how this
too-early blank can lead to confusion: “The __________ is the governmental body


that, based on the United States Constitution, must ratify all U.S. treaties with
foreign nations.” It would be better to use a direct question or to phrase the item
as follows: “The governmental body that, based on the United States Constitution,
must ratify all U.S. treaties with foreign nations is the __________.”

Limiting Blanks
For incomplete-statement types of short-answer items, you should use only one
or two blanks. Any more blanks and the item can be labeled a “Swiss-cheese
item,” or an item with holes galore. Here’s a Swiss-cheese item to illustrate the
confusion that a profusion of blanks can inflict on what is otherwise a decent
short-answer item: “After a series of major conflicts with natural disasters, in
the year __________, the explorers __________ and __________, accompanied by
their __________, discovered __________.” The student who could supply correct
answers to such a flawed short-answer item could also be regarded as a truly
successful explorer!

Inducing Linear Equality
Too often in short-answer items, a beginning item-writer will give away the answer
by varying the length of the answer blanks so that short lines are used when short
answers are correct, and long lines are used when lengthier answers are correct.
This practice tosses unintended clues to students, so it should be avoided. In the
interest of linear egalitarianism, not to mention decent item writing, try to keep all
blanks for short-answer items equal in length. Be sure, however, that the length of
the answer spaces provided is sufficient for students’ responses—in other words,
not so skimpy that students must cram their answers in an illegible fashion.

Okay, let's review. Short-answer items are the simplest form of
constructed-response items, but they can help teachers measure important skills
and knowledge. Because such items seek students’ constructed rather than
selected responses, those items can be employed to tap some genuinely higher-
order skills. Although students’ responses to short-answer items are more difficult
to score than their answers to selected-response items, the scoring of such items
isn’t impossible. That’s because short-answer items, by definition, should elicit
only short answers.
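
A few of these short-answer guidelines can likewise be checked mechanically if items are stored as text. The sketch below is an illustrative lint pass, not an official checklist; it assumes answer blanks are written as runs of underscores and flags Swiss-cheese items, early blanks, and unequal blank lengths.

    import re

    def short_answer_warnings(items):
        """Flag possible violations of the short-answer guidelines above.
        Each item is a string; answer blanks are runs of underscores.
        The checks and wording are illustrative, not an official checklist."""
        warnings = []
        blank_lengths = set()
        for i, text in enumerate(items, start=1):
            blanks = re.findall(r"_{3,}", text)
            if not blanks:
                continue                      # a direct question answered in the margin
            if len(blanks) > 2:
                warnings.append(f"Item {i}: more than two blanks (a 'Swiss-cheese' item).")
            if text.find(blanks[0]) < len(text) / 2:
                warnings.append(f"Item {i}: first blank appears early; move it near the end.")
            blank_lengths.update(len(b) for b in blanks)
        if len(blank_lengths) > 1:
            warnings.append("Answer blanks vary in length; keep them equal.")
        return warnings

    items = [
        "An animal that walks on two feet is technically classified as a __________.",
        "The ______ is the governmental body that must ratify all U.S. treaties.",
    ]
    for w in short_answer_warnings(items):
        print(w)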

Essay Items: Development
The essay item is surely the most commonly used form of constructed-response
assessment item. Anytime teachers ask their students to churn out a paragraph or
two on what the students know about Topic X or to compose an original composi-
tion describing their “Favorite Day,” an essay item is being used. Essay items are
particularly useful in gauging a student’s ability to synthesize, evaluate, and com-
pose. Such items have a wide variety of applications in most teachers’ classrooms.


A special form of the essay item is the writing sample—when teachers ask
students to generate a written composition in an attempt to measure students’
composition skills. Because the procedures employed to construct items for
such writing samples and, thereafter, for scoring students’ compositions, are
so similar to the procedures employed to create and score responses to any
kind of essay item, we’ll treat writing samples and other kinds of essay items
all at one time at this point. You’ll find it helpful, however, to remember that
requiring students to generate a writing sample is, in reality, a widely used
type of performance test. We’ll dig more deeply into performance tests in the
following chapter.

For assessing certain kinds of complex learning outcomes, the essay item is
our hands-down winner. It clearly triumphs when you’re trying to see how well
students can create original compositions. Yet, there are a fair number of draw-
backs associated with essay items, and if you’re going to consider using such
items in your own classroom, you ought to know the weaknesses as well as the
strengths of this item type.

One difficulty with essay items is that they’re harder to write—at least to
write properly—than is generally thought. I must confess that as a first-year
high school teacher, I sometimes conjured up essay items while walking to
school and then slapped them up on the chalkboard so that I created almost
instant essay exams. At the time, I thought my essay items were pretty good.
Such is the pride of youth—and the consequence of naivete. I’m glad I have no
record of those items. In retrospect, I must assume that they were pretty putrid.
I now know that generating a really good essay item is a tough task—a task not
accomplishable while strolling to school. You’ll see this is true from the item-
writing rules to be presented shortly. It takes time to create a nifty essay item.
You’ll need to find time to construct suitable essay items for your own classroom
assessments.

The most serious problem with essay items, however, is the difficulty that
teachers have in reliably scoring students’ responses. Let’s say you use a six-item
essay test to measure your students’ ability to solve certain kinds of problems
in social studies. Suppose that, by some stroke of measurement magic, all your
students’ responses could be transformed into typed manuscript form so you
could not tell which response came from which student. Let’s say you were asked
to score the complete set of responses twice. What do you think is the likelihood
your two sets of scores would be consistent? Well, experience suggests that most
teachers often aren’t able to produce very consistent results when they score stu-
dents’ essay responses. The challenge in this instance, of course, is to increase the
reliability of your scoring efforts so that you’re not distorting the validity of the
score-based inferences you want to make on the basis of your students’ responses.
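
If you ever try the score-the-set-twice experiment described above, a rough consistency check takes only a few lines. The sketch below uses exact agreement and mean score difference as informal indicators; these particular metrics, and the 0-to-6 scale in the example, are illustrative choices, not a procedure prescribed in this chapter.

    def score_consistency(first_pass, second_pass):
        """A rough look at scoring consistency across two blind readings of the
        same essay responses: exact-agreement rate and mean absolute difference."""
        assert len(first_pass) == len(second_pass)
        exact = sum(a == b for a, b in zip(first_pass, second_pass)) / len(first_pass)
        mean_gap = sum(abs(a - b) for a, b in zip(first_pass, second_pass)) / len(first_pass)
        return exact, mean_gap

    # Scores (0-6 scale) a teacher gave the same six responses on two occasions.
    first  = [5, 3, 4, 6, 2, 4]
    second = [4, 3, 5, 6, 2, 3]
    agreement, gap = score_consistency(first, second)
    print(f"Exact agreement: {agreement:.0%}; mean score difference: {gap:.2f}")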

Although the potential of having computers gleefully and accurately score
students’ essays is often trotted out during discussions of essay scoring, for the
present it seems that really particularized scoring of students’ essay responses is
somewhat in the future for many teachers. Even though computer-based scoring


Decision Time
Forests or Trees?

Anh Nguyen is a brand-new English teacher
assigned to work with seventh-grade and eighth-
grade students at Dubois Junior High School.
Anh has taken part in a state-sponsored summer
workshop that emphasizes “writing as a process.”
Coupled with what she learned while completing
her teacher education program, Anh is confident
that she can effectively employ techniques such as
brainstorming, outlining, early drafts, peer critiquing,
and multiple revisions. She assumes that her
students not only will acquire competence in their
composition capabilities, but also will increase their
confidence about possessing those capabilities.
What Anh’s preparation failed to address, however,
was how to grade her students’ compositions.

Two experienced English teachers at Dubois
Junior High have gone out of their way to help Anh
get through her first year as a teacher. Mrs. Miller
and Mr. Quy had both been quite helpful during the
early weeks of the school year. However, when Anh
asked them, one day during lunch, how she should
judge the quality of her students’ compositions, two
decisively different messages were given.

Mr. Quy strongly endorsed holistic grading of
compositions—that is, a general appraisal of each
composition as a whole. Although Mr. Quy bases his
holistic grading scheme on a set of explicit criteria,
he believes a single “gestalt” grade should be given
so that “one’s vision of the forest is not obscured by
tree-counting.”

Arguing with equal vigor, Mrs. Miller urged
Anh to adopt analytic appraisals of her students’
compositions. “If you supply your students with a
criterion-by-criterion judgment of their work,” she
contended, “each student will be able to know
precisely what’s good and what isn’t.” (It was
evident, during their fairly heated interchanges,
that Mrs. Miller and Mr. Quy had disagreed about
this topic in the past.) Mrs. Miller concluded her
remarks by saying, “Forget about that forest-and-
trees metaphor, Anh. What we’re talking about
here is clarity!”

If you were Anh, how would you decide
to judge the quality of your students’
compositions?

programs can be taught to do some terrific scoring, the wide range of students’
potential content typically limits computer-based scoring to judging only the most
rudimentary sorts of good and bad moves by an essay writer. If the formidable
obstacles to tailored computer scoring ever get wrestled to the mat, classroom
teachers will have access to a sanity-saving way of scoring students' essays. This
is a scoring advance greatly to be sought.

Creating Essay Items
Because the scoring of essay responses (and students’ compositions) is such
an important topic, you’ll soon be getting a set of guidelines on how to score
responses to such items. The more complex the nature of students' constructed
responses becomes, as you'll see in the next two chapters, the more
attention you’ll need to lavish on scoring. You can’t score responses to items that
you haven’t yet written, however, so let’s look now at five guidelines for the
construction of essay items.


Item-Writing Guidelines for Essay Items

1. Convey to students a clear idea regarding the extensiveness of the response desired.
2. Construct items so the student's task is explicitly described.
3. Provide students with the approximate time to be expended on each item, as well as each item's value.
4. Do not employ optional items.
5. Precursively judge an item's quality by composing, mentally or in writing, a possible response.

Communicating the Extensiveness of
Students’ Responses Sought
It is sometimes thought that when teachers decide to use essay items, students have
total freedom of response. On the contrary, teachers can structure essay items so that
students produce (1) barely more than they would for a short-answer item or, in
contrast, (2) extremely lengthy responses. The two types of essay items that reflect
this distinction in the desired extensiveness of students’ responses are described as
restricted-response items and extended-response items.

A restricted-response item decisively limits the form and content of students’
responses. For example, a restricted-response item in a health education class
might ask students the following: “Describe the three most common ways in
which HIV is transmitted. Take no more than 25 words to describe each method
of transmission.” In this example, the number of HIV transmission methods was
specified, as was the maximum length for each transmission method’s description.

In contrast, an extended-response item provides students with far more lati-
tude in responding. Here’s an example of an extended-response item from a social
studies class: “Identify the chief factors contributing to the U.S. government’s
financial deficit during the past two decades. Having identified those factors,
decide which factors, if any, have been explicitly addressed by the U.S. legislative
and/or executive branches of government in the last five years. Finally, critically
evaluate the likelihood that any currently proposed remedies will bring about
significant reductions in the U.S. national debt.” A decent response to such an
extended-response item not only should get high marks from the teacher but
might also be the springboard for a student’s successful career in politics.

One technique that teachers commonly employ to limit students’ responses is
to provide a certain amount of space on the test paper, in their students’ response
booklets, or in the available response area on a computer screen. For instance,
the teacher might direct students to “Use no more than two sheets (both sides) in
your blank-lined blue books to respond to each test item” or “Give your answer in


the designated screen-space below.” Although the space-limiting ploy is an easy
one to implement, it really puts at a disadvantage those students who write in
a large-letter, scrawling fashion. Whereas such large-letter students may only be
able to cram a few paragraphs onto a page, those students who write in a small,
scrunched-up style may be able to produce a short novella in the same space.
Happily, when students respond to a computer-administered test “in the space
provided,” we can prohibit their shifting down to a tiny, tiny font that might allow
them to yammer endlessly.

This first guideline asks you to think carefully about whether the test-based
inference that you wish to make about your students is best served by students’
responses to (1) more essay items requiring shorter responses or (2) fewer essay
items requiring extensive responses. Having made that key decision, be sure to
make it clear to your students what degree of extensiveness you’re looking for
in their responses.
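
When restricted responses are collected electronically, a word cap such as the 25-word limit in the HIV example can be checked automatically. The snippet below is a minimal sketch under that assumption; the limit value and the whitespace-based word count are simple illustrative choices.

    def over_limit_responses(responses, word_limit=25):
        """List the responses that exceed a restricted-response word cap,
        returning (response number, word count) pairs."""
        return [
            (i, len(text.split()))
            for i, text in enumerate(responses, start=1)
            if len(text.split()) > word_limit
        ]

    responses = [
        "Through unprotected sexual contact with an infected partner.",
        ("Through sharing of needles or other injection equipment, which allows "
         "infected blood to pass directly from one person into another person's "
         "bloodstream, a risk that is greatly increased in certain settings."),
    ]
    for number, words in over_limit_responses(responses):
        print(f"Response {number} uses {words} words; the limit is 25.")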

Describing Students’ Tasks
Students will find it difficult to construct responses to tasks if they don’t under-
stand what the tasks are. Moreover, students’ responses to badly understood tasks
are almost certain to yield flawed inferences by teachers. The most important part
of an essay item is, without question, the description of the assessment task. It
is the task students respond to when they generate essays. Clearly, then, poorly
described assessment tasks will yield many off-target responses that, had the stu-
dent truly understood what was being sought, might have been more appropriate.

There are numerous labels used to represent the assessment task in an essay
item. Sometimes it’s simply called the task, the charge, or the assignment. In essay
items that are aimed at eliciting student compositions, the assessment task is often
referred to as a prompt. No matter how the assessment task is labeled, if you’re
a teacher who is using essay items, you must make sure the nature of the task is
really set forth clearly for your students. Put yourself in the student’s position and
see whether, with the level of knowledge possessed by most of your students, the
nature of the assessment task is apt to be understood.

To illustrate, if you wrote the following essay item, there’s little doubt that
your students’ assessment task would have been badly described: “In 500 words or
less, discuss democracy in Latin America.” In contrast, notice in the following item
how much more clearly the assessment task is set forth: “Describe how the checks
and balances provisions in the U.S. Constitution were believed by the Constitu-
tion’s framers to be a powerful means to preserve democracy (300–500 words).”

Providing Time-Limit and Item-Value Guidance
When teachers create an examination consisting of essay items, they often have an
idea regarding which items will take more of the students’ time. But students don’t
know what’s in the teacher’s head. As a consequence, some students will lavish
loads of attention on items that the teacher thought warranted only modest effort,


and yet will devote little time to items that the teacher thought deserved substan-
tial attention. Similarly, sometimes teachers will want to weight certain items more
heavily than others. Again, if students are unaware of which items count most,
they may toss reams of rhetoric at the low-value items and thus end up without
enough time to supply more than a trifling response to some high-value items.

To avoid these problems, there’s quite a straightforward solution—namely,
letting students in on the secret. If there are any differences among items in point
value or in the time students should spend on them, simply provide this informa-
tion in the directions or, perhaps parenthetically, at the beginning or end of each
item. Students will appreciate such clarifications of your expectations.
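
One low-tech way to honor this guideline is to generate the per-item point and time notations automatically when the exam is assembled. The Python sketch below is illustrative only; allocating minutes in proportion to point value is just one defensible convention, not a rule from this chapter.

    def format_essay_directions(items, total_minutes):
        """Produce per-item headers that share point values and suggested times
        with students. Times are allocated in proportion to point value."""
        total_points = sum(points for _, points in items)
        lines = [f"This exam has {len(items)} items, {total_points} points, "
                 f"and a {total_minutes}-minute time limit."]
        for number, (prompt, points) in enumerate(items, start=1):
            minutes = round(total_minutes * points / total_points)
            lines.append(f"{number}. {prompt} ({points} points; about {minutes} minutes)")
        return "\n".join(lines)

    items = [
        ("Describe how checks and balances were intended to preserve democracy.", 20),
        ("Define 'restricted-response' and 'extended-response' essay items.", 10),
    ]
    print(format_essay_directions(items, total_minutes=45))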

Avoiding Optionality
It’s fairly common practice among teachers who use essay examinations to pro-
vide students with a certain number of items and then let each student choose
to answer fewer than the number of items presented. For example, the teacher
might allow students to “choose any three of the five essay items presented.”
Students, of course, really enjoy such an assessment procedure because they can
respond to items for which they’re well prepared and avoid those items for which
they’re inadequately prepared. Yet, other than inducing student jubilation, this
optional-items assessment scheme has little going for it.

When students select different items from a menu of possible items, they
are actually responding to different examinations. As a consequence, it is impos-
sible to judge their performances on some kind of common scale. Remember, as
a classroom teacher you’ll be trying to make better educational decisions about
your students by relying on test-based interpretations regarding those students.
It’s tough enough to make a decent test-based interpretation when you have only
one test to consider. It’s far more difficult to make such interpretations when you
are faced with a medley of potentially dissimilar tests because you allowed your
students to engage in a personal mix-and-match measurement procedure.

In most cases, teachers rely on an optional-items procedure with essay items
when they’re uncertain about the importance of the content measured by the
items on their examinations. Such uncertainty gives rise to the use of optional
items because the teacher is not clearheaded about the inferences (and resulting
decisions) for which the examination’s results will be used. If you spell out those
inferences (and decisions) crisply, prior to the examination, you will generally
find you’ll have no need for optional-item selection in your essay examinations.

Previewing Students’ Responses
After you’ve constructed an essay item for one of your classroom assessments,
there’s a quick way to get a preliminary fix on whether the item is a winner or a
loser. Simply squeeze yourself, psychologically, into the head of one of your typi-
cal students, and then anticipate how such a student would respond to the item.
If you have time and are inclined to do so, you could even try writing a response


that the student might produce. More often than not, because you’ll be too busy
to conjure up such fictitious responses in written form, you might try to compose
a mental response to the item on behalf of the typical student you’ve selected.
An early mental run-through of how a student might respond to an item can
often help you identify deficits in the item, because when you put yourself, even
hypothetically, on the other side of the teacher’s desk, you’ll sometimes discover
shortcomings in items that you otherwise wouldn’t have identified. Too many
times teachers assemble a platoon of essay questions, send those soldiers into
battle on examination day, and only then discover that one or more of them are
not adequately prepared for the task at hand. Mental previewing of likely student
responses can help you detect such flaws while there’s still time for repairs.

In review, we’ve looked at five guidelines for creating essay items. If you
remember that all of these charming little collections of item-specific recommen-
dations should be adhered to, in addition to the five general item-writing precepts
set forth in Chapter 6 (see page 164), you’ll probably be able to come up with a
pretty fair set of essay items. Then, perish the thought, you’ll have to score your
students’ responses to those items. That’s what we’ll be looking at next.

Essay Items: Scoring Students’
Responses
Later in this text, you’ll be looking at how to evaluate students’ responses to perfor-
mance assessments in Chapter 8 and how to judge students’ portfolios in Chapter 9.
In short, you’ll be learning much more about how to evaluate your students’ perfor-
mances on constructed-response assessments. Thus here, to spread out the load a bit,
we’ll be looking only at how to score students’ responses to essay items (including
tests of students’ composition skills). You’ll find that many of the suggestions for
scoring students’ constructed responses you will encounter will also be applicable
when you’re trying to judge your students’ essay responses. But just to keep matters
simple, let’s look now at recommendations for scoring responses to essay items.

Guidelines for Scoring Responses to Essay Items

1. Score responses holistically and/or analytically.
2. Prepare a tentative scoring key in advance of judging students' responses.
3. Make decisions regarding the importance of the mechanics of writing prior to scoring.
4. Score all responses to one item before scoring responses to the next item.
5. Insofar as possible, evaluate responses anonymously.


Choosing an Analytic and/or Holistic Scoring
Approach
During the past several decades, the measurement of students’ composition skills
by having students generate actual writing samples has become widespread. As
a consequence of all this attention to students’ compositions, educators have
become far more skilled in evaluating students’ written compositions. Fortu-
nately, classroom teachers can use many of the procedures that were identified
and refined as educators scored thousands of students’ compositions during state-
wide assessment extravaganzas.

A fair number of the lessons learned about scoring students’ writing samples
apply quite nicely to the scoring of responses to any kind of essay item. One of the
most important of the scoring insights picked up from this large-scale scoring of
students’ compositions is that almost any type of student-constructed response
can be scored either holistically or analytically. This is why the first of the five
guidelines suggests that you make an early decision about whether you’re going
to score your students’ responses to essay items using a holistic approach or an
analytic approach or, perhaps, score them using a combination of the two scoring
approaches. Let’s look at how each of these two scoring strategies works.

A holistic scoring strategy, as its name suggests, focuses on the essay response
(or written composition) as a whole. At one extreme of scoring rigor, the teacher
can, in a somewhat unsystematic manner, supply a “general impression” overall
grade to each student’s response. Or, in a more systematic fashion, the teacher can
isolate, in advance of scoring, those evaluative criteria that should be attended to
in order to arrive at a single, overall score per essay. Generally, a score range of 4 to
6 points is used to evaluate each student’s response. (Some scoring schemes have
a few more points, some a few less.) A teacher, then, after considering whatever
factors should be attended to in a given item, will give a score to each student’s
response. Here is a set of evaluative criteria that teachers might use in holistically
scoring a student’s written composition.

Illustrative Evaluative Criteria to Be Considered When Scoring Students' Essay Responses Holistically
For scoring a composition intended to reflect students’ composition prowess:

• Organization
• Communicative Clarity
• Adaptation to Audience
• Word Choice
• Mechanics (spelling, capitalization, punctuation)


And now, here are four evaluative factors that a speech teacher might employ
in holistically scoring a response to an essay item used in a debate class.

Potential Evaluative Criteria to Be Used When Scoring Students' Essay Responses in a Debate Class
For scoring a response to an essay item dealing with rebuttal preparation:

• Anticipation of Opponent’s Positive Points
• Support for One’s Own Points Attacked by Opponents
• Isolation of Suitably Compelling Examples
• Preparation of a “Spontaneous” Conclusion

When teachers score students’ responses holistically, they do not dole out
points-per-criterion for a student’s response. Rather, the teacher keeps in mind
evaluative criteria such as those set forth in the previous two boxes. The speech
teacher, for instance, while looking at the student’s essay response to a question
about how someone should engage in effective rebuttal preparation, will not nec-
essarily penalize a student who overlooks one of the four evaluative criteria. The
response as a whole may lack one factor, yet otherwise represent a really terrific
response. Evaluative criteria such as those illustrated here simply dance merrily in
the teacher’s head when the teacher scores students’ essay responses holistically.

In contrast, an analytic scoring scheme strives to be a fine-grained, specific
point-allocation approach. Suppose, for example, that instead of using a holistic
method of scoring students’ compositions, a teacher chose to employ an ana-
lytic method. Under those circumstances, a scoring guide such as the example in
Figure 7.1 might be used by the teacher. Note that for each evaluative criterion in the guide, the teacher must award 0, 1, or 2 points. The lowest overall score for a student's composition, therefore, would be 0, whereas the highest overall score for a student's composition would be 10 (that is, 5 criteria × 2 points).

Factor                        Unacceptable (0 points)   Satisfactory (1 point)   Outstanding (2 points)
1. Organization
2. Communicative Clarity
3. Adaptation to Audience
4. Word Choice
5. Mechanics

(In the original figure, a check mark appears in one column for each factor; the illustrated ratings sum to a Total Score of 7.)

Figure 7.1 An Illustrative Guide for Analytically Scoring a Student's Written Composition

The advantage of an analytic scoring system is that it can help you identify
the specific strengths and weaknesses of your students’ performances and, there-
fore, allows you to communicate such diagnoses to your students in a pinpointed
fashion. The downside of analytic scoring is that a teacher sometimes becomes
so attentive to the subpoints in a scoring system that the forest (overall quality)
almost literally can’t be seen because of a focus on individual trees (the separate
scoring criteria). In less metaphoric language, the teacher will miss the communi-
cation of the student’s response “as a whole” because of paying excessive atten-
tion to a host of individual evaluative criteria.

A middle-of-the-road scoring approach can be seen when teachers ini-
tially grade all students’ responses holistically and then return for an analytic
scoring of only those responses that were judged, overall, to be unsatisfac-
tory. After the analytic scoring of the unsatisfactory responses, the teacher
then relays more fine-grained diagnostic information to those students whose
unsatisfactory responses were analytically scored. The idea underlying this
sort of hybrid approach is that the students who are most in need of fine-
grained feedback are those who, on the basis of the holistic evaluation, are
performing less well.
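
As a rough illustration of this hybrid sequence, the sketch below (Python, with an invented cutoff and invented student labels) first records holistic scores and then flags only the low-scoring responses for a follow-up analytic pass.

```python
# A sketch of the hybrid approach: holistic scores first, then an analytic
# re-scoring (for diagnostic feedback) of only the responses judged unsatisfactory.
# The cutoff value and data shapes here are invented for illustration.

HOLISTIC_CUTOFF = 3  # e.g., on a 1-6 holistic scale, 3 or below = unsatisfactory

def needs_analytic_pass(holistic_scores: dict[str, int]) -> list[str]:
    """Return the students whose responses should get a follow-up analytic scoring."""
    return [student for student, score in holistic_scores.items()
            if score <= HOLISTIC_CUTOFF]

holistic_scores = {"Student A": 5, "Student B": 2, "Student C": 3}
print(needs_analytic_pass(holistic_scores))  # -> ['Student B', 'Student C']
```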

This initial guideline for scoring students’ essay responses applies to the
scoring of responses to all kinds of essay items. As always, your decision about
whether to opt for holistic or analytic scoring should flow directly from your
intended use of the test results. Putting it another way, your choice of scoring
approach will depend on the educational decisions that are linked to the test’s
interpreted results. (Is this beginning to sound somewhat familiar?)

Devising a Tentative Scoring Key
No matter what sort of approach you opt for in scoring your students’ essay
responses, you’ll find that it will be useful to develop a tentative scoring key for
responses to each item in advance of actually scoring students’ responses. Such
tentative scoring schemes are almost certain to be revised based on your scor-
ing of actual student papers, but that’s to be anticipated. If you wait until you
commence scoring your students’ essay responses, there’s too much likelihood
you’ll be unduly influenced by the responses of the first few students whose
papers you grade. If those papers are atypical, the resultant scoring scheme is
apt to be unsound. It is far better to think through, at least tentatively, what
you really hope students will supply in their responses, and then modify the
scoring key if unanticipated responses from students suggest that alterations
are needed.

If you don’t have a tentative scoring key in place, you are very likely to be
influenced by such factors as a student’s vocabulary or writing style, even though,
in truth, such variables may be of little importance to you. Advance exploration of the evaluative criteria you intend to employ, either holistically or analytically,
is a winning idea when scoring responses to essay items.

Let’s take a look at what a tentative scoring key might look like if it were
employed by a teacher in a U.S. history course. The skill being promoted by the
teacher is a genuinely high-level one that’s reflected in the following curricular aim:

When presented with a description of a current real-world problem in
the United States, the student will be able to (1) cite one or more signifi-
cant events in American history that are particularly relevant to the pre-
sented problem’s solution, (2) defend the relevance of the cited events,
(3) propose a solution to the presented problem, and (4) use historical
parallels from the cited events to support the proposed solution.

As indicated, this is no rinky-dink, low-level cognitive skill. It’s way, way up
there in its cognitive demands on students. Now, let’s suppose our hypothetical
history teacher routinely monitors students’ progress related to this skill by hav-
ing students respond to similar problems either orally or in writing. The teacher
presents a real-world problem. The students, aloud or in an essay, try to come up
with a suitable four-part response.

Figure 7.2 illustrates a tentative scoring key for evaluating students’ oral
or written responses to constructed-response items measuring their mastery
of this curricular aim. As you can discern, in this tentative scoring key, greater weight has been given to the fourth subtask—namely, the student's provision of
historical parallels to support the student’s proposed solution to the real-world
problem the teacher proposed. Much less significance was given to the first sub-
task of citing pertinent historical events. This, remember, is a tentative scoring
key. Perhaps, when the history teacher begins scoring students’ responses, it will
turn out that the first subtask is far more formidable than the teacher originally
thought and the fourth subtask appears to be handled without much difficulty
by students. At that point, based on the way students are actually responding,
this hypothetical history teacher surely ought to do a bit of score-point juggling
in the tentative scoring key before applying it in final form to the appraisal of
students’ responses.

Tentative Point Allocation

Students' Subtasks                                Weak Response   Acceptable Response   Strong Response
1. Citation of Pertinent Historical Events            0 pts.            5 pts.              10 pts.
2. Defense of the Cited Historical Events             0 pts.           10 pts.              20 pts.
3. Proposed Solution to the Presented Problem         0 pts.           10 pts.              20 pts.
4. Historical Support of Proposed Solution            0 pts.           20 pts.              40 pts.

Total Points Possible = 90 pts.

Figure 7.2 An Illustrative Tentative Scoring Key for a High-Level Cognitive Skill in History

Deciding Early About the Importance of Mechanics

Few things influence scorers of students' essay responses as much as the mechanics of writing employed in the response. If the student displays subpar spelling, chaotic capitalization, and pathetic punctuation, it's pretty tough for a scorer of the student's response to avoid being influenced adversely. In some instances, of course, mechanics of writing play a meaningful role in scoring students' performances. For instance, suppose you're scoring students' written responses to the task of writing an application letter for a position as a reporter for a local newspaper. In such an instance, it is clear that mechanics of writing would be pretty important when judging the students' responses. But in a chemistry class, perhaps the teacher cares less about such factors when scoring students' essay responses to a problem-solving task. The third guideline simply suggests that you make up your mind about this issue early in the process so that, if mechanics aren't very important to you, you don't let your students' writing mechanics subconsciously influence the way you score their responses.

Parent Talk

The mother of one of your students, Jill Jenkins, was elected to your district's school board 2 years ago. Accordingly, anytime Mrs. Jenkins wants to discuss Jill's education, you are understandably "all ears."

Mrs. Jenkins recently stopped by your classroom to say that three of her fellow board members have been complaining there are too many essay tests being given in the district. The three board members contend that such tests, because they must be scored "subjectively," are neither reliable nor valid.

Because many of the tests you give Jill and her classmates are essay tests, Mrs. Jenkins asks you whether the three board members are correct.

If I were you, here's how I'd respond to Jill's mother. (Oh yes, because Mrs. Jenkins is a board member, I'd simply ooze politeness and professionalism while I responded.)

"Thanks for giving me an opportunity to comment on this issue, Mrs. Jenkins. As you've already guessed, I believe in the importance of constructed-response examinations such as essay tests. The real virtue of essay tests is that they call for students to create their responses, not merely recognize correct answers, as students must do with other types of tests, such as those featuring multiple-choice items.

"Your three school board colleagues are correct when they say it is more difficult to score constructed-response items consistently than it is to score selected-response items consistently. And that's a shortcoming of essay tests. But it's a shortcoming that's more than compensated for by the far greater authenticity of the tasks presented to students in almost all constructed-response tests. When Jill writes essays during my major exams, this is much closer to what she'll be doing in later life than when she's asked to choose the best answer from four options. In real life, people aren't given four-choice options. Rather, they're required to generate a response reflecting their views. Essay tests give Jill and her classmates a chance to do just that.

"Now, let's talk briefly about consistency and validity. Actually, Mrs. Jenkins, it is important for tests to be scored consistently. We refer to the consistency with which a test is scored as its reliability. And if a test isn't reliable, then the interpretations about students we make based on test scores aren't likely to be valid. It's a technical point, Mrs. Jenkins, but it isn't a test that's valid or invalid; it's a score-based interpretation about a student's ability that may or may not be valid. As your board colleagues pointed out, some essay tests are not scored very reliably. But essay tests can be scored reliably and they can yield valid inferences about students.

"I hope my reactions have been helpful. I'll be happy to show you some of the actual essay exams I've used with Jill's class. And I'm sure the district superintendent, Dr. Stanley, can supply you with additional information."

Now, how would you respond to Mrs. Jenkins?

Scoring One Item at a Time

If you're using an essay examination with more than one item, be sure to score all your students' responses to one item, then score all their responses to the next item, and so on. Do not score all responses of a given student and then go on to the next student's paper. There is way too much danger that a student's responses to early items will unduly influence your scoring of the student's responses to later items. If you score all responses to item number 1 and then move on to the responses to item number 2, you can eliminate this tendency. In addition, the scoring will often go a bit quicker because you won't need to shift evaluative criteria between items. Adhering to this fourth guideline will invariably lead to more consistent scoring and, therefore, to more accurate response-based inferences about your students. There'll be more paper handling than you might prefer, but the increased accuracy of your scoring will be worth it. (Besides, you'll be getting a smidge of psychomotor exercise with all the paper shuffling.)
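
The difference between the two scoring orders is easy to picture in code. The sketch below (Python, with invented data shapes and a placeholder scoring function) simply shows the outer loop running over items rather than students.

```python
# A sketch contrasting scoring orders: the outer loop runs over items, so every
# response to item 1 is scored before any response to item 2 is examined.

responses = {
    "Student A": {"Item 1": "response text", "Item 2": "response text"},
    "Student B": {"Item 1": "response text", "Item 2": "response text"},
}

def score_response(item: str, response: str) -> int:
    """Placeholder for whatever holistic or analytic judgment you apply."""
    return 0

items = ["Item 1", "Item 2"]
scores: dict[str, dict[str, int]] = {student: {} for student in responses}
for item in items:                      # item-by-item, as the fourth guideline recommends
    for student, answer_sheet in responses.items():
        scores[student][item] = score_response(item, answer_sheet[item])
```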

Striving for Anonymity
Because I’ve been a teacher, I know all too well how quickly teachers can identify
their students’ writing styles, particularly those students who have especially
distinctive styles, such as the “scrawlers,” the “petite letter-size crew,” and those
who dot their i’s with half-moons or cross their t’s with lightning bolts. Yet, insofar
as you can, try not to know whose responses you’re scoring. One simple way to
help in that effort is to ask students to write their names on the reverse side of
the last sheet of the examination in the response booklet. Try not to peek at the
students’ names until you’ve scored all of the exams.

I used such an approach for three decades of scoring graduate students’
essay examinations at UCLA. It worked fairly well. Occasionally I was really
surprised because students who had appeared to be knowledgeable during
class discussions sometimes displayed just the opposite on my exams, while
several in-class Silent Sarahs and Quiet Quentins came up with really solid
exam performances. I’m sure that had I known whose papers I was grading,
I would have been improperly influenced by my classroom-based percep-
tions of different students’ abilities. I am not suggesting that you shouldn’t
use students’ classroom discussions as part of your evaluation system. Rather,
I’m advising you that classroom-based perceptions of students can sometimes
cloud your scoring of essay responses. This is one strong reason for you to favor
anonymous scoring.
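
For teachers who collect essay responses digitally, a rough analogue of this paper-based tactic is to mask students' names behind temporary codes before scoring. The sketch below is purely illustrative; the names and code format are invented.

```python
# A sketch of anonymized scoring for digitally collected responses: shuffle the
# names, assign temporary codes, score by code, then map the scores back.
import random

responses = {"Jill": "essay text", "Quentin": "essay text", "Sarah": "essay text"}

names = list(responses)
random.shuffle(names)                                   # break any ordering cues
codes = {name: f"Paper {i + 1:03d}" for i, name in enumerate(names)}

masked = {codes[name]: text for name, text in responses.items()}
scores_by_code = {code: 0 for code in masked}           # placeholder scores go here

scores_by_name = {name: scores_by_code[codes[name]] for name in responses}
```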

In review, we’ve considered five guidelines for scoring students’ responses to
essay examinations. If your classroom assessment procedures involve any essay
items, you’ll find these five practical guidelines will go a long way in helping you
come up with consistent scores for your students’ responses. And consistency,
as you learned in Chapter 3, is something that makes psychometricians mildly
euphoric.

In the next two chapters, you’ll learn about two less common forms of
constructed-response items. You’ll learn how to create and score performance
assessments and how to use students’ portfolios in classroom assessments. You’ll
find that what we’ve been dealing with in this chapter will serve as a useful
springboard to the content of the next two chapters.


What Do Classroom Teachers Really
Need to Know About Constructed-
Response Tests?
At the close of a chapter dealing largely with the nuts and bolts of creating and
scoring written constructed-response tests, you probably expect to be told you
really need to internalize all those nifty little guidelines so that when you spin
out your own short-answer and essay items, you’ll elicit student responses you
can score accurately. Well, that’s not a terrible aspiration, but there’s really a
more important insight you need to walk away with after reading the chapter.
That insight, not surprisingly, derives from the central purpose of classroom
assessment—namely, to draw accurate inferences about students' status so you can make more appropriate educational decisions. What you really need to know about short-answer items and essay items is that you should use them as part of your classroom assessment procedures if you want to make the sorts of inferences about your students that those students' responses to such items would support.

But What Does This Have to Do with Teaching?

A teacher wants students to learn the stuff that's being taught—to learn it well. And one of the best ways to see whether students have really acquired a cognitive skill, or have truly soaked up scads of knowledge, is to have them display what they've learned by generating answers to constructed-response tests. A student who has learned something well enough to toss it out from scratch, rather than choosing from presented options, clearly has learned it pretty darn well.

But, as this chapter has often pointed out, the creation and scoring of first-rate constructed-response items, especially anything beyond short-answer items, require serious effort from a teacher. And if constructed-response items aren't first-rate, then they are not likely to help a teacher arrive at valid interpretations about students' knowledge and/or skills. Decent constructed-response tests take time to create and to score. And that's where this kind of classroom assessment runs smack into a teacher's instructional-planning requirements.

A classroom teacher should typically reserve constructed-response assessments for a modest number of really significant instructional outcomes. Moreover, remember, a teacher's assessments should typically exemplify the outcomes that the teacher wants most students to master. It's better to get students to master three to five really high-level cognitive skills, adroitly assessed by excellent constructed-response tests, than it is to have students master a litany of low-level—easy-come, easy-go—outcomes. In short, choose your constructed-response targets with care. Those choices will have curricular and instructional implications.

Putting it another way, if you, as a classroom teacher, want to determine whether your students have the skills and/or knowledge that can be best measured by short-answer or essay items, then you need to refresh your memory regarding how to avoid serious item-construction or response-scoring errors. A review of the guidelines presented on pages 190, 195, and 198 should give you the brushup you need. Don't assume you are obligated to use short-answer or essay items simply because you now know a bit more about how to crank them out. If you're interested in the extensiveness of your students' knowledge regarding Topic Z, it may be far more efficient to employ fairly low-level selected-response kinds of items. If, however, you really want to make inferences about your students' skills in being able to perform the kinds of tasks represented by excellent short-answer and essay items, then the guidelines provided in the chapter should be consulted.

Chapter Summary
After a guilt-induced apology for mislabeling
this chapter, we started off with a description
of short-answer items accompanied by a set of
guidelines (page 190) regarding how to write
short-answer items. Next, we took up essay items
and indicated that, although students’ written
compositions constitute a particular kind of essay
response, most of the recommendations for con-
structing essay items and for scoring students’
responses were the same, whether measuring

students’ composition skills or skills in subject
areas other than language arts. Guidelines were
provided for writing essay items (page 195) and
for scoring students’ responses to essay items
(page 200). The chapter concluded with the sug-
gestion that much of the content to be treated in
the following two chapters, because those chap-
ters also focus on constructed-response assess-
ment schemes, will also be related to the creation
and scoring of short-answer and essay items.

References

Brookhart, S. M., & Nitko, A. J. (2018). Educational assessment of students (8th ed.). Pearson.

Hawe, E., Dixon, H., Murray, J., & Chandler, S. (2021). Using rubrics and exemplars to develop students' evaluative and productive knowledge and skill. Journal of Further and Higher Education, 4(8), 1033–1047. https://doi.org/10.1080/0309877X.2020.1851358

Hogan, T. P. (2013). Constructed-response approaches for classroom assessment. In J. H. McMillan (Ed.), SAGE handbook of research on classroom assessment (pp. 275–292). SAGE Publications.

Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2016). Handbook of test development (2nd ed.). Routledge.

Miller, M. D., & Linn, R. (2013). Measurement and assessment in teaching (11th ed.). Pearson.

Stiggins, R. J., & Chappuis, J. (2017). An introduction to student-involved assessment FOR learning (7th ed.). Prentice Hall.

To, J., Panadero, E., & Carless, D. (2021). A systematic review of the educational uses and effects of exemplars. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2021.2011134


A Testing Takeaway

Evaluating Student-Constructed Responses Using Rubrics*
W. James Popham, University of California, Los Angeles

Two prominent categories of students' responses account for practically all items found in
educational tests, namely, selected-response items and constructed-response items. Whereas selected-
response items are particularly efficient to score, they sometimes constrain the originality of a
test-taker’s response. In contrast, although constructed-response approaches such as essay and
short-answer items provide test-takers with ample opportunities to invoke their originality, the
scoring of students’ responses frequently represents a formidable challenge.

One strong argument in favor of constructed-response items is that the challenges they
embody often coincide directly with what’s required of people in the real world. Although
a friend may often ask you to “explain your reasons” for supporting a politician’s proposed
policy changes, rarely does that friend solicit your opinion by providing you with four
possible options from which you must choose only one.

Given the potentially wide range of students’ constructed responses, scoring those responses
is a substantial task. If a satisfactory job of scoring students’ responses to essay or short-answer
items cannot be done, of course, it would be better never to employ such items in the first place.

For the last several decades, educators have evaluated the quality of students’ constructed
responses by employing a rubric. Such rubrics—that is, scoring guides—help identify
whether a student’s response is dazzling or dismal. A rubric that’s employed to score
students’ responses has, at minimum, three features:

• Evaluative criteria. The factors used to judge the quality of a student’s response

• Descriptions of qualitative differences for all evaluative criteria. Clearly spelled-out descriptions
of qualitative distinctions for each criterion

• Scoring approach to be used. Whether the evaluative criteria are to be applied collectively
(holistically) or on a criterion-by-criterion basis (analytically)

The virtue of a rubric is that it brings clarity to those who score a student’s responses. But
also, if provided to students early in the instructional process, rubrics can clarify what’s
expected. Yet not all rubrics are created equal. Here are two losers and one winner:

• Task-specific rubric. This loser focuses on eliciting responses to a single task, not demonstrat-
ing mastery levels of a skill.

• Hypergeneral rubric. Another loser, its meanings are murky because it lacks clarity.

• Skill-focused rubric. This winner addresses mastery of the skill sought, not of an item.

Thus, when working with rubrics, either to teach students or to score their work, always
remember that even though many rubrics represent real contributors to the educational
enterprise, some simply don’t.

*From Chapter 7 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.


Chapter 9

Portfolio Assessment

Chief Chapter Outcome

An understanding not only of the distinctive relationship between
measurement and instruction inherent in portfolio assessment, but
also of the essentials of a process that teachers can use to install
portfolio assessment

Learning Objectives

9.1 Describe both positive and negative features of the relationship between
measurement and instruction.

9.2 From memory, identify and explain the key features of a seven-step
process that teachers can employ to install portfolio assessment.

9.3 Having identified three different functions of portfolio assessment, isolate
the chief strengths and weaknesses of portfolio assessment as a classroom
measurement approach.

“Assessment should be a part of instruction, not apart from it” is a point of view
most proponents of portfolio assessment would enthusiastically endorse. Portfo-
lio assessment, a contemporary entry in the educational measurement derby, has
captured the attention of many educators because it represents a clear alternative
to more traditional forms of educational testing.

A portfolio is a systematic collection of one’s work. In education, portfolios
consist of collections of students’ work. Although the application of portfolios in
education has been a relatively recent phenomenon, portfolios have been widely
used in a number of other fields for many years. Portfolios, in fact, constitute the
chief method by which certain professionals display their skills and accomplish-
ments. For example, portfolios are traditionally used for this purpose by photogra-
phers, artists, journalists, models, architects, and so on. Although many educators
tend to think of portfolios as collections of written works featuring “words on
paper,” today’s explosion of technological devices makes it possible for students to

M09_POPH0936_10_SE_C09.indd 234M09_POPH0936_10_SE_C09.indd 234 09/11/23 12:48 PM09/11/23 12:48 PM

­Claasrrom rsrtrClr aasaaomser ssasa lsrse-SlCs rsrtrClr aasaaomser 235

assemble their work in a variety of electronically retained forms instead of a sheaf
of hand-written papers in a manila folder. An important feature of portfolios is that
they must be updated as a person’s achievements and skills grow.

Portfolios have been warmly embraced—particularly by many educators who
regard traditional assessment practices with scant enthusiasm. In Table 9.1, for
example, a classic chart presented by Tierney, Carter, and Desai (1991) indicates
what those three authors believe are the differences between portfolio assessment
and assessment based on standardized testing tactics.

One of the settings in which portfolio assessment has been used with suc-
cess is in the measurement of students with severe disabilities. Such youngsters
sometimes encounter insuperable difficulties in displaying their capabilities via
more routine sorts of testing. This strength of portfolio assessment, as might be
expected, also turns out to be its weakness. As you will see in this chapter, the use
of portfolios as a measurement method allows teachers to particularize assess-
ment approaches for different students. Such particularization, although it may
work well in the case of an individual student, usually leads to different collec-
tions of evidence from different students, thus making accurate comparisons of
different students’ work—or one student’s work over time—somewhat difficult.

Classroom Portfolio Assessment Versus
Large-Scale Portfolio Assessment
Classroom Applications
Most advocates of portfolio assessment believe the real payoffs for such assess-
ment approaches lie in the individual teacher’s classroom, because the relation-
ship between instruction and assessment will be strengthened as a consequence of students' continuing accumulation of work products in their portfolios. Ideally, teachers who adopt portfolios in their classrooms will make the ongoing collection and appraisal of students' work a central focus of the instructional program, rather than a peripheral activity whereby students occasionally gather up their work to convince a teacher's supervisors or students' parents that good things have been going on in class.

Table 9.1 Differences in Assessment Outcomes Between Portfolios and Standardized Testing Tactics

Portfolio: Represents the range of reading and writing students are engaged in
Testing: Assesses students across a limited range of reading and writing assignments that may not match what students do

Portfolio: Engages students in assessing their progress and/or accomplishments and establishing ongoing learning goals
Testing: Mechanically scored or scored by teachers who have little input

Portfolio: Measures each student's achievement, while allowing for individual differences between students
Testing: Assesses all students on the same dimensions

Portfolio: Represents a collaborative approach to assessment
Testing: Assessment process is not collaborative

Portfolio: Has a goal of student self-assessment
Testing: Student self-assessment is not a goal

Portfolio: Addresses improvement, effort, and achievement
Testing: Addresses achievement only

Portfolio: Links assessment and teaching to learning
Testing: Separates learning, testing, and teaching

Source: Material from Portfolio Assessment in the Reading-Writing Classroom, by Robert J. Tierney, Mark A. Carter, and Laura E. Desai, published by Christopher-Gordon Publishers, Inc. © 1991, used with permission of the publisher.

Here’s a description of how an elementary teacher might use portfolios to assess
students’ progress in social studies, language arts, and mathematics. The teacher,
let’s call him Phil Pholio, asks students to keep three portfolios, one in each of those
three subject fields. In each portfolio, the students are to place their early and revised
work products. The work products are always dated so that Mr. Pholio, as well as the
students themselves, can see what kinds of differences in quality (if any) take place
over time. For example, if effective instruction is being provided, there should be
discernible improvement in the caliber of students’ written compositions, solutions
to mathematics problems, and analyses of social issues.

Three or four times per semester, Mr. Pholio holds a 15- to 20-minute
portfolio conference with each student about the three different portfolios.
The other, nonconferencing students take part in small-group and inde-
pendent learning activities while the portfolio conferences are being con-
ducted. During a conference, the participating student plays an active role
in evaluating his or her own work. Toward the close of the school year,
students select from their regular portfolios a series of work products that
not only represent their best final versions, but also indicate how those final
products were created. These selections are placed in a display portfolio
featured at a spring open-school session designed for parents. Parents who
visit the school are urged to take their children’s display portfolios home.
Mr. Pholio also sends portfolios home to parents who are unable to attend
the open-school event.

There are, of course, many other ways to use portfolios effectively in a
classroom. Phil Pholio, our phictitious (sic) teacher, employed a fairly common
approach, but a variety of alternative procedures could also work quite nicely.
The major consideration is that the teacher uses portfolio assessment as an inte-
gral aspect of the instructional process. Because portfolios can be tailored to a
specific student’s evolving growth, the ongoing diagnostic value of portfolios
for teachers is immense.

Who Is Evaluating Whom?
Roger Farr, a leader in language arts instruction and assessment, often contended
that the real payoff from proper portfolio assessment is that students’ self-evaluation
capabilities are enhanced (1994). Thus, during portfolio conferences the teacher
encourages students to come up with personal appraisals of their own work. The
conference, then, becomes far more than merely an opportunity for the teacher
to dispense an "oral report card." On the contrary, students' self-evaluation skills are nurtured not only during portfolio conferences, but also throughout the entire
school year. For this reason, Farr strongly preferred the term working portfolios
to the term showcase portfolios because he believed self-evaluation is nurtured
more readily in connection with ongoing reviews of products not intended to
impress external viewers.

For self-evaluation purposes, it is particularly useful to be able to compare
earlier work with later work. Fortunately, even if a teacher’s instruction is down-
right abysmal, students grow older and, as a consequence of maturation, tend
to get better at what they do in school. If a student is required to review three
versions of her or his written composition (a first draft, a second draft, and a
final draft), self-evaluation can be fostered by encouraging the student to make
comparative judgments of the three compositions based on a rubric featuring
appropriate evaluative criteria. As anyone who has done much writing knows,
written efforts tend to get better with time and revision. Contrasting later ver-
sions with earlier versions can prove illuminating from an appraisal perspective
and, because students’ self-evaluation is so critical to their future growth, from
an instructional perspective as well.
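
One concrete way to support such comparisons (purely illustrative, with invented dates, criteria, and scores) is to line up a student's rubric ratings for each dated draft and look at the change per criterion.

```python
# A sketch of draft-over-draft comparison: the same rubric criteria are rated
# for each dated draft, and the per-criterion change is reported to the student.

drafts = {
    "2024-09-10 first draft":  {"Organization": 1, "Word Choice": 1, "Mechanics": 0},
    "2024-10-01 second draft": {"Organization": 2, "Word Choice": 1, "Mechanics": 1},
    "2024-11-05 final draft":  {"Organization": 2, "Word Choice": 2, "Mechanics": 2},
}

first, last = list(drafts)[0], list(drafts)[-1]
for criterion in drafts[first]:
    change = drafts[last][criterion] - drafts[first][criterion]
    print(f"{criterion}: {drafts[first][criterion]} -> {drafts[last][criterion]} ({change:+d})")
```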

Large-Scale Applications
It is one thing to use portfolios for classroom assessment; it is quite another
to use portfolios for large-scale assessment programs. Several states and large
school districts have attempted to install portfolios as a central component of a
large-scale accountability assessment program—that is, an evaluation approach
in which students’ performances serve as an indicator of an educational system’s
effectiveness. To date, the results of efforts to employ portfolios for accountability
purposes have not been encouraging.

In large-scale applications of portfolio assessments for accountability pur-
poses, students’ portfolios are judged either by the students’ regular teachers
or by a cadre of specially trained scorers (often teachers) who carry out the
bulk of scoring at a central site. The problem with specially trained scorers and
central-site scoring is that it typically costs much more than can be afforded.
Some states, therefore, have opted to have all portfolios scored by students’
own teachers, who then relay such scores to the state department. The problem
with having regular teachers score students’ portfolios, however, is that such
scoring tends to be too unreliable for use in accountability programs. Not only
have teachers usually not been provided with thorough training about how
to score portfolios, but there is also a tendency for teachers to be biased in
favor of their own students. To cope with such problems, sometimes teachers
in a school or district evaluate their students’ portfolios, but then a random
sample of those portfolios are scored by state officials as an “audit” of the local
scoring’s accuracy.

One of the most visible of the statewide efforts to use portfolios on every pupil
has been a performance-assessment program in the state of Vermont. Because substantial national attention has been focused on the Vermont program, and
because it has been evaluated independently, many policymakers in other states
have drawn on the experiences encountered in the Vermont Portfolio Assessment
Program. Unfortunately, independent evaluators of Vermont’s statewide efforts
to use portfolios found that there was considerable unreliability in the apprais-
als given to students’ work. And, if you recall the importance of reliability as
discussed in Chapter 3, you know that it’s tough to draw valid inferences about
students’ achievements if the assessments of those achievements are not yielding
consistent results.

But, of course, this is a book about classroom assessment, not large-scale
assessment. It certainly hasn’t been shown definitively that portfolios do not have
a place in large-scale assessment. What has been shown, however, is that there are
significant obstacles to be surmounted if portfolio assessment is going to make a
meaningful contribution to large-scale educational accountability testing.

Decision Time
Does Self-Evaluation Equal Self-Grading?

After a midsummer, schoolwide 3-day workshop
on the Instructional Payoffs of Classroom
Portfolios, the faculty at Rhoda Street Elementary
School have agreed to install student portfolios
in all classrooms for one or more subject areas.
Maria Martinez, an experienced third-grade
teacher in the school, has decided to try out
portfolios only in mathematics. She admits to her
family (but not to her fellow teachers) that she’s
not certain she’ll be able to use portfolios properly
with her students.

Because she has attempted to follow guidelines
of the National Council of Teachers of Mathematics,
Maria stresses mathematical problem solving and
the integration of mathematical understanding with
content from other disciplines. Accordingly, she asks
her students to place in their mathematics portfolios
versions of their attempts to solve quantitative
problems drawn from other subjects. Maria poses
these problems for her third-graders and then
instructs them to prepare an initial solution strategy
and also revise that solution at least twice. Students
are directed to put all solutions (dated) in their
portfolios.

Six weeks after the start of school, Maria sets up
a series of 15-minute portfolio conferences with her
students. During the 3 days on which the portfolio
conferences are held, students who are not involved in
a conference move through a series of learning stations
in other subject areas, where they typically engage in a
fair amount of peer critiquing of each other’s responses
to various kinds of practice exercises.

Having learned during the summer workshop
that the promotion of students’ self-evaluation is
critical if students are to get the most from portfolios,
Maria devotes the bulk of her 15-minute conferences
to students’ personal appraisals of their own work.
Although Maria offers some of her own appraisals of
most students’ work, she often allows the student’s
self-evaluation to override her own estimates of a
student’s ability to solve each problem.

Because it will soon be time to give students
their 10-week grades, Maria doesn’t know whether
to base the grades chiefly on her own judgments or
on the students’ self-appraisals.

If you were Maria, what would you
decide to do?


Seven Key Ingredients in Classroom
Portfolio Assessment
Although there are numerous ways to install and sustain portfolios in a class-
room, you will find that the following seven-step sequence provides a reason-
able template for getting underway with portfolio assessment. Taken together,
these seven activities capture the key ingredients in classroom-based portfolio
assessment.

1. Make sure your students “own” their portfolios. In order for portfolios to
represent a student’s evolving work accurately, and to foster the kind of
self-evaluation so crucial if portfolios are to be truly educational, students
must perceive portfolios to be collections of their own work, and not
merely temporary receptacles for products that their teacher ultimately
grades. You will probably want to introduce the notion of portfolio as-
sessment to your students (assuming portfolio assessment isn’t already
a schoolwide operation and your students aren’t already steeped in the
use of portfolios) by explaining the distinctive functions of portfolios in
the classroom.

2. Decide what kinds of work samples to collect. Various kinds of work samples can
be included in a portfolio. Obviously, such products will vary from subject
to subject. In general, a substantial variety of work products is preferable to
a limited range of work products. However, for portfolios organized around
students’ mastery of a particularly limited curricular aim, it may be preferable
to include only a single kind of work product. Ideally, you and your students
can collaboratively determine what goes in the portfolio.

3. Collect and store work samples. Students need to collect the designated work
samples as they are created, place them in a suitable container (a folder or
notebook, for example), and then store the container in a file cabinet, storage
box, or some other safe location. You may need to work individually with
your students to help them decide whether particular products should be
placed in their portfolios. The actual organization of a portfolio’s contents
depends, of course, on the nature of the work samples being collected. For
instance, today’s rapidly evolving digital technology will surely present a
range of new options for creating electronic portfolios.

4. Select criteria by which to evaluate portfolio work samples. Working collaboratively
with students, carve out a set of criteria by which you and your students can
judge the quality of your students’ portfolio products. Because of the likely
diversity of products in different students’ portfolios, the identification of
evaluative criteria will not be a simple task. Yet, unless at least rudimentary
evaluative criteria are isolated, the students will find it difficult to evaluate
their own efforts and, thereafter, to strive for improvement. The criteria, once
selected, should be described with the same sort of clarity we saw in Chapter 8 regarding how to employ a rubric's evaluative criteria when judging students'
responses to performance test tasks. Once students realize that the evaluative
criteria will be employed to appraise their own work, most students get into
these criteria-appraisal sessions with enthusiasm.

5. Require students to evaluate their own portfolio products continually. Using
the agreed-on evaluative criteria, be sure your students routinely appraise
their own work. Students can be directed to evaluate their work products
holistically, analytically, or using a combination of both approaches. Such
self-evaluation can be made routine by requiring each student to complete
brief evaluation slips on cards on which they identify the major strengths
and weaknesses of a given product and then suggest how the product could
be improved. Be sure to have your students date such self-evaluation sheets
so they can keep track of modifications in their self-evaluation skills. Each
completed self-evaluation sheet should be stapled or paper-clipped to the
work product being evaluated. For digital portfolios, of course, comparable
electronic actions would be undertaken.

6. Schedule and conduct portfolio conferences. Portfolio conferences take time. Yet
these interchange sessions between teachers and students regarding students’
work are really pivotal in making sure portfolio assessment fulfills its poten-
tial. The conference should not only evaluate your students’ work products,
but should also help them improve their self-evaluation abilities. Try to hold
as many of these conferences as you can. In order to make the conferences
time efficient, be sure to have students prepare for the conferences so you can
start right in on the topics of most concern to you and the students.

7. Involve parents in the portfolio-assessment process. Early in the school year, make
sure your students’ parents understand the nature of the portfolio-assessment
process that you’ve devised for your classroom. Insofar as is practical, encour-
age your students’ parents/guardians periodically to review their children’s
work samples, as well as their children’s self-evaluation of those work sam-
ples. The more active that parents become in reviewing their children’s work,
the stronger the message will be to the child indicating the portfolio activity
is really worthwhile. If you wish, you may have students select their best
work for a showcase portfolio or, instead, you may simply use the students’
working portfolios.

These seven steps reflect only the most important activities that teachers
might engage in when creating assessment programs in their classrooms. There
are obviously all sorts of variations and embellishments possible.
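
For teachers who keep digital portfolios, the record keeping behind several of these steps (dated work samples with attached self-evaluations, plus the agreed-on evaluative criteria) can be sketched with a simple data structure. The class names and fields below are hypothetical, not a prescribed format.

```python
# A sketch of a digital portfolio record: dated work samples, each with the
# student's self-evaluation attached, plus the agreed-on evaluative criteria.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class WorkSample:
    title: str
    created: date
    file_path: str
    self_evaluation: str = ""     # the student's own appraisal, also dated in practice

@dataclass
class Portfolio:
    student: str
    subject: str
    evaluative_criteria: list[str] = field(default_factory=list)
    samples: list[WorkSample] = field(default_factory=list)

    def add_sample(self, sample: WorkSample) -> None:
        self.samples.append(sample)

math_portfolio = Portfolio(
    student="Jill",
    subject="Mathematics",
    evaluative_criteria=["Problem representation", "Solution strategy", "Accuracy"],
)
math_portfolio.add_sample(
    WorkSample("Fraction word problem, draft 1", date(2024, 9, 12), "drafts/fractions_v1.pdf",
               self_evaluation="My diagram helped, but I mixed up the denominators."),
)
```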

There’s one situation in which heavy student involvement in the portfolio
process may not make instructional sense. This occurs in the early grades when, in
the teacher’s judgment, those little tykes are not developmentally ready to take a
meaningful hand in a full-blown portfolio self-evaluation extravaganza. Any sort
of educational assessment ought to be developmentally appropriate for the students
who are being assessed. Thus, for students who are in early primary grades, a teacher may sensibly decide to employ only showcase portfolios to display a child's
accomplishments to the child and the child’s parents. Working portfolios, simply
bristling with student-evaluated products, can be left for later. This is, clearly, a
teacher’s call.

Purposeful Portfolios
There are numerous choice-points you’ll encounter if you embark on a
portfolio-assessment approach in your own classroom. The first one ought to
revolve around purpose. Why is it that you are contemplating a meaningful prance
down the portfolio pathway?

Assessment specialists typically identify three chief purposes for portfolio
assessment. The first of these is documentation of student progress, wherein the major
function of the assembled work samples is to provide the student, the teacher,
and the student’s parents with evidence about the student’s growth—or lack of
it. Such working portfolios provide meaningful opportunities for self-evaluation
by students.

Pointing out that students’ achievement levels ought to influence teach-
ers’ instructional decisions, Anderson (2003) concluded that “the information
should be collected as close to the decision as possible (e.g., final examinations
are administered in close proximity to end-of-term grades). However, if decisions
are to be based on learning, then a plan for information collection over time must
be developed and implemented” (p. 44). The more current any documentation of
students’ progress is, the more accurate such documentation is apt to be.


From an instructional perspective, the really special advantage of portfolio
assessment is that its recurrent assessment of the student’s status with respect
to mastery of one or more demanding skills provides both teachers and stu-
dents with assessment-informed opportunities to make any needed adjustments
in what they are currently doing. Later in Chapter 12, we will be considering
the instructional dividends of the formative-assessment process. Because of its
recurring assessment of students’ evolving mastery of skills, portfolio assessment
practically forces teachers to engage in a classroom instructional process closely
resembling formative assessment.

A second purpose of portfolios is to provide an opportunity for showcasing
student accomplishments. Chappuis and Stiggins (2017) have described portfolios
that showcase students’ best work as celebration portfolios, and they contend that
celebration portfolios are especially appropriate for the early grades. In portfolios
intended to showcase student accomplishments, students typically select their
best work and reflect thoughtfully on its quality.

One teacher in the Midwest always makes sure students include the follow-
ing elements in their showcase portfolios:

• A letter of introduction to portfolio reviewers

• A table of contents

• Identification of the skills or knowledge being demonstrated

• A representative sample of the student’s best work

• Dates on all entries

• The evaluative criteria (or rubric) being used

• The student’s self-reflection on all entries

Students' self-reflections about the entries in their portfolios are a pivotal ingredient in showcase portfolios. Some portfolio proponents contend that a portfolio's
self-evaluation by the student helps the learner learn better and permits the reader
of the portfolio to gain insights about how the learner learns.
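
If a teacher wanted to, a completeness check against a checklist like the one above could even be automated. The sketch below borrows that teacher's seven elements as labels; the function and data shapes are invented for illustration.

```python
# A sketch of a completeness check for a showcase portfolio against a fixed checklist.

SHOWCASE_CHECKLIST = [
    "letter of introduction",
    "table of contents",
    "skills or knowledge being demonstrated",
    "representative sample of best work",
    "dates on all entries",
    "evaluative criteria or rubric",
    "self-reflection on all entries",
]

def missing_elements(portfolio_contents: set[str]) -> list[str]:
    """Return checklist items not yet present in the portfolio."""
    return [item for item in SHOWCASE_CHECKLIST if item not in portfolio_contents]

submitted = {"table of contents", "representative sample of best work", "dates on all entries"}
print(missing_elements(submitted))
```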

A final purpose for portfolios is evaluation of student status—that is, the
determination of whether students have met previously determined quality
levels of performance. McMillan (2018) points out that when portfolios are
used for this purpose, there must be greater standardization about what should
be included in a portfolio and how the work samples should be appraised.
Typically, teachers select the entries for this kind of portfolio, and consider-
able attention is given to scoring, so that any rubrics employed to score the
portfolios will yield consistent results even if different scorers are involved. For
portfolios being used to evaluate student status, there is usually less need for
self-evaluation of entries—unless such self-evaluations are themselves being
evaluated by others.

Well, we’ve peeked at three purposes underlying portfolio assessment. Can
one portfolio perform all three functions? Many teachers who have used portfolios
will supply a somewhat shaky yes. But if you were to ask those teachers whether one portfolio can perform all three functions well, you ought to get a rock-solid
no. The three functions, though somewhat related—rather like annoying second
cousins—are fundamentally different.

That’s why your very first decision, if you’re going to install portfolios in your
classroom, is to decide on the primary purpose of the portfolios. You can then more
easily determine what the portfolios should look like and how students should
prepare them.

Scriptural scholars sometimes tell us that "No man can serve two masters." Similarly, one kind of portfolio cannot blithely satisfy multiple functions. Some classroom teachers rush into portfolio assessment because they've heard about all of the enthralling things that portfolios can do, only to discover that no single portfolio can do them all. Pick your top-priority purpose and then build a portfolio assessment to satisfy this purpose well.

Work-Sample Selection
For a teacher just joining the portfolio party, another key decision hinges
on identifying the work samples to be put into the portfolios. All too often,
teachers who are novices at portfolio assessment will fail to think divergently
enough about the kinds of entries that should constitute a portfolio’s chief
contents.

But divergency is not necessarily a virtue when it comes to the determination
of a portfolio’s contents. You shouldn’t search for varied kinds of work samples
simply for the sake of variety. What’s important is that the particular kinds of
work samples to be included in the portfolio will allow you to derive valid infer-
ences about the skills and/or knowledge you’re trying to have your students
master. It’s far better to include a few kinds of inference-illuminating work samples
than to include a galaxy of work samples, many of which do not contribute to
your interpretations regarding students’ knowledge or skills.

Remember, as you saw earlier in this text, really rapturous educational
assessment requires that the assessment procedure being employed must exhibit
validity, reliability, and fairness. The need to satisfy this trio of measurement
criteria does not instantly disappear into the ether when portfolio assessment is
adopted. Yes, you still need evidence of valid assessment-based interpretations
for the portfolio assessment’s intended use. Yes, you still need evidence that
your chosen assessment technique yields accurate and reliable scores. And, yes,
you still need evidence attesting to the fundamental fairness of your portfolio
assessment.

One of the most common shortcomings of teachers who, for the first time, hop into the deep end of the portfolio-assessment pool is a tendency to let students' raw "smarts" or their family affluence rule the day. That is, if the
teacher has not done a crackerjack job in clarifying the evaluative criteria by which
to evaluate students’ efforts, then when portfolios are created and subsequently
evaluated, students who inherited jolly good cerebral genes, or whose families are more socioeconomically affluent, will be seen as the winners. Putting it dif-
ferently, the danger is that students’ portfolios will be evaluated dominantly by
what students are bringing to the instructional situation, not by what they have
gained from it. To guard against this understandable bias—after all, what sensible
teacher does not groove on good-looking and content-loaded portfolios—it is often
useful for the teacher to adopt a mental trio of review “lenses” through which to
consider a student’s evolving and, finally, completed portfolio. As always, try to
focus on all three of the sanctioned facets of educational testing: validity, reliabil-
ity, and fairness. With this assessment trinity constantly in your mind, odds are
that you can spot, and then jettison, instances of portfolio invalidity, unreliability,
and unfairness.

Because of the atypicality of portfolio assessment, however, coming up with
such evidence is often more challenging than when relying on more conventional
sorts of tests. Challenging or not, however, the “big three” still apply to portfolio
assessment with their customary force. For a portfolio-assessment system to have
the desired punch, it must be accompanied by meaningful evidence bearing on
its validity, reliability, and fairness.

Appraising Portfolios
As indicated earlier in the chapter, students’ portfolios are almost always evalu-
ated by the use of a rubric. The most important ingredients of such a rubric are
its evaluative criteria—that is, the factors to be used in determining the quality of
a particular student’s portfolio. If there’s any sort of student self-evaluation to be
done, and such self-evaluation is almost always desirable, then it is imperative
that students have access to, and thoroughly understand, the rubric that will be
used to evaluate their portfolios.

As you’ll see when we treat formative assessment in Chapter 12, for certain
applications of formative assessment, students must definitely understand the
rubrics (and those rubrics’ evaluative criteria). However, for all forms of portfolio
assessment, students’ familiarity with rubrics is imperative.

Here’s one quick-and-dirty way to appraise any sort of portfolio assess-
ments you might install in your classroom. Simply ask yourself whether your
students’ portfolios have substantially increased the accuracy of the inferences
you make regarding your students’ skills and knowledge. If your answer is yes,
then portfolio assessment is a winner for you. If your answer is no, then pitch
those portfolios without delay—or guilt. But wait! If your answer resembles
“I’m not sure,” then you really need to think this issue through rigorously and,
quite possibly, collect more evidence bearing on the accuracy of portfolio-based
inferences. Classroom assessment is supposed to contribute to more valid infer-
ences from which better instructional decisions can be made. If your portfolio-
assessment program isn’t clearly doing those things, you may need to make some
serious changes in that program.


The Pros and Cons of Portfolio
Assessment
You must keep in mind that portfolio assessment’s greatest strength is that it can
be tailored to the individual student’s needs, interests, and abilities. Yet, portfolio
assessment suffers from the drawback faced by all constructed-response mea-
surement. Students’ constructed responses are genuinely difficult to evaluate,
particularly when those responses vary from student to student.

As was seen in Vermont’s Portfolio Assessment Program, it is quite dif-
ficult to come up with consistent evaluations of different students’ portfolios.
Sometimes the scoring guides devised for use in evaluating portfolios are so
terse and so general as to be almost useless. They’re akin to Rorschach inkblots,
in which different scorers see in the scoring guide what they want to see. In
contrast, some scoring guides are so detailed and complicated that they simply
overwhelm scorers. It is difficult to devise scoring guides that embody just the
right level of specificity. Generally speaking, most teachers are so busy they
don’t have time to create elaborate scoring schemes. Accordingly, many teach-
ers (and students) sometimes find themselves judging portfolios by employing
fairly loose evaluative criteria. Such criteria tend to be interpreted differently
by different people.


Another problem with portfolio assessment is that it takes time—loads of time—
to carry out properly. Even if you’re very efficient in reviewing your students’
portfolios, you’ll still have to devote many hours both in class (during portfolio
conferences) and outside of class (if you also wish to review your students’ port-
folios by yourself). Proponents of portfolios are convinced that the quality of
portfolio assessment is worth the time such assessment takes. You at least need
to be prepared for the required investment of time if you decide to undertake
portfolio assessment. And teachers will surely need to receive sufficient training
to learn how to carry out portfolio assessment well. Any teachers who set out to
do portfolio assessment by simply stuffing student stuff into containers built for
stuff stuffing will end up wasting their time and their students’ time. Meaningful
professional development is a must whenever portfolio assessment is to work
well. If several teachers in a school will be using portfolio assessment in their
classes, this would be a marvelous opportunity to establish a teacher learning com-
munity in which, on a continuing basis during the year, portfolio-using teachers
meet to share insights and work collaboratively on common problems.

But What Does This Have to Do with Teaching?
Portfolio assessment almost always fundamentally
changes a teacher’s approach to instruction. A
nifty little selected-response test, or even a lengthy
constructed-response exam, can usually be
incorporated into a teacher’s ongoing instructional
program without significantly altering how the teacher
goes about teaching. But that’s not so with portfolio
assessment. Once a teacher hops aboard the
portfolio hay wagon, there’s a long and serious ride
in the offing.

Portfolio assessment, if it’s properly focused
on helping children improve by evaluating their own
work samples, becomes a continuing and central
component of a teacher’s instructional program.
Work samples have to be chosen, scoring rubrics
need to be developed, and students need to be
taught how to use those rubrics to monitor the
evolving quality of their efforts. Portfolio assessment,
therefore, truly dominates most instructional
programs in which it is employed.

The first thing you need to do is decide
whether the knowledge and skills you are trying
to have your students master (especially the
skills) lend themselves to portfolio assessment.

Will there be student work samples that, because
they permit you to make accurate interpretations
about your students’ evolving skill mastery,
could provide the continuing focus for portfolio
assessment?

Some content lends itself delightfully to portfolio
assessment. Some doesn’t. During my first year as
a high school teacher, I taught two English courses
and a speech class. Because my English courses
focused on the promotion of students’ composition
skills, I now believe that a portfolio-assessment
strategy might have worked well in either English
course. My students and I could have monitored
their improvements in being able to write.

But I don’t think I could have used portfolios
effectively in my speech class. At that time, television
itself had just arrived on the West Coast, and
videotaping was unknown. Accordingly, my speech
students and I wouldn’t have had anything to toss
into their portfolios.

The first big question you need to ask yourself,
when it comes to portfolio assessment, is quite simple:
Does this powerful but time-consuming form of
assessment seem suitable for what I’m trying to teach?


On the plus side, however, most teachers who have used portfolios agree
that portfolio assessment provides a strategy for documenting and evaluating
growth happening in a classroom in ways that standardized or written tests can-
not. Portfolios have the potential to create authentic portraits of what students
learn. Chappuis and Stiggins (2017) contend that for portfolios to merge effectively
with instruction, they must have a story to tell. Fortunately, this story can be made compatible
with improved student learning.

If you speak with many of the teachers who use portfolio assessments, you’ll
often find that they are primarily enamored of two of its payoffs. They believe
that the self-evaluation it fosters in students is truly important in guiding students’
learning over time. They also think that the personal ownership students experience
regarding their own work—and the progress they experience—makes the benefits
of portfolio assessment outweigh its costs.

What Do Classroom Teachers Really
Need to Know About Portfolio
Assessment?
As noted at the beginning of this book’s four-chapter foray into item types, the more
familiar you are with different kinds of test items, the more likely you will be to
select an item type that best provides you with the information you need in order to
draw suitable inferences about your students. Until recently, portfolios haven’t been
viewed as a viable assessment option by many teachers. These days, however, port-
folio assessment is increasingly regarded as a legitimate tactic in a teacher’s assess-
ment strategy. And, as noted earlier in the chapter, portfolio assessment appears
to be particularly applicable to the assessment of students with severe disabilities.

You need to realize that if portfolio assessment is going to constitute a helpful
adjunct to your instructional program, portfolios will have to become a central, not
a tangential, part of what goes on in your classroom. The primary premise in port-
folio assessment is that a particularized collection of a student’s evolving work will
allow both the student and you to determine the student’s progress. You can’t gauge
the student’s progress if you don’t have frequent evidence of the student’s efforts.

It would be educationally unwise to select portfolio assessment as a one-time
measurement approach to deal with a short-term instructional objective. Rather,
it makes more sense to select some key curricular aim, such as the student’s abil-
ity to write original compositions, and then monitor this aspect of the student’s
learning throughout the entire school year. It is also important to remember that
although portfolio assessment may prove highly valuable for classroom instruc-
tion and measurement purposes, at this juncture there is insufficient evidence that
it can be used appropriately for large-scale assessment.

A number of portfolio-assessment specialists believe that the most important
dividend from portfolio assessment is the increased ability of students to evaluate
their own work. If this becomes one of your goals in a portfolio-assessment approach,
you must be certain to nurture such self-evaluation growth deliberately via portfo-
lios, instead of simply using portfolios as convenient collections of work samples.

The seven key ingredients in portfolio assessment that were identified in
the chapter represent only one way of installing this kind of assessment strategy.
Variations of those seven suggested procedures are not only possible, but also to
be encouraged. The big thing to keep in mind is that portfolio assessment offers
your students, and you, a way to particularize your evaluation of each student’s
growth over time. And, speaking of time, it’s only appropriate to remind you that
it takes substantially more time to use a portfolio-assessment approach properly
than to score a zillion true–false tests. If you opt to try portfolio assessment, you’ll
have to see whether, in your own instructional situation, it yields sufficient educa-
tional benefits to be worth the investment you’ll surely need to make in it.

Parent Talk
Suppose that Mr. and Mrs. Holmgren, parents of your
student, Harry, stopped by your classroom during a
back-to-school night to examine their son’s portfolio.
After spending almost 30 minutes going through the
portfolio and skimming the portfolios of several other
students, they speak to you—not with hostility, but
with genuine confusion. Mrs. Holmgren sums up their
concerns nicely with the following comments: “When we
stopped by Mr. Bray’s classroom earlier this evening to
see how our daughter, Elissa, is doing, we encountered
a series of extremely impressive portfolios. Elissa’s was
outstanding. To be honest, our son Harry’s portfolio is a
lot less polished. It seems that he’s included everything
he’s ever done in your class, rough drafts as well as final
products. Why is there this difference?”

If I were you, here’s how I’d respond to
Harry’s parents:

“It’s really good that you two could take the time
to see how Harry and Elissa are doing. And I can
understand why you’re perplexed by the differences
between Elissa’s and Harry’s portfolios. You see,
there are different kinds of student portfolios, and
different portfolios serve different purposes.

“In Mr. Bray’s class, and I know this because
we often exchange portfolio tips with one another
during regular meetings of our teacher learning
community, students prepare what are called
showcase portfolios. In such portfolios, students
pick their very best work to show Mr. Bray and
their parents what they’ve learned. I think Mr. Bray
actually sent his students’ showcase portfolios
home about a month ago so you could see
how well Elissa is doing. For Mr. Bray and his
students, the portfolios are collections of best
work that, in a very real sense, celebrate students’
achievements.

“In my class, however, students create working
portfolios in which the real emphasis is on getting
students to make progress and to evaluate this
progress on their own. When you reviewed Harry’s
portfolio, did you see how each entry is dated
and how he had prepared a brief self-reflection of
each entry? I’m more interested in Harry’s seeing
the improvement he makes than in anyone seeing
just polished final products. You, too, can see the
striking progress he’s made over the course of this
school year.

“I’m not suggesting that my kind of portfolio
is better than Mr. Bray’s. Both have a role to
play. Those roles, as I’m sure you’ll see, are quite
different.”

Now, how would you respond to Mr. and
Mrs. Holmgren?


Chapter Summary
After defining portfolios as systematic collections
of students’ work, contrasts were drawn between
portfolio assessment and more conventional test-
ing. It was suggested that portfolio assessment
is far more appropriate for an individual teach-
er’s classroom assessment than for large-scale
accountability assessments.

An emphasis on self-assessment was
suggested as being highly appropriate for port-
folio assessment, particularly in view of the
way portfolios can be tailored to an individual
student’s evolving progress. Seven steps were
then suggested as key ingredients for classroom
teachers to install and sustain portfolio assess-
ment in their classroom: (1) establish student own-
ership, (2) decide what work samples to collect,

(3) collect and store work samples, (4) select
evaluative criteria, (5) require continual student
self-evaluations, (6) schedule and conduct port-
folio conferences, and (7) involve parents in the
portfolio-assessment process.

Three different functions of portfolio assess-
ment were identified—namely, documentation of
student progress, showcasing of student accom-
plishments, and evaluation of student status.
Teachers were urged to select a primary purpose
for portfolio assessment.

The chapter concluded with an identification
of plusses and minuses of portfolio assessment.
It was emphasized that portfolio assessment rep-
resents an important measurement strategy now
available to today’s classroom teachers.

References
Anderson, L. W. (2003). Classroom assessment: Enhancing the quality of teacher decision making. Erlbaum.

Andrade, H. L. (2019). A critical review of research on student self-assessment. Frontiers in Education. Retrieved September 20, 2022, from https://www.frontiersin.org/articles/10.3389/feduc.2019.00087/full

Belgrad, S. E. (2013). Portfolios and e-portfolios:
Student reflection, self-assessment, and
goal setting in the learning process. In J. H.
McMillan (Ed.), SAGE Handbook of research
on classroom assessment (pp. 331–346). SAGE
Publications.

Chappuis, J., & Stiggins, R. J. (2017). An
introduction to student-involved classroom
assessment for learning (7th ed.). Pearson.

Farr, R. (1994). Portfolio and performance assessment:
Helping students evaluate their progress as readers
and writers (2nd ed.). Harcourt Brace & Co.

Farr, R. (1997). Portfolio & performance assessment
language. Harcourt Brace & Co.

Hertz, M. B. (2020, January 6). Tools for creating digital student portfolios. Edutopia. https://www.edutopia.org/article/tools-creating-digital-student-portfolios

McMillan, J. H. (2018). Classroom assessment:
Principles and practice for effective standards-based
instruction (7th ed.). Pearson.

Miller, M. D., & Linn, R. (2013). Measurement and
assessment in teaching (11th ed.). Pearson.

Tierney, R. J., Carter, M. A., & Desai, L. E. (1991).
Portfolio assessment in the reading-writing
classroom. Christopher-Gordon.


A Testing Takeaway

Portfolio Assessment: Finding a Sustainable Focus*
W. James Popham, University of California, Los Angeles

A portfolio is a systematic collection of one’s work. In education, portfolios are systematic
collections of a student’s work products—and portfolio assessment is the appraisal of a
student’s evolving skills based on those collected products. Although a few states have
attempted to use portfolio assessment in a statewide accountability system, the inability
to arrive at sufficiently reliable scores for students’ work products has led to portfolio
assessment’s being used almost exclusively at the classroom level.

Classroom portfolio assessment takes time and energy, both from the teacher and from
the teacher’s students. Everyone can get worn out making classroom portfolio assessment
work. Yet classroom portfolio assessment can promote genuinely important learning
outcomes for students. Accordingly, the challenge is to select a sustainable structure for
classroom portfolio assessment.

Teachers who report the most success when employing portfolio assessment in their
classrooms often focus their instructional efforts on enhancing their students’ self-evaluation
abilities. Depending on the skill(s) being promoted, a long-term collection of a student’s
evolving work products typically emerges—and the student can then evaluate those
products using whatever evaluative system has been adopted. Over time, then, and
with suitable guidance from the teacher, students’ self-evaluation skills are often seriously
sharpened.

One of the most common targets for portfolio assessment is the improvement of students’
composition skills. Although portfolio assessment can be used in many other content arenas,
let’s use composition as an example.

Early on, after the teacher explains the structure of the portfolio-evaluation system, the
teacher describes the rubric—that is, the scoring guide to be employed in evaluating students’
compositions. Scrutiny of the rubric allows students to become conversant with the criteria
to be used as they evaluate their own compositions. Thereafter, at designated intervals, portfolio
conferences are scheduled between each student and the teacher. These conferences are
frequently led by the students themselves as they evaluate their portfolios’ work products.

Many teachers who employ this approach to portfolio assessment report meaningful
success in promoting their students’ self-evaluation skills. Yet some teachers who have
attempted to use portfolio assessment have abandoned the process simply because it takes
too much time. It is the teacher’s challenge, then, to devise a way to assess portfolios that is
truly sustainable, year after year, so the substantial dividends of portfolio assessment can be
achieved while, at the same time, maintaining the teacher’s sanity.

*From Chapter 9 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from Pearson.


Chapter 8

Performance
Assessment

Chief Chapter Outcome

A sufficiently broad understanding of performance assessment
to distinguish between accurate and inaccurate statements
regarding the nature of performance tests, the identification
of tasks suitable for such tests, and the scoring of students’
performances

Learning Objectives

8.1 Identify three defining characteristics of performance tests intended to
maximize their contributions to teachers’ instructional effectiveness and
describe how they should be applied.

8.2 Isolate the substantive distinctions among three different procedures for
evaluating the quality of students’ responses to performance assessments,
that is, via task-specific, hypergeneral, and skill-focused rubrics.

During the early 1990s, a good many educational policymakers became
enamored of performance assessment, which is an approach to measuring
a student’s status based on the way the student completes a specified task.
Theoretically, of course, when the student chooses between true and false for
a binary-choice item, the student is completing a task, although an obviously
modest one. But the proponents of performance assessment have measure-
ment schemes in mind that are meaningfully different from binary-choice and
multiple-choice tests. Indeed, it was dissatisfaction with traditional paper-
and-pencil tests that caused many educators to travel with enthusiasm down
the performance-testing trail.


What Is a Performance Test?
Before digging into what makes performance tests tick and how you might use
them in your own classroom, we’d best explore the chief attributes of such an
assessment approach. Even though all educational tests require students to per-
form in some way, when most educators talk about performance tests, they are
thinking about assessments in which the student is required to construct an origi-
nal response. It may be useful for you to regard a performance test as an assess-
ment task in which students make products or engage in behaviors other than
answering selected-response or constructed-response items. Often, an examiner
(such as the teacher) observes the process of construction, in which case observa-
tion of the student’s performance and judgment about the quality of that perfor-
mance are required. Frequently, performance tests call for students to generate
some sort of specified product whose quality can then be evaluated.

More than a half-century ago, Fitzpatrick and Morrison (1971) observed that
“there is no absolute distinction between performance tests and other classes of
tests.” They pointed out that the distinction between performance assessments
and more conventional tests is chiefly the degree to which the examination simu-
lates the real-world criterion situation—that is, the extent to which the examina-
tion calls for student behaviors approximating those about which we wish to
make inferences.

Suppose, for example, a teacher who had been instructing students in the
process of collaborative problem solving wanted to see whether students had
acquired this collaborative skill. The inference at issue centers on the extent to
which each student has mastered the skill. The educational decision on the line
might be whether particular students need additional instruction or, on the con-
trary, it’s time to move on to other, possibly more advanced, curricular aims.
The teacher’s real interest, then, is in how well students can work with other
students to arrive collaboratively at solutions to problems. In Figure 8.1, you will
see several assessment procedures that could be used to get a fix on a student’s
collaborative problem-solving skills. Note that the two selected-response assess-
ment options (numbers 1 and 2) don’t really ask students to construct anything.
For the other three constructed-response assessment options (numbers 3, 4, and 5),
however, there are clear differences in the degree to which the task presented to
the student coincides with the class of tasks called for by the teacher’s curricular
aim. Assessment Option 5, for example, is obviously the closest match to the
behavior called for in the curricular aim. Yet, Assessment Option 4 is surely more
of a “performance test” than is Assessment Option 1.

It should be apparent to you, then, that different educators will be using
the phrase performance assessment to refer to very different kinds of assessment
approaches. Many teachers, for example, are willing to consider short-answer
and essay tests as a form of performance assessment. In other words, those teach-
ers essentially equate performance assessment with any form of constructed-
response assessment. Other teachers establish more stringent requirements for a
measurement procedure to be accurately described as a performance assessment.
For example, some performance-assessment proponents contend that genuine
performance assessments must exhibit at least three features:

• Multiple evaluative criteria. The student’s performance must be judged using
more than one evaluative criterion. To illustrate, a student’s ability to speak
Spanish might be appraised on the basis of the student’s accent, syntax, and
vocabulary.

• Prespecified quality standards. Each of the evaluative criteria on which a stu-
dent’s performance is to be judged is clearly explicated in advance of judging
the quality of the student’s performance.

• Judgmental appraisal. Unlike the scoring of selected-response tests in which
electronic computers and scanning machines can, once programmed, carry on
without the need of humankind, genuine performance assessments depend
on human judgments to determine how acceptable a student’s performance
really is.

Figure 8.1 A Set of Assessment Options That Vary in the Degree to Which a Student’s Task Approximates the Curricularly Targeted Behavior

Curricular Aim: Students can solve problems collaboratively.

Assessment Options:
1. Students respond to true–false questions about the best procedures to follow in group problem solving.
2. Students answer a series of multiple-choice items about the next steps to take when solving problems in groups.
3. Students are asked a series of questions regarding ways of solving problems collaboratively, then asked to supply short answers to the questions.
4. Students are given a new problem, then asked to write an essay regarding how a group should go about solving it.
5. Students work in small groups to solve previously unencountered problems. The teacher observes and judges their efforts.

Looking back to Figure 8.1, it is clear that if the foregoing three requirements
were applied to the five assessment options, then Assessment Option 5 would
qualify as a performance test, and Assessment Option 4 probably would as well,
but the other three assessment options wouldn’t qualify under a definition of
performance assessment requiring the incorporation of multiple evaluative
criteria, prespecified quality standards, and judgmental appraisals.

A good many advocates of performance assessment would prefer that the
tasks presented to students represent real-world rather than school-world kinds of
problems. Other proponents of performance assessment would be elated simply
if more school-world measurement was constructed response rather than selected
response in nature. Still other advocates of performance testing want the tasks
in performance tests to be genuinely demanding—that is, way up the ladder of
cognitive difficulty. In short, proponents of performance assessment often advo-
cate somewhat different approaches to measuring students on the basis of those
students’ performances.

In light of the astonishing advances we now see every few months in the sorts
of computer-delivered stimuli for various kinds of assessment—performance tests
surely included—the potential nature of performance-test tasks seems practi-
cally unlimited. For example, the possibility of digitally simulating a variety of
authentic performance-test tasks provides developers of performance tests with
an ever-increasing range of powerful performance assessments placing students
in a test-generated “virtual world.”

You’ll sometimes encounter educators who use other phrases to describe
performance assessment. For example, they may use the label authentic
assessment (because the assessment tasks more closely coincide with real-
life, nonschool tasks) or the label alternative assessment (because such assess-
ments constitute an alternative to traditional, paper-and-pencil tests). In the
next chapter, we’ll be considering portfolio assessment, which is a particular
type of performance assessment and should not be considered a synony-
mous descriptor for the performance-assessment approach to educational
measurement.

To splash a bit of reality juice on this chapter, it may be helpful for you to
recognize a real-world fact about educational performance assessment. Here it
goes: Although most educators regard performance testing as a more effective way to
measure students’ mastery of important skills than many traditional testing tactics,
in recent years the financial demands of such testing have rendered it nonexistent
or, at best, tokenistic in many settings. Yes, as the chapter probes the innards of this
sort of educational testing, you will discover that it embodies some serious advantages,
particularly its contributions to instruction. Yet, you will also see that full-fledged
reliance on performance testing for many students, such as we see in states’ annual
accountability tests, is often prohibitively expensive. More affordable,
of course, is teachers’ use of this potent assessment approach for their own class-
room assessments.

We now turn to the twin issues that are at the heart of performance assess-
ments: selecting appropriate tasks for students and, once the students have tackled
those tasks, judging the adequacy of students’ responses.


Identifying Suitable Tasks
for Performance Assessment
Performance assessment typically requires students to respond to a small number
of more significant tasks, rather than to a large number of less significant tasks.
Thus, rather than answering 50 multiple-choice items on a conventional chemistry
examination, students who are being assessed via performance tasks may find
themselves asked to perform an actual experiment in their chemistry class, and
then prepare a written interpretation of the experiment’s results—along with an
analytic critique of the procedures they used. From the chemistry teacher’s per-
spective, instead of seeing how students respond to the 50 “mini-tasks” represented
in the multiple-choice test, an estimate of each student’s status must be derived
from the student’s response to a single, complex task. Given the significance of
each task used in a performance-testing approach to classroom assessment, it is
apparent that great care must be taken in the selection of performance-assessment
tasks. Generally speaking, classroom teachers will either have to (1) generate their
own performance-test tasks or (2) select performance-test tasks from the increasing
number of tasks currently available from educators elsewhere.


Decision Time
Grow, Plants, Grow!

Sofia Esposito is a third-year biology teacher in
Kennedy High School. Because she has been
convinced by several of her colleagues that
traditional paper-and-pencil examinations fail to
capture the richness of the scientific experience,
Sofia has decided to base most of her students’
grades on a semester-long performance test. As
Sofia contemplates her new assessment plan, she
decides that 90 percent of the students’ grades
will stem from the quality of their responses to the
performance test’s task; 10 percent of the grades
will be linked to classroom participation and to a few
short true–false quizzes administered throughout the
semester.

The task embodied in Sofia’s performance test
requires each student to design and conduct a
2-month experiment to study the growth of three
identical plants under different conditions, and then
prepare a formal scientific report describing the
experiment. Although most of Sofia’s students carry
out their experiments at home, several students use
the shelves at the rear of the classroom for their
experimental plants. A number of students vary
the amount of light or the kind of light received by
the different plants, but most students modify the
nutrients given to their plants. After a few weeks
of the 2-month experimental period, all of Sofia’s
students seem to be satisfactorily under way with
their experiments.

Several of the more experienced teachers in the
school, however, have expressed their reservations
to Sofia about what they regard as “overbooking
on a single assessment experience.” The teachers
suggested to Sofia that she will be unable to draw
defensible inferences about her students’ true
mastery of biological skills and knowledge on the
basis of a single performance test. They urged
her to reduce dramatically the grading weight for
the performance test so that, instead, additional
grade-contributing exams can also be given to the
students.

Other colleagues, however, believe Sofia’s
performance-test approach is precisely what
is needed in courses such as biology. They
recommended that she “stay the course” and
alter “not one word” of her new assessment
strategy.

If you were Sofia, what would your
decision be?

Inferences and Tasks
Consistent with the frequently asserted message of this text about classroom
assessment, the chief determinants of how you assess your students are (1) the
inference—that is, the interpretation—you want to make about those students
and (2) the decision that will be based on the inference. For example, sup-
pose you’re a history teacher and you’ve spent a summer at a lakeside cabin
meditating about curricular matters (which, in one lazy setting or another, is
the public’s perception of how most teachers spend their summer vacations).
After three months of heavy curricular thought, you have concluded that what
you really want to teach your students is to apply historical lessons of the
past to the solution of current and future problems, which, at least to some
extent, parallel the problems of the past. You have decided to abandon your
week-long, 1500-item true–false final examination, which your stronger
students refer to as a “measurement marathon” and your weaker students
describe by using a rich, if earthy, vocabulary. Instead of using true–false items,
you are now committed to a performance-assessment strategy, and you wish
to select tasks for your performance tests. You want the performance test to
help you infer how well your students can draw on the lessons of the past to
illuminate their approach to current and/or future problems.

In Figure 8.2, you will see a graphical depiction of the relationships among
(1) a teacher’s key curricular aim, (2) the inference that the teacher wishes to draw
about each student, and (3) the tasks for a performance test intended to secure
data to support the inference that the teacher wishes to make. As you will note,
the teacher’s curricular aim provides the source for the inference. The assessment
tasks yield the evidence needed for the teacher to arrive at defensible inferences
regarding the extent to which students can solve current or future problems using
historical lessons. Based on the degree to which students have mastered the curricular
aim, the teacher will reach a decision about how much more instruction, if any, is needed.

Figure 8.2 Relationships Among a Teacher’s Key Curricular Aim, the Assessment-Based Inference Derived from the Aim, and the Performance-Assessment Tasks Providing Evidence for the Inference

The figure depicts three linked elements: a key Curricular Aim (e.g., students can use historical lessons to solve current/future problems); the Inferred Student Status derived from that aim (e.g., students’ ability to illuminate current/future problems with relevant historical lessons is inferred); and Student Responses to Performance Tasks, which supply the evidence for that inference (e.g., students are given a current/future problem, then asked to solve it using historically derived insights).

The Generalizability Dilemma
One of the most serious difficulties with performance assessment is that, because
students respond to fewer tasks than would be the case with conventional
paper-and-pencil testing, it is often more difficult to generalize accurately about
what skills and knowledge are possessed by the student. To illustrate, let’s say
you’re trying to get a fix on your students’ ability to multiply pairs of double-
digit numbers. If, because of your instructional priorities, you can devote only a
half-hour to assessment purposes, you could require the students to respond to
20 such multiplication problems in the 30 minutes available. (That’s probably
more problems than you’d need, but this example attempts to draw a vivid con-
trast for you.) From a student’s responses to 20 multiplication problems, you can
get a pretty fair idea about what kind of double-digit multiplier the student is. As
a consequence of the student’s performance on a reasonable sample of items repre-
senting the curricular aim, you can sensibly conclude that “Xavier really knows
how to multiply when facing those sorts of problems,” or “Fred really couldn’t
solve double-digit multiplication problems if his life depended on it.” It is
because you have adequately sampled the kinds of student performance (about
which you wish to make an inference) that you can confidently make inferences
about your students’ abilities to solve similar sorts of multiplication problems.

With only a 30-minute assessment period available, however, if you moved
to a more elaborate kind of performance test, you might only be able to have
students respond to one big-bopper item. For example, if you presented a
multiplication-focused mathematics problem involving the use of manipulatives,
and wanted your students to derive an original solution and then describe it in
writing, you’d be lucky if your students could finish the task in half an hour.
Based on this single task, how confident would you be in making inferences about
your students’ abilities to perform comparable multiplication tasks?

And this, as you now see, is the rub with performance testing. Because stu-
dents respond to fewer tasks, the teacher is put in a trickier spot when it comes
to deriving accurate interpretations about students’ abilities. If you use only one
performance test, and a student does well on the test, does this mean the stu-
dent really possesses the category of skills the test was designed to measure, or
did the student just get lucky? On the other hand, if a student messes up on a
single performance test, does this signify that the student really doesn’t possess
the assessed skill, or was there a feature in this particular performance test that
misled the student who, given other tasks, might have performed marvelously?

As a classroom teacher, you’re faced with two horns of a classic measurement
dilemma. Although performance tests often measure the kinds of student abilities
you’d prefer to assess (because those abilities are in line with really worthwhile curric-
ular aims), the inferences you make about students on the basis of their responses to
performance tests must often be made with increased caution. As with many dilem-
mas, there may be no perfect way to resolve this dilemma. But there is, at least, a way
of dealing with the dilemma as sensibly as you can. In this instance, the solution strat-
egy is to devote great care to the selection of the tasks embodied in your performance
tests. Among the most important considerations in selecting such tasks is to choose
tasks that optimize the likelihood of accurately generalizing about your students’
capabilities. If you really keep generalizability at the forefront of your thoughts when
you select or construct performance-test tasks, you’ll be able to make the strongest
possible performance-based inferences about your students’ capabilities.
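
To see the sampling logic behind this dilemma in concrete terms, consider a minimal simulation sketch, assuming a hypothetical student whose true mastery level is 0.75 and purely illustrative item counts: estimates based on 20 short items cluster tightly around the student's actual status, whereas estimates based on a single task swing between the extremes.

```python
# A minimal sketch, assuming a hypothetical student whose "true" mastery level is
# 0.75: compare how much the observed proportion correct varies when it is based
# on 20 short items versus a single larger task. All numbers here are illustrative.
import random
import statistics

random.seed(42)

TRUE_MASTERY = 0.75   # assumed probability of handling any one task correctly
OCCASIONS = 10_000    # simulated assessment occasions

def observed_proportion(num_items: int) -> float:
    """Proportion correct on one assessment occasion made up of num_items items."""
    correct = sum(random.random() < TRUE_MASTERY for _ in range(num_items))
    return correct / num_items

twenty_item_estimates = [observed_proportion(20) for _ in range(OCCASIONS)]
single_task_estimates = [observed_proportion(1) for _ in range(OCCASIONS)]

print("20-item estimate: mean =", round(statistics.mean(twenty_item_estimates), 3),
      " SD =", round(statistics.stdev(twenty_item_estimates), 3))
print("1-task estimate:  mean =", round(statistics.mean(single_task_estimates), 3),
      " SD =", round(statistics.stdev(single_task_estimates), 3))
```

Both estimates center on the student's actual status, but the single-task estimate is far more variable, which is why so much care must go into choosing tasks that support generalizable inferences.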

Factors to Consider When Evaluating
Performance-Test Tasks
We’ve now looked at what many measurement specialists regard as the most
important factor you can consider when judging potential tasks for performance
assessments—generalizability. Let’s look at a set of seven such factors you might
wish to consider, whether you select a performance-test task from existing tasks
or create your own performance-test tasks anew.


Evaluative Criteria for Performance-Test Tasks

• Generalizability. Is there a high likelihood that the students’ performance on
the task will generalize to comparable tasks?

• Authenticity. Is the task similar to what students might encounter in the real
world, as opposed to encountering such a task only in school?

• Multiple foci. Does the task measure multiple instructional outcomes instead
of only one?

• Teachability. Is the task one that students can become more proficient in as
a consequence of a teacher’s instructional efforts?

• Fairness. Is the task fair to all students—that is, does the task avoid bias
based on such personal characteristics as students’ gender, ethnicity, or socio-
economic status?

• Feasibility. Is the task realistically implementable in relation to its cost, space,
time, and equipment requirements?

• Scorability. Is the task likely to elicit student responses that can be reliably
and accurately evaluated?

Whether you’re developing your own tasks for performance tests or select-
ing such tasks from an existing collection, you may wish to apply some but not
all the factors listed here. Some teachers try to apply all seven factors, although
they occasionally dump the authenticity criterion or the multiple foci criterion. In
some instances, for example, school tasks rather than real-world tasks might be
suitable for the kinds of inferences a teacher wishes to reach, so the authenticity
criterion may not be all that relevant. And even though it is often economically
advantageous to measure more than one outcome at one time, particularly con-
sidering the time and effort that goes into almost any performance test, there
may be cases in which a single educational outcome is so important that it war-
rants a solo performance test. More often than not, though, a really good task
for a performance test will satisfy most, if not all, of the seven evaluative criteria
presented here.
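
For teachers who want a quick way to keep these seven criteria in view while screening candidate tasks, a minimal checklist sketch follows; the criterion names come from the list above, while the yes/no judgments recorded for the sample task are hypothetical.

```python
# A minimal sketch of a task-screening checklist built from the seven evaluative
# criteria listed above. The judgments recorded for the sample task are
# hypothetical; a teacher would supply their own calls for each candidate task.

CRITERIA = [
    "generalizability", "authenticity", "multiple foci",
    "teachability", "fairness", "feasibility", "scorability",
]

def criteria_in_doubt(judgments: dict) -> list:
    """Return the criteria a candidate task fails or that were left unjudged."""
    return [criterion for criterion in CRITERIA if not judgments.get(criterion, False)]

# Hypothetical screening of a semester-long lab-experiment task.
candidate_task = {
    "generalizability": True,
    "authenticity": True,
    "multiple foci": True,
    "teachability": True,
    "fairness": True,
    "feasibility": False,   # class time and equipment look like sticking points
    "scorability": True,
}

print("Criteria still in doubt:", criteria_in_doubt(candidate_task))
```

As noted above, not every criterion need apply to every task; dropping authenticity or multiple foci from the checklist is sometimes a defensible call.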

Performance Tests and Teacher Time
Back in Chapter 1, you were promised that if you completed this text with a rea-
sonable degree of attentiveness, you’d become a better teacher. Now, here comes
another promise for you: This text promises to be honest about the measurement
mysteries being dissected. And this brings us to an important consideration
regarding performance testing. Put briefly, it takes time!

Think for a moment about the time that you, as the teacher, must devote to
(1) the selection of suitable tasks, (2) the development of an appropriate scheme
for scoring students’ responses, and (3) the actual scoring of students’ responses.
Talk to any teacher who has already used many classroom performance tests, and
you’ll learn that it requires a ton of time to use performance assessment.


So here’s one additional factor you should throw into your decision making
about performance assessment. It is the significance of the skill you’re using the
performance test to assess. Because you’ll almost certainly have time for only a
handful of such performance tests in your teaching, make sure that every per-
formance test you use is linked to a truly significant skill you’re trying to have
your students acquire. If performance assessments aren’t based on genuinely
demanding skills, you’ll soon stop using them because, to be truthful, they’ll be
far more trouble than they’re worth.

The Role of Rubrics
Performance assessments invariably are based on constructed-response measure-
ment procedures in which students generate rather than select their responses.
Student-constructed responses must be scored, however, and it’s clearly much
tougher to score constructed responses than to score selected responses. The
scoring of constructed responses centers on the evaluative criteria one calls on to
determine the adequacy of students’ responses. Let’s turn our attention now to
the evaluative factors we employ to decide whether students’ responses to per-
formance tests are splendid or shabby.

A criterion, according to most people’s understanding, is a standard on which
a decision or judgment can be based. In the case of scoring students’ responses to
a performance test’s task, you’re clearly trying to make a judgment regarding the
adequacy of the student’s constructed response. The specific criteria to be used
in making that judgment will obviously influence the way you score a student’s
response. For instance, if you were scoring a student’s written composition on
the basis of organization, word choice, and communicative clarity, you might
arrive at very different scores than if you had scored the composition on the basis
of spelling, punctuation, and grammar. The evaluative criteria used when scor-
ing students’ responses to performance tests (or their responses to any kind of
constructed-response item) really control the whole assessment game.

Referring to my previously mentioned and oft-whined compulsion to fall
back on my 5 years of Latin studies in high school and college, I feel responsible
for explaining that the Latin word criterion is singular and the Latin word criteria
is plural. Unfortunately, so many of today’s educators mix up these two terms
that I don’t even become distraught about it anymore. However, now that you
know the difference, if you find any of your colleagues erroneously saying “the
criteria is” or “the criterion were,” you have permission to display an ever so
subtle, yet altogether condescending, smirk.

The scoring procedures for judging students’ responses to performance tests
are usually referred to these days as scoring rubrics or, more simply, as rubrics.
A rubric that’s used to score students’ responses to a performance assessment has,
at minimum, three important features:


• Evaluative criteria. These are the factors to be used in determining the quality
of a student’s response.

• Descriptions of qualitative differences for all evaluative criteria. For each evalua-
tive criterion, a description must be supplied so that qualitative distinctions
in students’ responses can be made using the criterion.

• An indication of whether a holistic or analytic scoring approach is to be used. The
rubric must indicate whether the evaluative criteria are to be applied collec-
tively in the form of holistic scoring, or on a criterion-by-criterion basis in the
form of analytic scoring.

The identification of a rubric’s evaluative criteria is probably the most important
task for rubric developers. If you’re creating a rubric for a performance test that you
wish to use in your own classroom, be careful not to come up with a lengthy laundry
list of evaluative criteria which a student’s response should satisfy. Many seasoned
scorers of students’ performance tests believe that when you isolate more than three
or four evaluative criteria per rubric, you’ve identified too many. If you find your-
self facing more than a few evaluative criteria, simply rank all the criteria in order of
importance, and then chop off those listed lower than the very most important ones.

The next job you’ll have is deciding how to describe in words what a student’s
response must display in order to be judged wonderful or woeful. The level of
descriptive detail you apply needs to work for you. Remember, you’re devising a
scoring rubric for your own classroom, not for a statewide or national test. Reduce
the aversiveness of the work by employing brief descriptors of quality differences that
you can use when teaching and that, if you’re instructionally astute, your students can
use as well. However, because for instructional purposes you will invariably want
your students to understand the meaning of each evaluative criterion and its worth,
give some thought to how effectively the different quality labels you choose and the
descriptions of evaluative criteria you employ can be communicated to students.

Finally, you’ll need to decide whether you’ll make a single, overall judg-
ment about a student’s response by considering all of the rubric’s evaluative cri-
teria as an amalgam (holistic scoring) or, instead, award points to the response
on a criterion-by-criterion basis (analytic scoring). The virtue of holistic scoring,
of course, is that it’s quicker to do. The downside of holistic scoring is that it fails
to communicate to students, especially low-performing students, the nature of
their shortcomings. Clearly, analytic scoring is more likely than holistic scoring
to yield diagnostically pinpointed scoring and sensitive feedback. Some class-
room teachers have attempted to garner the best of both worlds by scoring all
responses holistically and then analytically rescoring (for feedback purposes) all
of the responses made by low-performing students.

Because most performance assessments call for fairly complex responses from
students, there will usually be more than one evaluative criterion employed to
score students’ responses. For each of the evaluative criteria chosen, a numerical
scale is typically employed so that, for each criterion, a student’s response might
be assigned a specified number of points—for instance, 0 to 6 points. Usually,
these scale points are accompanied by verbal descriptors, but sometimes they
aren’t. For instance, in a 5-point scale, the following descriptors might be used:
5 = Exemplary, 4 = Superior, 3 = Satisfactory, 2 = Weak, 1 = Inadequate.

If no verbal descriptors are used for each score point on the scale, a scheme such
as the following might be employed:

Excellent   6   5   4   3   2   1   0   Unsatisfactory

In some cases, the scoring scale for each criterion is not numerical; that is, it
consists only of verbal descriptors such as “exemplary,” “adequate,” and so on.
Although such verbal scales can be useful with particular types of performance
tests, their disadvantage is that scores from multiple criteria often cannot be added
together in order to produce a meaningful, diagnostically discernible, overall score.
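
To tie these pieces together, here is a minimal sketch of an analytic scoring report, assuming the composition criteria mentioned earlier in the chapter (organization, word choice, and communicative clarity) and the 1–5 descriptor labels shown above; the sample scores are hypothetical.

```python
# A minimal sketch of analytic scoring with a per-criterion point scale, assuming
# the 1-5 descriptor labels described above. The criteria and the sample scores
# are hypothetical illustrations, not a prescribed rubric.

DESCRIPTORS = {5: "Exemplary", 4: "Superior", 3: "Satisfactory", 2: "Weak", 1: "Inadequate"}

def analytic_report(scores: dict) -> str:
    """Report a criterion-by-criterion breakdown plus an overall point total."""
    lines = [f"{criterion}: {points} ({DESCRIPTORS[points]})"
             for criterion, points in scores.items()]
    lines.append(f"total: {sum(scores.values())} of {5 * len(scores)}")
    return "\n".join(lines)

# Hypothetical scores for one student's essay.
essay_scores = {"organization": 4, "word choice": 3, "communicative clarity": 5}
print(analytic_report(essay_scores))
```

A holistic scorer would instead record a single overall level for the response; the criterion-by-criterion breakdown above is what makes analytic scoring the more diagnostically pinpointed of the two approaches.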

The heart of the intellectual process in isolating suitable evaluative criteria
is to get to the essence of what the most significant factors are that distinguish
acceptable from unacceptable responses. In this assessment instance, as in many
others, less is more. A few truly important criteria are preferable to a ton of trifling
criteria. Go for the big ones! If you need help in deciding which criteria to employ
in conjunction with a particular performance test, don’t be reluctant to ask col-
leagues to toss in their ideas regarding what significant factors to use (for a given
performance test) in order to discriminate between super and subpar responses.

In this chapter you’ll be presented with scoring rubrics that can serve as use-
ful models for your own construction of such rubrics. In those illustrative scoring
rubrics, you’ll see that care has been taken to isolate a small number of instruction-
ally addressable evaluative criteria. The greatest payoff from a well-formed scoring
rubric, of course, is in its contribution to a teacher’s improved instruction.

To give you a better idea about the kinds of tasks that might be used in
a performance test, and the way you might score a student’s responses, let’s
take a gander at illustrative tasks for a performance test. The tasks presented in
Figure 8.3 are intended to assess students’ mastery of an oral communication skill.

Figure 8.3 Task Description and Sample Tasks for a One-on-One Oral Communication Performance Assessment

An Oral Communication Performance Test

Introduction
There are numerous kinds of speaking tasks students must perform in everyday life, both in school and out of school. This performance assessment focuses on some of these tasks—namely, describing objects, events, and experiences; explaining the steps in a sequence; providing information in an emergency; and persuading someone.
In order to accomplish a speaking task, the speaker must formulate and transmit a message to a listener. This process involves deciding what needs to be said, organizing the message, adapting the message to the listener and situation, choosing language to convey the message and, finally, delivering the message. The effectiveness of the speaker may be rated according to how well the speaker meets the requirement of the task.

Sample Tasks
Description Task: Think about your favorite class or extracurricular activity in school. Describe to me everything you can about it so I will know a lot about it. (How about something like a school subject, a club, or a sports program?)

Emergency Task: Imagine you are home alone and you smell smoke. You call the fire department and I answer your call. Talk to me as if you were talking on the telephone. Tell me everything I would need to know to get help to you. (Talk directly to me; begin by saying hello.)

Sequence Task: Think about something you know how to cook. Explain to me, step by step, how to make it. (How about something like popcorn, a sandwich, or scrambled eggs?)

Persuasion Task: Think about one change you would like to see made in your school, such as a change in rules or procedures. Imagine I am the principal of your school. Try to convince me the school should make this change. (How about something like a change in the rules about hall passes or the procedures for enrolling in courses?)

Source: Based on assessment efforts of the Massachusetts Department of Education.

Rubrics: The Wretched and
the Rapturous
Unfortunately, many educators believe a rubric is a rubric is a rubric. Not so.
Rubrics differ dramatically in their instructional worth. We’ll now consider two
sorts of rubrics that are sordid and one sort that’s super.

Task-Specific Rubrics
Some rubrics are created so that their evaluative criteria are linked only to
the particular task embodied in a specific performance test. These are called
task-specific rubrics. Such a rubric does little to illuminate instructional decision
making, because it implies that students’ performances on the constructed-
response test’s specific task are what’s important. They’re not! What’s important
is the student’s ability to perform well on the class of tasks that can be accom-
plished by using the skill being measured by the assessment. If students are
taught to shine only on the single task represented in a performance test, but
not on the full range of comparable tasks, students lose out. If students learn to
become good problem solvers, they’ll be able to solve all sorts of problems—
not merely the single (sometimes atypical) problem embodied in a particular
performance test.

So, to provide helpful instructional illumination—that is, to assist the teacher’s
instructional decision making—a rubric’s evaluative criteria dare not be task spe-
cific. Instead, those criteria must be rooted in the skill itself. This can be illustrated
by your considering an evaluative criterion frequently employed in the rubrics
used to judge students’ written communication skills—namely, the evaluative
criterion of organization. Presented here is a task-specific evaluative criterion that
might be used in a scoring rubric to judge the quality of a student’s response to
the following task:

“Compose a brief narrative essay, of 400 to 600 words, describing what happened
during yesterday’s class session when we were visited by the local firefighters
who described a fire-exit escape plan for the home.”

A task-specific evaluative criterion for judging the organization of stu-
dents’ narrative essays: Superior essays will (1) commence with a recounting
of the particular rationale for home fire-escape plans the local firefighters pro-
vided, then (2) follow up with a description of the six elements in home-safety
plans in the order in which those elements were presented, and (3) conclude by
citing at least three of the life–death safety statistics the firefighters provided at
the close of their visit. Departures from these three organizational elements will
result in lower evaluations of essays.

Suppose you’re a teacher who has generated or been given this task-specific
evaluative criterion. How would you organize your instruction? If you really tried
to gain instructional guidance from this evaluative criterion, you’d be aiming your
instruction directly at a particular task—in this case, a narrative account of the
firefighters’ visit to your class. You’d need to teach your students to commence
their narrative essays with a rationale when, in fact, using a rationale for an intro-
duction might be inappropriate for other sorts of narrative essays. Task-specific
evaluative criteria do not help teachers plan instructional sequences that promote
their students’ abilities to generalize the skills they acquire. Task-specific criteria
are just what they are touted to be—specific to one task.

During recent years, educators have seen a number of rubrics that classroom
teachers have been required to generate on their own. From an instructional per-
spective, many of these rubrics don't pass muster: they are simply swimming
in a sea of task specificity. Such task-specific rubrics may make
it easier to score students’ constructed responses. Thus, for scoring purposes, the
more specific evaluative rubrics are, the better. But task-specific rubrics do not
provide teachers with the kinds of instructional insights that good rubrics should.
That is, they fail to provide sufficient clarity that allows a teacher to organize and
deliver weakness-focused, on-target instruction.

You’ll not be surprised to learn that when the nation’s large testing firms set
out to score thousands of students’ responses to performance tasks, they almost
always employ task-specific rubrics. Such task-focused scoring keeps down costs.
But regrettably, the results of such scoring typically supply teachers with few
instructional insights.

Hypergeneral Rubrics
Another variety of scoring rubric that will not help teachers plan their instruction
is referred to as a hypergeneral rubric. Such a rubric is one in which the evaluative
criteria are described in exceedingly general and amorphous terms. The evalu-
ative criteria are so loosely described that, in fact, teachers’ instructional plans
really aren’t benefited. For example, using the previous example of a task calling
for students to write a narrative essay, a hypergeneral evaluative criterion for
organization might resemble the following one:

A hypergeneral evaluative criterion for judging the organization of students’
narrative essays: Superior essays are those in which the essay’s content has been
arranged in a genuinely excellent manner, whereas inferior essays are those
displaying altogether inadequate organization. An adequate essay is one repre-
senting a lower organizational quality than a superior essay, but a higher organi-
zational quality than an inferior essay.

You may think that you’re being put on when you read the illustrative hyper-
general evaluative criterion just given. But you’re not. Most seasoned teachers
have seen many such hypergeneral scoring rubrics. These rubrics ostensibly
clarify how teachers can score students’ constructed responses, but really these
vague rubrics do little more than loosely redefine such quality descriptors as
superior, proficient, and inadequate. Hypergeneral rubrics often attempt to draw
distinctions among students’ performances no more clearly than among students’
performances in earning grades of A through F. Hypergeneral rubrics provide
teachers with no genuine benefits for their instructional planning, because such
rubrics do not give the teacher meaningfully clarified descriptions of the criteria
to be used in evaluating the quality of students’ performances—and, of course, in
promoting students’ abilities to apply a rubric’s evaluative criteria.

Robert Marzano, one of our field’s most respected analysts of educational research,
has taken a stance regarding scoring guides that seems to endorse the virtues of the
kinds of hypergeneral rubrics currently being reviled here. Marzano (2008) has urged
educators to develop rubrics for each curricular aim at each grade level using a generic
rubric that can be “applied to all content areas” (pp. 10–11). However, for a rubric to be
applicable to all content areas, it must obviously be so general and imprecise that much
of its instructional meaning is likely to have been leached from the rubric.

The midpoint (anchor) of this sort of hypergeneral rubric calls for no major
errors or omissions on the student’s part regarding the simple or complex con-
tent and/or the procedures taught. If errors or omissions are present in either the
content or the procedures taught, these deficits lead to lower evaluations of the
student’s performance. If the student displays “in-depth inferences and applica-
tions” that go beyond what was taught, then higher evaluations of the student’s
performance are given. Loosely translated, these generic rubrics tell teachers that
a student’s “no-mistakes” performance is okay, a “some-mistakes” performance
is less acceptable, and a “better-than-no-mistakes” performance is really good.

When these very general rubrics are applied to the appraisal of students’ mas-
tery of given curricular goals, such rubrics’ level of generality offers scant instruc-
tional clarity to teachers about what’s important and what’s not. I have enormous
admiration for Bob Marzano’s contributions through the years, and I count him as
a friend. But in this instance, I fear my friend is pushing for a sort of far-too-general
scoring guide that won’t help teachers do a better instructional job.

Skill-Focused Rubrics
Well, because we have now denigrated task-specific rubrics and sneered at
hypergeneral rubrics, it’s only fair to wheel out the type of rubric that helps
score students’ responses, yet can also illuminate a teacher’s instructional plan-
ning. Such scoring guides can be described as skill-focused rubrics because
they really are conceptualized around the skill that is (1) being measured by the
constructed-response assessment and (2) being pursued instructionally by the
teacher. Many teachers who have used a variety of scoring rubrics report that
creating a skill-focused scoring rubric prior to their instructional planning nearly
always helps them devise a more potent instructional sequence.

For an example of an evaluative criterion you’d encounter in a skill-focused
rubric, look over the following illustrative criterion for evaluating the organiza-
tion of a student’s narrative essay:

A skill-focused evaluative criterion for judging the organization of
students’ narrative essays: Two aspects of organization will be employed in the
appraisal of students’ narrative essays: overall structure and sequence. To earn
maximum credit, an essay must embody an overall structure containing an in-
troduction, a body, and a conclusion. The content of the body of the essay must
be sequenced in a reasonable manner—for instance, in a chronological, logical,
or order-of-importance sequence.

Now, when you consider this evaluative criterion from an instructional
perspective, you’ll realize that a teacher could sensibly direct instruction by
(1) familiarizing students with the two aspects of the criterion—that is, over-
all structure and sequence; (2) helping students identify essays in which overall
structure and sequence are, or are not, acceptable; and (3) supplying students
with gobs of guided and independent practice in writing narrative essays that
exhibit the desired overall structure and sequence. The better students become
in employing this double-barreled criterion in their narrative essays, the better
those students will be at responding to a task such as the illustrative one we just
saw about the local firefighters’ visit to the classroom. This skill-focused evalua-
tive criterion for organization, in other words, can generalize to a wide variety of
narrative-writing tasks, not just an essay about visiting fire-folk.

And even if teachers do not create their own rubrics—for instance, if rubrics
are developed by a school district’s curriculum or assessment specialists—those
teachers who familiarize themselves with skill-focused rubrics in advance of
instructional planning will usually plan better instruction than will teachers who
aren’t familiar with a skill-focused rubric’s key features. Skill-focused rubrics
make clear what a teacher should emphasize instructionally when the teacher
attempts to promote students’ mastery of the skill being measured. Remember,
although students’ acquisition of knowledge is an important curricular aspiration
for teachers, assessment of students’ knowledge can obviously be accomplished
by procedures other than the use of performance tests.

Let’s now identify five rules you are encouraged to follow if you’re creating
your own skill-focused scoring rubric—a rubric you should generate before you
plan your instruction.

Rule 1: Make sure the skill to be assessed is significant. It takes time and trouble
to generate skill-focused rubrics. It also takes time and trouble to score
students’ responses by using such rubrics. Make sure the skill being pro-
moted instructionally, and scored via the rubric, is worth all this time
and trouble. Skills that are scored with skill-focused rubrics should rep-
resent demanding accomplishments by students, not trifling ones.

Teachers should not be ashamed to be assessing their students with
only a handful of performance tests, for example. It makes much more
sense to measure a modest number of truly powerful skills properly than
to do a shoddy job in measuring a shopping-cart full of rather puny skills.

Rule 2: Make certain that all the rubric’s evaluative criteria can be addressed instruction-
ally. This second rule calls for you always to “keep your instructional wits
about you” when generating a rubric. Most important, you must scrutinize
every potential evaluative criterion in a rubric to make sure you can actu-
ally teach students to master it.

This rule doesn’t oblige you to adhere to any particular instructional
approach. Regardless of whether you are wedded to the virtues of direct
instruction, indirect instruction, constructivism, or any other instruction-
al strategy, what you must be certain of is that students can be taught to
employ appropriately every evaluative criterion used in the rubric.

Rule 3: Employ as few evaluative criteria as possible. Because people who start
thinking seriously about a skill will often begin to recognize a host of
nuances associated with the skill, they sometimes feel compelled to set
forth a lengthy litany of evaluative criteria. But for instructional pur-
poses, as for many other missions, less here is more groovy than more.
Try to focus your instructional attention on three or four evaluative cri-
teria; you’ll become overwhelmed if you foolishly try to promote stu-
dents’ mastery of a dozen evaluative criteria.

Rule 4: Provide a succinct label for each evaluative criterion. The instructional yield
of a skill-focused rubric can be increased simply by giving each evaluative
criterion a brief explanatory label. For instance, suppose you were trying
to improve your students' oral presentation skills. You might employ a
skill-focused rubric for oral communication containing four evaluative
criteria—delivery, organization, content, and language. These one-word,
easy-to-remember labels will help remind you and your students of what’s
truly important in judging mastery of the skill being assessed.

Rule 5: Match the length of the rubric to your own tolerance for detail. During educators’
earliest experiences with scoring rubrics, those early evaluative schemes
were typically employed for use with high-stakes statewide or districtwide
assessments. Such detailed rubrics were often intended to constrain scorers
so their scores would be in agreement. But, as years went by, most educa-
tors discovered that many classroom teachers consider fairly long, rather
detailed rubrics to be altogether off-putting. Whereas those teachers might
be willing to create and/or use a one-page rubric, they regarded a six-page
rubric as wretchedly repugnant. More recently, therefore, we encounter rec-
ommendations for much shorter rubrics—rubrics that rarely exceed one or
two pages. In the generation of useful rubrics, brevity wins.

Of course, not all teachers are put off by very detailed rubrics. Some teachers,
in fact, find that abbreviated rubrics just don’t do an adequate job for them. Such
detail-prone teachers would rather work with rubrics that specifically spell out just
what’s involved when differentiating between quality levels regarding each evalua-
tive criterion. A reasonable conclusion, then, is that rubrics should be built to match
the level-of-detail preferences of the teachers involved. Teachers who believe in
brevity should create brief rubrics, and teachers who believe in detail should create
lengthier rubrics. If school districts (or publishers) are supplying rubrics to teachers,
ideally both a short and a long version will be provided. Let teachers decide on the
level of detail that best meshes with their own tolerance or need for detail.

So far, skill-focused rubrics have been touted in this chapter as the most use-
ful for promoting students’ mastery of really high-level skills. Such advocacy has
been present because of the considerable instructional contributions that can be
made by skill-focused rubrics. But here’s a requisite caution. If you are employing
a rubric that incorporates multiple evaluative criteria, beware of the ever-present
temptation to sum those separate per-criterion scores to come up with one, big
“overall” score. Only if the several evaluative criteria, when squished together, still
make interpretive or instructional sense should such scrunching be undertaken.

But please do not feel obliged to carve out a skill-focused rubric every time you
find yourself needing a rubric. There will be occasions when your instruction is not
aimed at high-level cognitive skills, but you still want your students to achieve what’s
being sought of them. To illustrate, you might want to determine your students’ mas-
tery of a recently enacted federal law—a law that will have a significant impact on
how American citizens elect their legislators. Well, suppose you decide to employ a
constructed-response testing strategy for assessing your students’ understanding of this
particular law. More specifically, you want your students to write an original, explana-
tory essay describing how certain of the law’s components might substantially influ-
ence U.S. elections. Because, to evaluate your students' essays, you'll need some sort
of scoring guide to assist you, what sort of rubric should you choose? Well, in that sort
of situation, it is perfectly acceptable to employ a task-specific rubric rather than a skill-
focused rubric. If there is no powerful, widely applicable cognitive skill being taught
(and assessed), then the scoring can properly be focused on the given task involved,
rather than on a more generally applicable skill.

Ratings and Observations
Once you’ve selected your evaluative criteria, you need to apply them reliably to
the judgment of students’ responses. If the nature of the performance-test task calls
for students to create some sort of product, such as a written report of an experi-
ment carried out in a biology class, then at your leisure you can rate the product’s
quality in relation to the criteria you’ve identified as important. For example, if
you had decided on three criteria to use in evaluating students’ reports of biology
experiments, and could award from 0 to 4 points for each criterion, then you could
leisurely assign from 0 to 12 points for each written report. The more clearly you
understand what each evaluative criterion is, and what it means to award differ-
ent numbers of points on whatever scale you’ve selected, the more accurate your
scores will be. Performance tests that yield student products are definitely easier
to rate, because you can rate students’ responses when you’re in the mood.
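
If you happen to keep such ratings in a spreadsheet or a short script rather than on
paper, the arithmetic is easy to automate. The minimal Python sketch below is only an
illustration of the 0-to-4-per-criterion, 0-to-12-total idea just described; the criterion
names are invented for the example, not drawn from any particular rubric.

```python
# A minimal sketch: three evaluative criteria, each rated 0-4, summed to a 0-12 total.
# The criterion names are illustrative, not drawn from any official rubric.

MAX_POINTS = 4
CRITERIA = ["experimental design", "analysis of results", "clarity of the report"]

def score_report(ratings: dict[str, int]) -> int:
    """Sum per-criterion ratings (0-4 each) into an overall 0-12 score."""
    total = 0
    for criterion in CRITERIA:
        points = ratings[criterion]
        if not 0 <= points <= MAX_POINTS:
            raise ValueError(f"{criterion!r} must be rated 0-{MAX_POINTS}, got {points}")
        total += points
    return total

# Example: one student's biology-experiment report.
ratings = {
    "experimental design": 3,
    "analysis of results": 4,
    "clarity of the report": 2,
}
print(score_report(ratings))  # -> 9 (out of a possible 12)
```

Keeping the per-criterion ratings alongside the total also preserves the diagnostic
detail that a single overall score can obscure.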

It is often the case with performance tests, however, that the student’s perfor-
mance takes the form of some kind of behavior. With such performance tests, it will
usually be necessary for you to observe the behavior as it takes place. To illustrate,
suppose that you are an elementary school teacher whose fifth-grade students have
been carrying out fairly elaborate social studies projects culminating in 15-minute
oral reports to classmates. Unless you have the equipment to videotape your stu-
dents’ oral presentations, you’ll have to observe the oral reports and make judg-
ments about the quality of a student’s performance as it occurs. As was true when
scores were given to student products, in making evaluative judgments about stu-
dents’ behavior, you will apply whatever criteria you’ve chosen and assign what you
consider to be the appropriate number of points on whatever scales you are using.

For some observations, you’ll find it sensible to make instant, on-the-spot
quality judgments. For instance, if you are judging students’ social studies reports
on the basis of (1) content, (2) organization, and (3) presentation, you might make
observation-based judgments on each of those three criteria as soon as a report is
finished. In other cases, your observations might incorporate a delayed evaluative
approach. For instance, let’s say that you are working with students in a speech
class on the elimination of “filler words and sounds,” two of the most prominent
of which are starting a sentence with “Well” and interjecting frequent “uh”s into
a presentation. In the nonevaluative phase of the observation, you could simply
count the number of “well”s and “uh”s uttered by a student. Then, at a later time,
you could decide on a point allocation for the criterion “avoids filler words and
sounds.” Putting it another way, systematic observations may be set up so you
make immediate or delayed allocations of points for the evaluative criteria you’ve
chosen. If the evaluative criteria involve qualitative factors that must be appraised
more judgmentally, then on-the-spot evaluations and point assignments are typi-
cally the way to go. If the evaluative criteria involve more quantitative factors,
then a “count now and judge later” approach usually works better.
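
If you record such tallies electronically, the "count now and judge later" routine can be
sketched in a few lines of Python. The filler-word list and the count-to-points bands
below are illustrative assumptions, not a recommended scale.

```python
# A sketch of delayed ("count now, judge later") evaluation for the criterion
# "avoids filler words and sounds." The filler list and the count-to-points
# bands are illustrative assumptions, not a recommended scale.

FILLERS = {"well", "uh", "um", "like"}

def count_fillers(transcript: str) -> int:
    """Nonevaluative phase: simply tally filler words in a transcript."""
    words = transcript.lower().replace(",", " ").replace(".", " ").split()
    return sum(1 for word in words if word in FILLERS)

def points_for_fillers(count: int) -> int:
    """Delayed evaluative phase: convert the tally into 0-4 points."""
    if count == 0:
        return 4
    if count <= 2:
        return 3
    if count <= 5:
        return 2
    if count <= 9:
        return 1
    return 0

speech = "Well, uh, photosynthesis is, like, how plants, uh, make their food."
tally = count_fillers(speech)
print(tally, points_for_fillers(tally))  # -> 4 2
```

The point of using two separate functions is simply to keep the nonevaluative counting
distinct from the later, evaluative point allocation.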

Sources of Error in Scoring Student
Performances
When scoring student performances, there are three common sources of error
that can contribute to inaccurate inferences. First, there is the scoring scale. Second,
there are the scorers themselves, who may bring a number of bothersome biases to
the enterprise. Finally, there are errors in the scoring procedure—that is, the process
by which the scorers employ the scoring scale.

Scoring-Instrument Flaws
The major defect with most scoring instruments is the lack of descriptive rigor
with which the evaluative criteria to be used are described. Given this lack of
rigor, ambiguity exists in the interpretations that scorers make about what the
scoring criteria mean. This typically leads to a set of unreliable ratings. For exam-
ple, if teachers are to rate students on the extent to which students are “control-
ling,” some teachers may view this as a positive quality and some may view it as
a negative quality. Clearly, an inadequately clarified scoring form can lead to all
sorts of “noise” in the scores provided by teachers.

Procedural Flaws
Among common problems with scoring students’ responses to performance
tests, we usually encounter demands on teachers to rate too many qualities.
Overwhelmed scorers are scorers rendered ineffectual. Teachers who opt for a
large number of evaluative criteria are teachers who have made a decisively inept
opt. Care should be taken that no more than three or four evaluative criteria are
to be employed in evaluations of students’ responses to your performance assess-
ments. Generally speaking, the fewer the evaluative criteria used, the better.

Teachers’ Personal-Bias Errors
If you recall Chapter 5’s consideration of assessment bias, you’ll remember that
bias is clearly an undesirable commodity. Teachers, albeit unintentionally, are
frequently biased in the way they score students’ responses. Several kinds of
personal-bias errors are often encountered when teachers score students’ con-
structed responses. The first of these, known as generosity error, occurs when a
teacher’s bias leads to higher ratings than are warranted. Teachers with a procliv-
ity toward generosity errors see good even where no good exists.

At the other extreme, some teachers display severity errors. A severity error,
of course, is a tendency to underrate the quality of a student’s work. When a stu-
dent’s product deserves a “good,” teachers suffering from this personal-bias error
will award the product only an “average” or even a “below average.”

Another sort of personal-bias error is known as central-tendency error. This
describes a tendency for teachers to view everything as being “in the middle
of the scale.” Very high or very low ratings are assiduously avoided by people
who exhibit central-tendency error. They prefer the warm fuzziness of the mean
or the median. They tend to regard midpoint ratings as inoffensive—hence they
dispense midpoint ratings perhaps thoughtlessly and even gleefully.

A particularly frequent error arises when a teacher’s overall impression of a
student influences how the teacher rates that student with respect to an individual
criterion. This error is known as the halo effect. If a teacher has a favorable attitude
toward a student, that student will often receive a host of positive ratings (deserved
or not) on a number of individual criteria. Similarly, if a teacher has an unfavorable
attitude toward a student, the student will receive a pile of negative ratings on all
sorts of separate criteria. One nifty way to dodge halo effect, if the teacher can pull
it off, is to score responses anonymously whenever this is a practical possibility.

But What Does This Have to Do with Teaching?
Teachers are busy people. (This perceptive insight
will not come as a surprise to anyone who really
knows what a teacher’s life is like.) But busy people
who survive in their task-buffeted lives will typically
try to manage their multiple responsibilities efficiently.
Busy people who are teachers, therefore, need to
make sure they don’t expend their finite reservoirs of
energy unwisely. And that’s where caution needs to
be exercised with respect to the use of performance
tests. Performance testing, you see, can be so very
seductive.

Because performance tests are typically focused
on measuring a student’s mastery of real-world,
significant skills, such tests are appealing. If you’re
an English teacher, for example, wouldn’t you
rather determine whether your students possess
composition skills by having them whip out actual
“written-from-scratch” compositions than by having
them merely spot punctuation problems in multiple-
choice items? Of course, you would.

But although the allure of performance
assessment is considerable, you should never
forget that it takes time! And the time-consumption
requirements of performance testing can readily eat
into a teacher’s daily 24-hour allotment. It takes time
for teachers to come up with defensible tasks for
performance tests, time to devise suitable rubrics for
scoring students’ responses, and, thereafter, time to
score those responses. Performance assessment
takes time. But, of course, proper performance
testing not only helps teachers aim their instruction
in the right directions; it also allows students to
recognize the nature of the skill(s) being promoted.

So, the trick is to select judiciously the cognitive
skills you will measure via performance tests. Focus
on a small number of truly significant skills. You’ll
be more likely to maintain your instructional sanity if
you employ only a handful of high-level performance
tests. Too many performance tests might push a
teacher over the edge.

One way to minimize halo effect, at least a bit, is occasionally to reverse the
order of the high and low positions on the scoring scale so the teacher cannot
unthinkingly toss out a whole string of positives (or negatives). What you really
need to do to avoid halo effect when you’re scoring students’ responses is to
remember it’s always lurking in the wings. Try to score a student’s responses
on each evaluative criterion by using that specific criterion, not a contaminated
general impression of the student’s ability.
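
Where students submit their responses electronically, even a very short script can
support the anonymous-scoring tactic suggested above. The sketch below is one way to do
it; the file layout and the column names ("student", "response") are hypothetical.

```python
# A sketch of anonymized scoring to blunt the halo effect: strip student names,
# shuffle the order, and keep a private key for matching scores back later.
# The file layout and column names ("student", "response") are hypothetical.
import csv
import random

def anonymize(in_path: str, out_path: str, key_path: str) -> None:
    with open(in_path, newline="") as src:
        rows = list(csv.DictReader(src))   # expects columns: student, response
    random.shuffle(rows)                   # break any name-based ordering
    with open(out_path, "w", newline="") as out, open(key_path, "w", newline="") as key:
        out_writer, key_writer = csv.writer(out), csv.writer(key)
        out_writer.writerow(["code", "response"])
        key_writer.writerow(["code", "student"])
        for i, row in enumerate(rows, start=1):
            code = f"R{i:03d}"
            out_writer.writerow([code, row["response"]])  # what the scorer sees
            key_writer.writerow([code, row["student"]])   # set aside until scoring ends

# anonymize("essays.csv", "essays_to_score.csv", "scoring_key.csv")
```

Scoring from the coded file, and consulting the key only after every response has been
rated, helps keep a general impression of the student from leaking into criterion-by-
criterion judgments.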

Thinking back to Chapter 5 and its treatment of educational fairness,
please recognize that in the scoring of students’ responses to any sort of
student-generated product or behavior, teachers are typically tromping around in
terrain that’s just brimming with the potential for unfairness. These are instances
in which teachers need to be especially on guard against unconsciously bring-
ing their personal biases into the scoring of students’ responses. Fairness is one
of educational testing’s big three requisites (along with reliability and validity).
When evaluating students’ performances, unfairness can frequently flower. One
of the most straightforward ways of spotting fairness-flaws in your teacher-
made tests is to mentally plop yourself into the test-taking seats of students who,
because they are different (according to the subgroup label they carry), might be
offended or unfairly penalized by what’s in your test. If you spot some potential
problems, then do something that you think has a reasonable chance of fixing the
problem. In short, put your unfairness-detection machinery into overdrive, and
then follow up this unfairness-detection effort with your best effort to fix it.

What Do Classroom Teachers Really
Need to Know About Performance
Assessment?
Performance assessment has been around for a long, long while. Yet, in recent
years, a growing number of educators have become strong supporters of
this form of assessment because it (1) represents an alternative to traditional
paper-and-pencil tests and (2) is often more authentic—that is, more reflective
of tasks people routinely need to perform in the real world. One of the things
you need to understand about performance assessment is that it differs from
more conventional assessment chiefly in the degree to which the assessment task
matches the skills and/or knowledge about which you wish to make inferences.
Because performance tasks often coincide more closely with high-level cognitive
skills than do paper-and-pencil tests, more accurate interpretations can often be
derived about students. Another big plus for performance tests is that they estab-
lish assessment targets which, because such targets often influence the teacher’s
instruction, can have a positive impact on instructional activities.

The chapter's final admonitions regarding the biases that teachers bring to
the scoring of students' performance-test responses should serve as a reminder.
If you employ performance tests frequently in your classrooms, you'll need to be
careful at every step of the process—from the original conception and birthing of
a performance test down to the bias-free scoring of students' responses. This is
not fool's play for sure. But, given the instructional targets on which many perfor-
mance tests are based, it can definitely be worth the effort. A first-rate educational
program can be rooted in a top-drawer assessment program—and this is often a
top-drawer performance-test assessment program.

Parent Talk

The vice president of your school's Parent Advisory
Council has asked you to tell him why so many
of your school's teachers are now assessing
students with performance tests instead of the more
traditional paper-and-pencil tests.

If I were you, here's how I'd respond:

"The reason why most of us are increasingly
relying on performance tests these days is
that performance tests almost always measure
higher-level student skills. With traditional tests,
too many of our teachers found that they were
inadvertently assessing only the students' abilities
to memorize facts.

"In recent years, educators have become
far more capable of devising really demanding
performance tests that require students to display
genuinely high-level intellectual skills—skills that are
way, way beyond memorization. We've learned how
to develop the tests and how to score students'
responses using carefully developed scoring rules
that we refer to as rubrics.

"Please stop by my classroom and I'll show you
some examples of these demanding performance
tests. We're asking more from our students, and—
we have evidence galore—that we're getting it."

Now, how would you respond to the vice
president of the Parent Advisory Council?

Chapter Summary

Although this chapter dealt specifically with
performance assessment, a number of the points
made in the chapter apply with equal force to the
scoring of any type of constructed-response items
such as those used in essay tests or short-answer
tests. After defining performance tests as a mea-
surement procedure in which students create
original responses to an assessment task, it was
pointed out that performance tests differ from
more conventional tests primarily in the degree
to which the test situation approximates the real-
life situations to which inferences are made.

The identification of suitable tasks for per-
formance assessments was given considerable
attention in the chapter because unsuitable tasks
will surely lead to unsatisfactory performance
assessments. Seven evaluative criteria were
supplied for performance-test tasks: (1) gener-
alizability, (2) authenticity, (3) multiple foci, (4)
teachability, (5) fairness, (6) feasibility, and (7)
scorability. Particular emphasis was given to
selecting tasks about which defensible inferences
could be drawn regarding students' generalized
abilities to perform comparable tasks.

The significance of the skill to be assessed via
a performance task was stressed. Next, evaluative
criteria were defined as the factors by which the
acceptability of a student's performance is judged.

The evaluative criteria constitute the most impor-
tant features of a rubric that’s employed to score
student responses. The significance of selecting
suitable evaluative criteria was emphasized. Once
the criteria have been identified, a numerical scor-
ing scale, usually consisting of from 3 to 6 score
points, is devised for each evaluative criterion.
The evaluative criteria are applied to student
performances in the form of ratings (for student
products) or observations (for student behaviors).

Distinctions were drawn among task-specific,
hypergeneral, and skill-focused rubrics. Deficits in
the former two types of rubrics render them less
appropriate for supporting a teacher’s classroom
instructional effort. A skill-focused rubric, how-
ever, can markedly enhance a teacher’s instruction.

References

Brookhart, S. (2013). How to create and use rubrics for formative assessment and
grading. ASCD.

Brookhart, S. (2023). Classroom assessment essentials. ASCD.

Chowdhury, F. (2019). Application of rubrics in the classroom: A vital tool for
improvement in assessment, feedback and learning. International Education Studies,
12(1), 61–68. https://files.eric.ed.gov/fulltext/EJ1201525

Education Week. (2019, February 25). K–12 performance assessment terms explained
[Video]. Retrieved September 20, 2022, from https://www.edweek.org/teaching-learning/video-k-12-performance-assessment-terms-explained/2019/02

Fitzpatrick, R., & Morrison, E. J. (1971). Performance and product evaluation. In
E. L. Thorndike (Ed.), Educational measurement (pp. 237–270). American Council on
Education.

Lane, S. (2013). Performance assessment. In J. H. McMillan (Ed.), SAGE handbook of
research on classroom assessment (pp. 313–329). SAGE Publications.

Marzano, R. J. (2008). Vision document. Marzano Research Laboratory.

McTighe, J., Doubet, K. J., & Carbaugh, E. M. (2020). Designing authentic performance
tasks and projects: Tools for meaningful learning and assessment. ASCD.

Sawchuk, S. (2019, February 6). Performance assessment: 4 best practices. Education
Week. Retrieved September 20, 2022, from https://www.edweek.org/teaching-learning/performance-assessment-4-best-practices/2019/02

Wormeli, R. (2018). Fair isn't always equal: Assessment & grading in the differentiated
classroom (2nd ed.). Stenhouse Publishers.

A Testing Takeaway

Performance Tests: Tension in the Assessment World*
W. James Popham, University of California, Los Angeles

Educators test their students to determine what those students know and can do. This
simplified depiction of educational assessment captures the rationale for teachers’ testing of
their students. Yet, as is the case with most simplifications, often lurking within such no-frills
explanations are issues of considerable complexity. In educational testing, one of these rarely
discussed complexities is the appropriate use of performance testing. Let’s briefly consider it.

Even though all educational tests require students to “perform” in some way, most
conventional test items ask students to respond (on paper or computer) to stimuli presented
by others. In contrast, the essential element of a performance test is that the test-taker is
required to construct an original response—either as a process that can be evaluated as it takes
place (for example, an impromptu speech) or as a product (for instance, a bookend made in
woodshop) that’s subsequently evaluated.

A performance test typically simulates the criterion situation; that is, it approximates the
real-world setting to which students must apply a skill. More conventional assessments, such
as paper-and-pencil tests and computer-administered exams, are clearly limited in the degree
to which they can approximate lifelike demands for students’ skills. Let’s turn, however, to
an emerging conflict about performance testing.

The most prominent reason for advocating more widespread performance testing is that
whatever is tested in our schools will almost certainly end up being taught in our classrooms.
If we can more frequently measure students’ mastery of a wide range of real-world skills, such
as composing persuasive essays or organizing a monthly budget, then more skills related to
such performance tests will surely be taught.

During recent decades, educators were sometimes asked to increase their performance
testing, but currently such demands seem to have slackened. This arises because whatever
the instructional dividends of performance tests, they are expensive to score. The more elaborate
the performance test, the more expensive it is to train and supervise scorers. Even though
computers can be “trained” to provide accurate scores for uncomplicated performance tasks,
for the foreseeable future we must often rely on costly human scorers.

Hence, the positives and negatives of such assessments:

• Pro: Performance tests can simulate real-world demands for school-taught skills.

• Con: Employing performance tests is both costly and time-consuming.

Because better performance testing leads to better-taught students, do what you can to
support funding for the wider use of performance tests. It’s an educationally sensible
investment.

*From Chapter 8 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants
permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally
shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.
