Essay

Article 00582973; instructions posted in hwessay.


Read the article entitled “Status Report on Software Measurement” by Shari Lawrence Pfleeger, Ross Jeffery, Bill Curtis, and Barbara Kitchenham, published in IEEE Software, and write a one- to two-page paper that covers the following:

1. List up to three ideas discussed that were new to you.

2. Identify anything that was unclear in the paper or that you didn’t understand.


3. List any ideas presented that you disagree with (and why).

4. What do you think is the most important point made in this paper (and why)?

Status Report on Software Measurement

SHARI LAWRENCE PFLEEGER, Systems/Software and Howard University
ROSS JEFFERY, University of New South Wales
BILL CURTIS, TeraQuest Metrics
BARBARA KITCHENHAM, Keele University

The most successful measurement programs are ones in which researcher, practitioner, and customer work hand in hand to meet goals and solve problems. But such collaboration is rare. The authors explore the gaps between these groups and point toward ways to bridge them.

In any scientific field, measurement generates quantitative descriptions of key processes and products, enabling us to understand behavior and results. This enhanced understanding lets us select better techniques and tools to control and improve our processes, products, and resources. Because engineering involves the analysis of measurements, software engineering cannot become a true engineering discipline unless we build a solid foundation of measurement-based theories.

One obstacle to building this base is the gap between measure-
ment research and measurement practice. This status report
describes the state of research, the state of the art, and the state of
practice of software measurement. It reflects discussion at the
Second International Software Metrics Symposium, which we
organized. The aim of the symposium is to encourage researchers
and practitioners to share their views, problems, and needs, and to
work together to define future activities that will address common
goals. Discussion at the symposium revealed that participants had
different and sometimes conflicting motivations.


♦ Researchers, many of whom are
in academic environments, are moti-
vated by publication. In many cases,
highly theoretical results are never
tested empirically, new metrics are
defined but never used, and new theo-

ries are promulgated but never exer-
cised and modified to fit reality.

♦ Practitioners want short-term,
useful results. Their projects are in
trouble now, and they are not always
willing to be a testbed for studies
whose results won’t be helpful until the
next project. In addition, practitioners
are not always willing to make their
data available to researchers, for fear
that the secrets of technical advantage
will be revealed to their competitors.

♦ Customers, who are not always
involved as development progresses,
feel powerless. They are forced to
specify what they need and then can
only hope they get what they want.

It is no coincidence that the most
successful examples of software mea-
surement are the ones where re-
searcher, practitioner, and customer
work hand in hand to meet goals and
solve problems. But such coordination
and collaboration are rare, and there
are many problems to resolve before
reaching that desirable and productive
state. To understand how to get there,
we begin with a look at the right and
wrong uses of measurement.

MEASUREMENT: USES AND ABUSES

Software measurement has existed
since the first compiler counted the

number of lines in a program listing.
As early as 1974, in an ACM Computing
Surveys article, Donald Knuth reported
on using measurement data to demon-
strate how Fortran compilers can be
optimized, based on actual language
use rather than theory. Indeed, mea-
surement has become a natural part of
many software engineering activities.

♦ Developers, especially those
involved in large projects with long
schedules, use measurements to help
them understand their progress toward
completion.

♦ Managers look for measurable
milestones to give them a sense of pro-
ject health and progress toward effort
and schedule commitments.

♦ Customers, who often have little
control over software production, look
to measurement to help determine the
quality and functionality of products.

♦ Maintainers use measurement to
inform their decisions about reusabili-
ty, reengineering, and legacy code
replacement.

Proper usage. IEEE Software and other
publications have many articles on how
measurement can help improve our
products, processes, and resources. For
example, Ed Weller described how
metrics helped to improve the inspec-
tion process at Honeywell;1 Wayne Lim
discussed how measurement supports
Hewlett-Packard’s reuse program, help-
ing project managers estimate module
reuse and predict the savings in
resources that result;2 and Michael
Daskalantonakis reported on the use of
measurement to improve processes at
Motorola.3 In each case, measurement
helped make visible what is going on in
the code, the development processes,
and the project team.

For many of us, measurement has
become standard practice. We use
structural-complexity metrics to target
our testing efforts, defect counts to
help us decide when to stop testing, or
failure information and operational
profiles to assess code reliability. But

we must be sure that the measurement
efforts are consonant with our project,
process, and product goals; otherwise,
we risk abusing the data and making
bad decisions.

Real-world abuses. For a look at how dissonance in these goals can create problems, consider an example described by Michael Evangelist.4 Suppose you measure program size using lines of code or Halstead measures (measures based on the number of operators and operands in a program). In both cases, common wisdom suggests that module size be kept small, as short modules are easier to understand than large ones. Moreover, as size is usually the key factor in predicting effort, small modules should take less time to produce than large ones. However, this metrics-driven approach can lead to increased effort during testing or maintenance. For example, consider the following code segment:

FOR i = 1 to n DO
READ (x[i])

Clearly, this code is designed to
read a list of n things. But Brian
Kernighan and William Plauger, in
their classic book The Elements of
Programming Style, caution program-
mers to terminate input by an end-of-
file or marker, rather than using a
count. If a count ends the loop and the
set being read has more or fewer than
n elements, an error condition can
result. A simple solution to this prob-
lem is to code the read loop like this:

i = 1
WHILE NOT EOF DO
    READ (x[i])
    i := i + 1
END

This improved code is still easy to
read but is not subject to the counting
errors of the first code. On the other
hand, if we judge the two pieces of
code in terms of minimizing size, then

the first code segment is better than
the second. Had standards been set
according to size metrics (as some-
times happens), the programmer could
have been encouraged to keep the
code smaller, and the resulting code
would have been more difficult to test
and maintain.
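
To make the trade-off concrete, here is a minimal Python sketch of the two reading strategies (the function names and sample input are illustrative, not from Evangelist's example); the count-controlled version is smaller by any size metric, but the end-of-file version is the one that behaves sensibly when the input does not match the expected count.

    # Illustrative sketch: count-controlled vs. sentinel (EOF)-controlled reading.
    import io

    def read_counted(stream, n):
        """Count-controlled read: assumes exactly n values are present."""
        values = []
        for _ in range(n):
            values.append(stream.readline())   # returns '' past EOF instead of stopping
        return values

    def read_until_eof(stream):
        """Sentinel-controlled read: stops at end-of-file, whatever the count."""
        return [line for line in stream]

    print(read_counted(io.StringIO("10\n20\n"), 3))   # ['10\n', '20\n', ''] -- silent bad data
    print(read_until_eof(io.StringIO("10\n20\n")))    # ['10\n', '20\n']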

Another abuse can occur when you
use process measures. Scales such as
the US Software Engineering
Institute’s Capability Maturity Model
can be used as an excuse not to imple-
ment an activity. For example, man-
agers complain that they cannot insti-
tute a reuse program because they are
only a level 1 on the maturity scale.
But reuse is not prohibited at level 1;
the CMM suggests that such practices
are a greater risk if basic project disci-
plines (such as making sensible com-
mitments and managing product base-
lines) have not been established. If pro-
ductivity is a particular project goal,
and if a rich code repository exists
from previous projects, reuse may be
appropriate and effective regardless of
your organization’s level.

Roots of abuse. In each case, it is not
the metric but the measurement
process that is the source of the abuse:
The metrics are used without keeping
the development goals in mind. In the
code-length case, the metrics should be
chosen to support goals of testability
and maintainability. In the CMM case,
the goal is to improve productivity by
introducing reuse. Rather than prevent
movement, the model should suggest
which steps to take first.

Thus, measurement, as any technol-
ogy, must be used with care. Any appli-
cation of software measurement should
not be made on its own. Rather, it
should be an integral part of a general
assessment or improvement program,
where the measures support the goals
and help to evaluate the results of the
actions. To use measurement properly,
we must understand the nature and
goals of measurement itself.

MEASUREMENT THEORY

One way of distinguishing between
real-world objects or entities is to
describe their characteristics. Measure-
ment is one such description. A mea-
sure is simply a mapping from the real,
empirical world to a mathematical
world, where we can more easily
understand an entity’s attributes and
relationship to other entities. The diffi-
culty is in how we interpret the mathe-
matical behavior and judge what it
means in the real world.

None of these notions is particular
to software development. Indeed, mea-
surement theory has been studied for
many years, beginning long before
computers were around. But the issues
of measurement theory are very impor-
tant in choosing and applying metrics
to software development.

Scales. Measurement theory holds, as
a basic principle, that there are several
scales of measurement—nominal, ordi-
nal, interval, and ratio—and each cap-
tures more information than its prede-
cessor. A nominal scale puts items into
categories, such as when we identify a
programming language as Ada, Cobol,
Fortran, or C++. An ordinal scale ranks
items in an order, such as when we
assign failures a progressive severity like
minor, major, and catastrophic.

An interval scale defines a distance
from one point to another, so that
there are equal intervals between con-
secutive numbers. This property per-
mits computations not available with
the ordinal scale, such as calculating the
mean. However, there is no absolute
zero point in an interval scale, and thus
ratios do not make sense. Care is thus
needed when you make comparisons.
The Celsius and Fahrenheit tempera-
ture scales, for example, are interval, so
we cannot say that today’s 30-degree
Celsius temperature is twice as hot as
yesterday’s 15 degrees.

The scale with the most information
and flexibility is the ratio scale, which

incorporates an absolute zero, preserves
ratios, and permits the most sophisti-
cated analysis. Measures such as lines of
code or numbers of defects are ratio
measures. It is for this scale that we can
say that A is twice the size of B.

The importance of measurement
type to software measurement rests in
the types of calculations you can do
with each scale. For example, you can-
not compute a meaningful mean and
standard deviation for a nominal scale;
such calculations require an interval or
ratio scale. Thus, unless we are aware
of the scale types we use, we are likely
to misuse the data we collect.
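
As a rough illustration of how scale type constrains analysis, the following Python sketch (not from the article; the sample values are invented) maps each scale to the summary statistics it supports.

    # Scale hierarchy and the summary statistics each scale permits.
    from statistics import mode, median, mean

    PERMISSIBLE = {
        "nominal":  {mode},                  # categories: language = Ada, Cobol, ...
        "ordinal":  {mode, median},          # ranked: minor < major < catastrophic
        "interval": {mode, median, mean},    # equal intervals, no true zero (Celsius)
        "ratio":    {mode, median, mean},    # true zero; ratios also meaningful (LOC, defects)
    }

    def summarize(values, scale):
        """Apply only the statistics that are meaningful for the given scale."""
        return {stat.__name__: stat(values) for stat in PERMISSIBLE[scale]}

    print(summarize([3, 5, 5, 9], "ordinal"))   # mode and median only; no mean reported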

Researchers such as Norman Fenton
and Horst Zuse have worked extensive-
ly in applying measurement theory to
proposed software metrics. Among the
ongoing questions is whether popular
metrics such as function points are
meaningful, in that they include unac-
ceptable computations for their scale
types. There are also questions about
what entity function points measure.

Validation. We validate measures so
we can be sure that the metrics we use
are actually measuring what they claim
to measure. For example, Tom McCabe
proposed that we use cyclomatic num-
ber, a property of a program’s control-
flow graph, as a measure of testing complexity. Many researchers are careful to
state that cyclomatic number is a mea-
sure of structural complexity, but it does
not capture all aspects of the difficulty
we have in understanding a program.
Other examples include Ross Jeffery’s
study of programs from a psychological
perspective, which applies notions

about how much we can track and
absorb as a way of measuring code
complexity. Maurice Halstead claimed
that his work, too, had psychological
underpinnings, but the psychological
basis for Halstead’s “software science”

measures have been soundly debunked
by Neil Coulter.5 (Bill Curtis and his
colleagues at General Electric found,
however, that Halstead’s count of
operators and operands is a useful mea-
sure of program size.6)
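
For readers unfamiliar with McCabe's metric, the cyclomatic number is conventionally computed from the control-flow graph as V(G) = E − N + 2P; the short Python sketch below (with an invented graph) shows the calculation, though the article's point stands that this captures only structural complexity.

    # McCabe's cyclomatic number, V(G) = E - N + 2P, from an edge list.
    def cyclomatic_number(edges, num_nodes, num_components=1):
        """edges: list of (from, to) pairs; num_components: connected components (P)."""
        return len(edges) - num_nodes + 2 * num_components

    # if/else over four nodes: entry -> then | else -> exit  =>  4 edges, 4 nodes, V(G) = 2
    print(cyclomatic_number([(0, 1), (0, 2), (1, 3), (2, 3)], num_nodes=4))  # 2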

We say that a measure is valid if it
satisfies the representation condition: if
it captures in the mathematical world
the behavior we perceive in the empiri-
cal world. For example, we must show
that if H is a measure of height, and if A
is taller than B, then H(A) is larger than
H(B). But such a proof must by its
nature be empirical and it is often diffi-
cult to demonstrate. In these cases, we
must consider whether we are measur-
ing something with a direct measure
(such as size) or an indirect measure
(such as using the number of decision
points as a measure of size) and what
entity and attribute are being addressed.
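
Stated formally, using the height example above, the representation condition is usually written as a two-way requirement: the measure must preserve the empirical relation in the numbers, and the numerical relation must reflect the empirical one.

\[
  \forall A, B:\quad A \;\text{is taller than}\; B \iff H(A) > H(B)
\]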

Several attempts have been made to
list a set of rules for validation. Elaine
Weyuker suggested rules for validating
complexity,7 and Austin Melton and
his colleagues have proffered a similar,
general list for the behavior of all met-
rics.8 However, each of these frame-
works has been criticized and there is
not yet a standard, accepted way of val-
idating a measure.

The notion of validity is not specific
to software engineering, and general
concepts that we rarely consider—such
as construct validity and predictive
validity—should be part of any discus-
sion of software engineering measure-
ment. For example, Kitchenham,
Pfleeger, and Fenton have proposed a
general framework for validating soft-
ware engineering metrics based on mea-
surement theory and statistical rules.9

Apples and oranges. Measurement the-
ory and validation should not distract us
from the considerable difficulty of mea-
suring software in the field. A major dif-
ficulty is that we often try to relate mea-
sures of a physical object (the software)
with human and organizational behav-
iors, which do not follow physical laws.

Consider, for example, the capability
maturity level as a measure. The matu-
rity level reflects an organization’s soft-
ware development practices and is pur-
ported to predict an organization’s abil-
ity to produce high-quality software on
time. But even if an organization at
level 2 can be determined, through
extensive experimentation, to produce
better software (measured by fewer
delivered defects) than a level 1 organi-
zation, it doesn’t hold that all level 2
organizations develop software better
than level 1 organizations. Some
researchers welcome the use of capabil-
ity maturity level as a predictor of the
likelihood (but not a guarantee) that a
level n organization will be better than
a level n−1. But others insist that, for
CMM level to be a measure in the mea-
surement theory sense, level n must
always be better than level n−1.

Still, a measure can be useful as a
predictor without being valid in the
sense of measurement theory. More-
over, we can gather valuable informa-
tion by applying—even to heuristics—
the standard techniques used in other
scientific disciplines to assess association
by analyzing distributions. But let’s
complicate this picture further. Suppose
we compare a level 3 organization that

is constantly developing different and
challenging avionics systems with a level
2 organization that develops versions of
a relatively simple Cobol business appli-
cation. Obviously, we are comparing
sliced apples with peeled oranges, and
the domain, customer type, and many
other factors moderate the relationships
we observe.

This situation reveals problems not
with the CMM as a measure, but with
the model on which the CMM is based.
We begin with simple models that pro-
vide useful information. Sometimes
those models are sufficient for our
needs, but other times we must extend
the simple models in order to handle
more complex situations. Again, this
approach is no different from other sci-
ences, where simple models (of molec-
ular structure, for instance) are expand-
ed as scientists learn more about the
factors that affect the outcomes of the
processes they study.

State of the gap. In general, measure-
ment theory is getting a great deal of
attention from researchers but is being
ignored by practitioners and customers, who rely on empirical evidence
of a metric’s utility regardless of its sci-
entific grounding.

Researchers should work closely
with practitioners to understand the
valid uses and interpretations of a soft-
ware measure based on its measure-
ment-theoretic attributes. They should
also consider model validity separate
from measurement validity, and devel-
op more accurate models on which to
base better measures. Finally, there is
much work to be done to complete a
framework for measurement valida-
tion, as well as to achieve consensus
within the research community on the
framework’s accuracy and usefulness.

MEASUREMENT MODELS

A measurement makes sense only
when it is associated with one or more

models. One essential model tells us
the domain and range of the measure
mapping; that is, it describes the entity
and attribute being measured, the set
of possible resulting measures, and the
relationships among several measures
(such as productivity is equal to size
produced per unit of effort). Models
also distinguish prediction from assess-
ment; we must know whether we are
using the measure to estimate future
characteristics from previous ones
(such as effort, schedule, or reliability
estimation) or determining the current
condition of a process, product, or
resource (such as assessing defect den-
sity or testing effectiveness).

There are also models to guide us in
deriving and applying measurement. A
commonly used model of this type is
the Goal-Question-Metric paradigm
suggested by Vic Basili and David
Weiss (and later expanded by Basili
and Dieter Rombach).10 This approach
uses templates to help prospective
users derive measures from their goals
and the questions they must answer
during development. The template
encourages the user to express goals in
the following form:

Analyze the [object] for the purpose
of [purpose] with respect to [focus]
from the viewpoint of [viewpoint] in
the [environment].

For example, an XYZ Corporation
manager concerned about overrunning
the project schedule might express the
goal of “meeting schedules” as

Analyze the project for the purpose of
control with respect to meeting sched-
ules from the viewpoint of the project
manager in the XYZ Corporation.

From each goal, the manager can
derive questions whose answers will help
determine whether the goal has been
met. The questions derived suggest met-
rics that should be used to answer the
questions. This top-down derivation
assists managers and developers not only
in knowing what data to collect but also

in understanding the type of analysis
needed when the data is in hand.
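
A minimal sketch of the template as a data structure may also help; the goal, question, and metric strings below are illustrative (the schedule goal follows the XYZ example above), and the point is only that metrics fall out of questions, which fall out of goals.

    # Goal-Question-Metric template as a simple data structure.
    from dataclasses import dataclass, field

    @dataclass
    class Goal:
        object: str
        purpose: str
        focus: str
        viewpoint: str
        environment: str
        questions: dict = field(default_factory=dict)   # question -> metrics that answer it

        def statement(self):
            return (f"Analyze the {self.object} for the purpose of {self.purpose} "
                    f"with respect to {self.focus} from the viewpoint of "
                    f"{self.viewpoint} in the {self.environment}.")

    schedule_goal = Goal("project", "control", "meeting schedules",
                         "project manager", "XYZ Corporation")
    schedule_goal.questions["Are intermediate milestones being met?"] = [
        "planned vs. actual date per milestone", "slippage per phase (days)"]

    print(schedule_goal.statement())
    for question, metrics in schedule_goal.questions.items():
        print(question, "->", metrics)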

Some practitioners, such as Bill
Hetzel, encourage a bottom-up ap-
proach to metrics application, where
organizations measure what is avail-
able, regardless of goals.11 Other mod-
els include Ray Offen and Jeffery’s
M3P model derived from business
goals described on page 45 and the
combination of goal-question-metric
and capability maturity built into the
European Community’s ami project
framework.

Model research. Experimentation
models of measurement are essential
for case studies and experiments for
software engineering research. For
example, an organization may build
software using two different tech-
niques: one a formal method, another
not. Researchers would then evaluate
the resulting software to see if one
method produced higher quality soft-
ware than the other.

An experimentation model de-
scribes the hypothesis being tested, the
factors that can affect the outcome, the
degree of control over each factor, the
relationships among the factors, and
the plan for performing the research
and evaluating the outcome. To
address the lack of rigor in software
experimentation, projects such as the
UK’s Desmet—reported upon exten-
sively in ACM Software Engineering
Notes beginning in October of 1994—
have produced guidelines to help soft-
ware engineers design surveys, case
studies, and experiments.

Model future. As software engineers,
we tend to neglect models. In other sci-
entific disciplines, models act to unify
and explain, placing apparently disjoint
events in a larger, more understandable
framework. The lack of models in soft-
ware engineering is symptomatic of a
much larger problem: a lack of systems
focus. Few software engineers under-
stand the need to define a system

boundary or explain how one system
interacts with another. Thus, research
and practice have a very long way to go
in exploring and exploiting what mod-
els can do to improve software products
and processes.

MEASURING THE PROCESS

For many years, computer scientists
and software engineers focused on
measuring and understanding code. In
recent years—as we have come to
understand that product quality is evi-
dence of process success—software
process issues have received much
attention. Process measures include
large-grain quantifications, such as the
CMM scale, as well as smaller-grain
evaluations of particular process activi-
ties, such as test effectiveness.

Process perspective. Process research
can be viewed from several perspec-
tives. Some process researchers devel-
op process description languages, such
as the work done on the Alf (Esprit)
project. Here, measurement supports
the description by counting tokens that
indicate process size and complexity.
Other researchers investigate the actu-
al process that developers use to build

software. For example, early work by
Curtis and his colleagues at MCC
revealed that the way we analyze and
design software is typically more itera-
tive and complex than top-down.12

Researchers also use measurement
to help them understand and improve existing processes.

MORE INFORMATION ABOUT MEASUREMENT

Books
♦ D. Card and R. Glass, Measuring Software Design

Complexity, Prentice Hall, Englewood Cliffs, N.J., 1991.
♦ S.D. Conte, H.E. Dunsmore, and V.Y. Shen, Software

Engineering Metrics and Models, Benjamin Cummings, Menlo
Park, Calif., 1986.

♦ T. DeMarco, Controlling Software Projects, Dorset
House, New York, 1982.

♦ J.B. Dreger, Function Point Analysis, Prentice Hall,
Englewood Cliffs, N.J., 1989.

♦ N. Fenton and S.L. Pfleeger, Software Metrics: A
Rigorous and Practical Approach, second edition, International
Thomson Press, London, 1996.

♦ R.B. Grady, Practical Software Metrics for Project
Management and Process Improvement, Prentice Hall,
Englewood Cliffs, N.J., 1992.

♦ R.B. Grady and D.L. Caswell, Software Metrics:
Establishing a Company-Wide Program, Prentice Hall,
Englewood Cliffs, N.J., 1987.

♦ T.C. Jones, Applied Software Measurement: Assuring
Productivity and Quality, McGraw Hill, New York, 1992.

♦ K. Moeller and D.J. Paulish, Software Metrics: A
Practitioner’s Guide to Improved Product Development, IEEE
Computer Society Press, Los Alamitos, Calif., 1993.

♦ P. Oman and S.L. Pfleeger, Applying Software Metrics,
IEEE Computer Society Press, Los Alamitos, Calif., 1996.

Journals
♦ IEEE Software (Mar. 1991 and July 1994, special issues

on measurement; January 1996, special issue on software
quality)

♦ Computer (September 1994, special issue on product
metrics)

♦ IEEE Transactions on Software Engineering
♦ Journal of Systems and Software
♦ Software Quality Journal
♦ IEE Journal
♦ IBM Systems Journal
♦ Information and Software Technology
♦ Empirical Software Engineering: An International Journal

Key Journal Articles
♦ V.R. Basili and H.D. Rombach, “The TAME Project:

Towards Improvement-Oriented Software Environments,”
IEEE Trans. Software Eng., Vol. 14, No. 6, 1988, pp. 758-773.

♦ C. Billings et al., “Journey to a Mature Software Pro-
cess,” IBM Systems Journal, Vol. 33, No. 1, 1994, pp. 46-61.

♦ B. Curtis, “Measurement and Experimentation in
Software Engineering,” Proc. IEEE, Vol. 68, No. 9, 1980, pp.
1144-1157.

♦ B. Kitchenham, L. Pickard, and S.L. Pfleeger, “Using
Case Studies for Process Improvement,” IEEE Software,
July 1995.

♦ S.L. Pfleeger, “Experimentation in Software
Engineering,” Annals Software Eng., Vol. 1, No. 1, 1995.

♦ S.S. Stevens, “On the Theory of Scales of
Measurement,” Science, No. 103, 1946, pp. 677-680.

Conferences
♦ Applications of Software Measurement, sponsored by

Software Quality Engineering, held annually in Florida and
California on alternate years. Contact: Bill Hetzel, SQE,
Jacksonville, FL, USA.

♦ International Symposium on Software Measurement,
sponsored by IEEE Computer Society (1st in Baltimore,
1993, 2nd in London, 1994, 3rd in Berlin, 1996; 4th is
upcoming in Nov. 1997 in Albuquerque, New Mexico); pro-
ceedings available from IEEE Computer Society Press.
Contact: Jim Bieman, Colorado State University, Fort
Collins, CO, USA.

♦ Oregon Workshop on Software Metrics Workshop,
sponsored by Portland State University, held annually near
Portland, Oregon. Contact: Warren Harrison, PSU,
Portland, OR, USA.

♦ Minnowbrook Workshop on Software Performance
Evaluation, sponsored by Syracuse University, held each
summer at Blue Mountain Lake, NY. Contact: Amrit Goel,
Syracuse University, Syracuse, NY, USA.

♦ NASA Software Engineering Symposium, sponsored
by NASA Goddard Space Flight Center, held annually at the
end of November or early December in Greenbelt,
Maryland; proceedings available. Contact: Frank McGarry,
Computer Sciences Corporation, Greenbelt, MD, USA.

♦ CSR Annual Workshop, sponsored by the Centre for
Software Reliability, held annually at locations throughout
Europe; proceedings available. Contact: Bev Littlewood,
Centre for Software Reliability, City University, London,
UK.

Organizations
♦ Australian Software Metrics Association. Contact:

Mike Berry, School of Information Systems, University of
New South Wales, Sydney 2052, Australia.

♦ Quantitative Methods Committee, IEEE Computer
Society Technical Council on Software Engineering.
Contact: Jim Bieman, Department of Computer Science,
Colorado State University, Fort Collins, CO, USA.

♦ Centre for Software Reliability. Contact: Bev
Littlewood, CSR, City University, London, UK.

♦ Software Engineering Laboratory. Contact: Vic Basili,
Department of Computer Science, University of Maryland,
College Park, MD, USA.

♦ SEI Software Measurement Program. Contact: Anita
Carleton, Software Engineering Institute, Carnegie Mellon
University, Pittsburgh, PA, USA.

♦ Applications of Measurement in Industry (ami) User
Group. Contact: Alison Rowe, South Bank University,
London, UK.

♦ International Society of Parametric Analysts. Contact:
J. Clyde Perry & Associates, PO Box 6402, Chesterfield, MO
63006-6402, USA.

♦ International Function Point Users Group. Contact:
IFPUG Executive Office, Blendonview Office Park, 5008-28
Pine Creek Drive, Westerville, OH 43081-4899, USA.



A good example is
an ICSE 1994 report in which Larry
Votta, Adam Porter, and Basili report-
ed that scenario-based inspections
(where each inspector looked for a par-
ticular type of defect) produced better
results than ad hoc or checklist-based
inspections (where each inspector
looks for any type of defect).13 Basili
and his colleagues at the NASA
Software Engineering Laboratory con-
tinue to use measurement to evaluate
the impact of using Ada, cleanroom,
and other technologies that change the
software development process. Billings
and his colleagues at Loral (formerly
IBM) are also measuring their process
for building space shuttle software.

Remeasuring. The reuse community
provides many examples of process-
related measurement as it tries to
determine how reuse affects quality
and productivity. For example, Wayne
Lim has modeled the reuse process and
suggested measurements for assessing
reuse effectiveness.14 Similarly, Shari
Lawrence Pfleeger and Mary Theo-
fanos have combined process maturity
concepts with a goal-question-metric
approach to suggest metrics to instru-
ment the reuse process.15

Reengineering also offers opportuni-
ties to measure process change and its
effects. At the 1994 International
Software Engineering Research
Network meeting, an Italian research
group reported on their evaluation of a
large system reengineering project. In
the project, researchers kept an extensive
set of measurements to track the impact
of the changes made as a banking appli-
cation’s millions of lines of Cobol code
were reengineered over a period of
years. These measures included the sys-
tem structure and the number of help
and change requests. Measurement let
the team evaluate the success and pay-
back of the reengineering process.

Process problems. Use of these and
other process models and measurements

raises several problems. First, large-
grained process measures require valida-
tion, which is difficult to do. Second,
project managers are often intimidated
by the effort required to track process
measures throughout development.
Individual process activities are usually
easier to evaluate, as they are smaller and
more controllable. Third, regardless of
the granularity, process measures usually
require an underlying model of how
they interrelate; this model is usually
missing from process understanding and
evaluation, so the results of research are
difficult to interpret. Thus, even as
attention turns increasingly to process in
the larger community, process measure-
ment research and practice lag behind
the use of other measurements.

MEASURING THE PRODUCTS

Because products are more concrete
than processes and resources and are
thus easier to measure, it is not surpris-
ing that most measurement work is
directed in this area. Moreover, cus-
tomers encourage product assessment
because they are interested in the final
product’s characteristics, regardless of
the process that produced it. As a
result, we measure defects (in specifica-
tion, design, code, and test cases) and
failures as part of a broader program to
assess product quality. Quality frame-
works, such as McCall’s or the pro-
posed ISO 9126 standard, suggest ways
to describe different aspects of product
quality, such as distinguishing usability
from reliability from maintainability.

Measuring risk. Because failures are
the most visible evidence of poor quali-
ty, reliability assessment and prediction
have received much attention. There
are many reliability models, each
focused on using operational profile
and mean-time-to-failure data to pre-
dict when the next failure is likely to
occur. These models are based on
probability distributions, plus assumptions about whether new defects are
introduced when old ones are repaired.
However, more work is required both
in making the assumptions realistic and
in helping users select appropriate
models. Some models are accurate
most of the time, but there are no
guarantees that a particular model will
perform well in a particular situation.
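
As a deliberately simplified illustration of the kind of calculation such models perform (this is not one of the published models), the sketch below treats inter-failure times as exponentially distributed, estimates the current mean time to failure, and predicts the chance of surviving a given mission time; real reliability-growth models weight recent data and model how repairs change the failure rate.

    # Simplified reliability sketch: exponential inter-failure times.
    import math

    def mttf(interfailure_times):
        """Maximum-likelihood MTTF for an exponential model: the sample mean."""
        return sum(interfailure_times) / len(interfailure_times)

    def prob_no_failure(interfailure_times, mission_hours):
        """P(next failure occurs after mission_hours) under the exponential model."""
        return math.exp(-mission_hours / mttf(interfailure_times))

    recent = [30.0, 45.0, 52.0, 70.0, 90.0]        # hours between successive failures
    print(round(mttf(recent), 1))                  # 57.4
    print(round(prob_no_failure(recent, 24), 2))   # ~0.66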

Most developers and customers do
not want to wait until delivery to
determine if the code is reliable or
maintainable. As a result, some practi-
tioners measure defects as evidence of
code quality and likely reliability. Ed
Adams of IBM showed the dangers of
this approach. He used IBM operating
system data to show that 80 percent of
the reliability problems were caused by
only 2 percent of the defects.16
Research must be done to determine
which defects are likely to cause the
most problems, as well as to prevent
such problems before they occur.

Early measurement. Earlier life-cycle
products have also been the source of
many measurements. Dolly Samson
and Jim Palmer at George Mason
University have produced tools that
measure and evaluate the quality of
informal, English-language require-
ments; these tools are being used by the
US Federal Bureau of Investigation and

the Federal Aviation Authority on pro-
jects where requirements quality is
essential. Similar work has been pur-
sued by Anthony Finkelstein’s and
Alistair Sutcliffe’s research groups at
City University in London. Suzanne Robertson and Shari Pfleeger are cur-
rently working with the UK Ministry
of Defence to evaluate requirements
structure as well as quality, so require-

ments volatility and likely reuse can be
assessed. However, because serious
measurement of requirements attribut-
es is just beginning, very little require-
ments measurement is done in practice.

Design and code. Researchers and
practitioners have several ways of evalu-
ating design quality, in the hope that
good design will yield good code. Sallie
Henry and Dennis Kafura at Virginia
Tech proposed a design measure based
on the fan-in and fan-out of modules.
David Card and Bill Agresti worked
with NASA Goddard developers to
derive a measure of software design
complexity that predicts where code
errors are likely to be. But many of the
existing design measures focus on func-
tional descriptions of design; Shyam
Chidamber and Chris Kemerer at MIT
have extended these types of measures
to object-oriented design and code.

The fact that code is easier to mea-
sure than earlier products does not
prevent controversy. Debate continues
to rage over whether lines of code are a
reasonable measure of software size.
Bob Park at SEI has produced a frame-
work that organizes the many decisions
involved in defining a lines-of-code
count, including reuse, comments, exe-
cutable statements, and more. His
report makes clear that you must know
your goals before you design your
measures. Another camp measures
code in terms of function points,
claiming that such measures capture

the size of functionality from the speci-
fication in a way that is impossible for
lines of code. Both sides have valid
points, and both have attempted to
unify and define their ideas so that
counting and comparing across organi-
zations is possible. However, practi-
tioners and customers have no time to
wait for resolution. They need mea-
sures now that will help them under-
stand and predict likely effort, quality,
and schedule.

Thus, as with other types of mea-
surement, there is a large gap between
the theory and practice of product
measurement. The practitioners and
customers know what they want, but
the researchers have not yet been able
to find measures that are practical, sci-
entifically sound (according to mea-
surement theory principles), and cost-
effective to capture and analyze.

MEASURING THE RESOURCES

For many years, some of our most
insightful software engineers (includ-
ing Jerry Weinberg, Tom DeMarco,
Tim Lister, and Bill Curtis) have
encouraged us to look at the quality
and variability of the people we employ
for the source of product variations.
Some initial measurement work has
been done in this area.

DeMarco and Lister report in
Peopleware on an IBM study which
showed that your surroundings—such
as noise level, number of interruptions,
and office size—can affect the produc-
tivity and quality of your work.
Likewise, a study by Basili and David
Hutchens suggests that individual vari-
ation accounts for much of the differ-
ence in code complexity;17 these results
support a 1979 study by Sylvia
Sheppard and her colleagues at ITT,
showing that the average time to locate
a defect in code is not related to years
of experience but rather to breadth of
experience. However, there is relative-
ly little attention being paid to human

resource measurement, as developers
and managers find it threatening.

Nonhuman resources. More attention
has been paid to other resources: bud-
get and schedule assessment, and effort,
cost, and schedule prediction. A rich
assortment of tools and techniques is
available to support this work, includ-
ing Barry Boehm’s Cocomo model,
tools based on Albrecht’s function
points model, Larry Putnam’s Slim
model, and others. However, no model
works satisfactorily for everyone, in
part because of organizational and pro-
ject differences, and in part because of
model imperfections. June Verner and
Graham Tate demonstrated how tailor-
ing models can improve their perfor-
mances significantly. Their 4GL modi-
fication of an approach similar to func-
tion points was quite accurate com-
pared to several other alternatives.18,19
Barbara Kitchenham’s work on the
Mermaid Esprit project demonstrated
how several modeling approaches can
be combined into a larger model that is
more accurate than any of its compo-
nent models.20 And Boehm is updating
and improving his Cocomo model to
reflect advances in measurement and
process understanding, with the hope
of increasing its accuracy.21

Shaky models. The state of the prac-
tice in resource measurement lags far
behind the research. Many of the
research models are used once, publi-
cized, and then die. Those models that
are used in practice are often imple-
mented without regard to the underly-
ing theory on which they are built. For
example, many practitioners implement
Boehm’s Cocomo model, using not
only his general approach but also his
cost factors (described in his 1981
book, Software Engineering Economics).
However, Boehm’s cost factor values
are based on TRW data captured in the
1970s and are irrelevant to other envi-
ronments, especially given the radical
change in development techniques and tools since Cocomo was developed.
Likewise, practitioners adopt the equa-
tions and models produced by Basili’s
Software Engineering Laboratory, even
though the relationships are derived
from NASA data and are not likely to
work in other environments. The
research community must better com-
municate to practitioners that it is the
techniques that are transferable, not the
data and equations themselves.
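
The point is easy to see in the basic Cocomo effort equation, effort = a × KLOC^b, sketched below with Boehm's 1981 coefficients: the model form transfers, but the constants were calibrated to the original TRW data and should be refit from local project data.

    # Basic COCOMO effort equation with Boehm's 1981 coefficients (recalibrate locally).
    COEFFICIENTS = {                # (a, b) per project mode
        "organic":      (2.4, 1.05),
        "semidetached": (3.0, 1.12),
        "embedded":     (3.6, 1.20),
    }

    def basic_cocomo_effort(kloc, mode="organic"):
        """Estimated effort in person-months for a project of `kloc` thousand lines."""
        a, b = COEFFICIENTS[mode]
        return a * kloc ** b

    print(round(basic_cocomo_effort(32, "organic"), 1))   # ~91 person-months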

STORING, ANALYZING, AND REPORTING THE MEASUREMENTS

Researchers and practitioners alike
often assume that once they choose the
metrics and collect the data, their mea-
surement activities are done. But the
goals of measurement—understanding
and change—are not met until they
analyze the data and implement change.

Measuring tools. In the UK, a team
led by Kitchenham has developed a
tool that helps practitioners choose
metrics and builds a repository for the
collected data. Amadeus, an American
project funded by the Advanced
Research Projects Agency, has some of
the same goals; it monitors the devel-
opment process and stores the data for
later analysis. Some Esprit projects are
working to combine research tools into
powerful analysis engines that will help
developers manipulate data for decision
making. For example, Cap Volmac in
the Netherlands is leading the Squid
project to build a comprehensive soft-
ware quality assessment tool.

It is not always necessary to use
sophisticated tools for metrics collec-
tion and storage, especially on projects
just beginning to use metrics. Many
practitioners use spreadsheet software,
database management systems, or off-
the-shelf statistical packages to store
and analyze data. The choice of tool
depends on how you will use the mea-
surements. For many organizations,

simple analysis techniques such as scat-
ter charts and histograms provide use-
ful information about what is happen-
ing on a project or in a product. Others
prefer to use statistical analysis, such as
regression and correlation, box plots,
and measures of central tendency and
dispersion. More complex still are clas-
sification trees, applied by Adam
Porter and Rick Selby to determine
which metrics best predict quality or
productivity. For example, if module
quality can be assessed using the num-
ber of defects per module, then a clas-
sification tree illustrates which of the
metrics collected predicts modules
likely to have more than a threshold
number of defects.22
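
A rough sketch of this idea using an off-the-shelf decision-tree learner (scikit-learn rather than Porter and Selby's own tooling; the module metrics, threshold, and data are invented) looks like this:

    # Metric-based classification tree: flag modules likely to exceed a defect threshold.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # per-module metrics: [lines of code, cyclomatic number, fan-out]
    X = [[120, 4, 2], [850, 22, 9], [300, 9, 4], [1400, 35, 12], [90, 3, 1], [600, 18, 7]]
    defects = [1, 9, 2, 14, 0, 7]
    THRESHOLD = 5                                # "high-defect" if more than 5 defects
    y = [int(d > THRESHOLD) for d in defects]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["loc", "cyclomatic", "fan_out"]))
    print(tree.predict([[700, 20, 8]]))          # flag a new module as likely high-defect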

Process measures are more difficult
to track, as they often require trace-
ability from one product or activity to
another. In this case, databases of
traceability information are needed,
coupled with software to track and
analyze progress. Practitioners often
use their configuration management
system for these measures, augmenting
existing configuration information
with measurement data.

Analyzing tools. For storing and ana-
lyzing large data sets, it is important to
choose appropriate analysis techniques.
Population dynamics and distribution
are key aspects of this choice. When
sampling from data, it is essential that
the sample be representative so that
your judgments about the sample apply
to the larger population. It is equally
important to ensure that your analysis
technique is suitable for the data’s dis-
tribution. Often, practitioners use a
technique simply because it is available
on their statistical software packages,
regardless of whether the data is dis-
tributed normally or not. As a result,
invalid parametric techniques are used
instead of the more appropriate non-
parametric ones. Many of the paramet-
ric techniques are robust enough to be
used with nonnormal distributions, but
you must verify this robustness.
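
A small sketch of that check, with invented data, using scipy rather than any particular statistics package: test the distribution first, and fall back to a nonparametric technique when normality is doubtful.

    # Choose a parametric or nonparametric comparison based on a normality check.
    from scipy import stats

    before = [12, 15, 9, 30, 11, 14, 45, 10]     # e.g., defects per KLOC under two processes
    after  = [8, 9, 7, 11, 6, 10, 9, 12]

    def is_normal(sample, alpha=0.05):
        _, p = stats.shapiro(sample)             # Shapiro-Wilk normality test
        return p > alpha

    if all(is_normal(s) for s in (before, after)):
        _, p = stats.ttest_ind(before, after)    # parametric
    else:
        _, p = stats.mannwhitneyu(before, after) # nonparametric fallback
    print(p)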

Applying the appropriate statistical
techniques to the measurement scale is
even more important. Measures of
central tendency and dispersion differ
with the scale, as do appropriate trans-
formations. You can use mode and fre-
quency distributions to analyze nomi-
nal data that describes categories, but
you cannot use means and standard
deviations. With ordinal data—where
an order is imposed on the categories—you can use medians, maxima,
and minima for analysis. But you can
use means, standard deviations, and
more sophisticated statistics only when
you have interval or ratio data.

Presentation. Presenting measure-
ment data so that customers can
understand it is problematic because
metrics are chosen based on business
and development goals and the data is
collected by developers. Typically, cus-
tomers are not experts in software
engineering; they want a “big picture”
of what the software is like, not a large
vector of measures of different aspects.
Hewlett-Packard has been successful in
using Kiviat diagrams (sometimes
called radar graphs) to depict multiple
measures in one picture, without losing
the integrity of the individual mea-
sures. Similarly, Contel used multiple
metrics graphs to report on software
switch quality and other characteristics.
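
A Kiviat (radar) chart of this kind is straightforward to produce; the sketch below uses matplotlib's polar axes, with invented metric names and scores.

    # Kiviat (radar) chart: several normalized measures in one picture.
    import math
    import matplotlib.pyplot as plt

    metrics = ["reliability", "usability", "maintainability", "reuse", "defect density"]
    scores  = [0.8, 0.6, 0.7, 0.4, 0.9]           # each normalized to 0..1

    angles = [2 * math.pi * i / len(metrics) for i in range(len(metrics))]
    angles += angles[:1]                          # close the polygon
    scores += scores[:1]

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(angles, scores)
    ax.fill(angles, scores, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    plt.savefig("kiviat.png")                     # one picture, without losing the individual measures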

Measurement revisited. A relatively
new area of research is the packaging of
previous experience for use by new
development and maintenance projects.
Since many organizations produce new
software that is similar to their old soft-
ware or developed using similar tech-
niques, they can save time and money by capturing experience for reuse at a
later time. This reuse involves not only
code but also requirements, designs,
test cases, and more. For example, as
part of its Experience Factory effort,
the SEL is producing a set of docu-
ments that suggests how to introduce
techniques and establish metrics pro-
grams. Guillermo Arango’s group at
Schlumberger has automated this expe-
rience capture in a series of “project
books” that let developers call up
requirements, design decisions, mea-
surements, code, and documents of all
kinds to assist in building the next ver-
sion of the same or similar product.23

Refining the focus. In the past, mea-
surement research has focused on met-
ric definition, choice, and data collec-
tion. As part of a larger effort to exam-
ine the scientific bases for software
engineering research, attention is now
turning to data analysis and reporting.

Practitioners continue to use what is
readily available and easy to use, regard-
less of its appropriateness. This is in

part the fault of researchers, who have
not described the limitations of and
constraints on techniques put forth for
practical use.

Finally, the measurement commu-
nity has yet to deal with the more
global issue of technology transfer. It is
unreasonable for us to expect practi-
tioners to become experts in statistics,
probability, or measurement theory, or
even in the intricacies of calculating
code complexity or modeling parame-
ters. Instead, we need to encourage
researchers to fashion results into tools
and techniques that practitioners can
easily understand and apply.

Just as we preach the need for measurement goals, so too must we base our activities on customer goals.
As practitioners and customers cry out
for measures early in the development
cycle, we must focus our efforts on
measuring aspects of requirements
analysis and design. As our customers
request measurements for evaluating
commercial off-the-shelf software, we

must provide product metrics that sup-
port such purchasing decisions. And as
our customers insist on higher levels of
reliability, functionality, usability,
reusability, and maintainability, we
must work closely with the rest of the
software engineering community to
understand the processes and resources
that contribute to good products.

We should not take the gap
between measurement research and
practice lightly. During an open-mike
session at the metrics symposium, a
statistician warned us not to become
like the statistics community, which he
characterized as a group living in its
own world with theories and results
that are divorced from reality and use-
less to those who must analyze and
understand them. If the measurement
community remains separate from
mainstream software engineering, our
delivered code will be good in theory
but not in practice, and developers will
be less likely to take the time to mea-
sure even when we produce metrics
that are easy to use and effective.


REFERENCES

1. E.F. Weller, “Using Metrics to Manage Software Projects,” Computer, Sept. 1994, pp. 27-34.
2. W.C. Lim, “Effects of Reuse on Quality, Productivity, and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
3. M.K. Daskalantonakis, “A Practical View of Software Measurement and Implementation Experiences within Motorola,” IEEE Trans. Software Eng., Vol. 18, No. 11, 1992, pp. 998-1010.
4. W.M. Evangelist, “Software Complexity Metric Sensitivity to Program Restructuring Rules,” J. Systems Software, Vol. 3, 1983, pp. 231-243.
5. N. Coulter, “Software Science and Cognitive Psychology,” IEEE Trans. Software Eng., Vol. 9, No. 2, 1983, pp. 166-171.
6. B. Curtis et al., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics,” IEEE Trans. Software Eng., Vol. 5, No. 2, 1979, pp. 96-104.
7. E.J. Weyuker, “Evaluating Software Complexity Measures,” IEEE Trans. Software Eng., Vol. 14, No. 9, 1988, pp. 1357-1365.
8. A.C. Melton et al., “Mathematical Perspective of Software Measures Research,” Software Eng. J., Vol. 5, No. 5, 1990, pp. 246-254.
9. B. Kitchenham, S.L. Pfleeger, and N. Fenton, “Toward a Framework for Measurement Validation,” IEEE Trans. Software Eng., Vol. 21, No. 12, 1995, pp. 929-944.
10. V.R. Basili and D. Weiss, “A Methodology For Collecting Valid Software Engineering Data,” IEEE Trans. Software Eng., Vol. 10, No. 3, 1984, pp. 728-738.
11. W. Hetzel, Making Software Measurement Work: Building an Effective Software Measurement Program, QED Publishing, Boston, 1993.
12. B. Curtis, H. Krasner, and N. Iscoe, “A Field Study of the Software Design Process for Large Systems,” Comm. ACM, Nov. 1988, pp. 1268-1287.


13. A.A. Porter, L.G. Votta, and V.R. Basili, “An Experiment to Assess Different Defect Detection Methods for Software Requirements Inspections,” Proc. 16th
Int’l Conf. Software Eng., 1994, pp. 103-112.

14. W.C. Lim “Effects of Reuse on Quality, Productivity and Economics,” IEEE Software, Sept. 1994, pp. 23-30.
15. M.F. Theofanos and S.L. Pfleeger, “A Framework for Creating a Reuse Measurement Plan,” tech. report 1231/D2, Martin Marietta Energy Systems, Data Systems Research and Development Division, Oak Ridge, Tenn., 1993.
16. E. Adams, “Optimizing Preventive Service of Software Products,” IBM J. Research and Development, Vol. 28, No. 1, 1984, pp. 2-14.
17. V.R. Basili and D.H. Hutchens, “An Empirical Study of a Syntactic Complexity Family,” IEEE Trans. Software Eng., Vol. 9, No. 6, 1983, pp. 652-663.
18. J.M. Verner and G. Tate, “Estimating Size and Effort in Fourth-Generation Language Development,” IEEE Software, July 1988, pp. 173-177.
19. J. Verner. and G. Tate, “A Software Size Model,” IEEE Trans. Software Eng., Vol. 18, No. 4, 1992, pp. 265-278.
20. B.A. Kitchenham, P.A.M. Kok, and J. Kirakowski, “The Mermaid Approach to Software Cost Estimation,” Proc. Esprit, Kluwer Academic Publishers, Dordrecht, the Netherlands, 1990, pp. 296-314.
21. B.W. Boehm et al., “Cost Models for Future Life Cycle Processes: COCOMO 2.0,” Annals Software Eng. Nov. 1995, pp. 1-24.
22. A. Porter and R. Selby, “Empirically Guided Software Development Using Metric-Based Classification Trees,” IEEE Software, Mar. 1990, pp. 46-54.
23. G. Arango, E. Schoen, and R. Pettengill, “Design as Evolution and Reuse,” in Advances in Software Reuse, IEEE Computer Society Press, Los Alamitos, Calif., March 1993, pp. 9-18.

Shari Lawrence Pfleeger is director of the Center for
Research in Evaluating Software Technology
(CREST) at Howard University in Washington, DC.
The Center establishes partnerships with industry and
government to evaluate the effectiveness of software
engineering techniques and tools. She is also president
of Systems/Software Inc., a consultancy specializing in
software engineering and technology evaluation.
Pfleeger is the author of several textbooks and dozens
of articles on software engineering and measurement.
She is an associate editor-in-chief of IEEE Software and

is an advisor to IEEE Spectrum. Pfleeger is a member of the executive commit-
tee of the IEEE Technical Council on Software Engineering, and the program
cochair of the next International Symposium on Software Metrics in
Albuquerque, New Mexico.

Pfleeger received a PhD in information technology and engineering from
George Mason University. She is a member of the IEEE and ACM.

Ross Jeffery is a professor of information systems and
director of the Centre for Advanced Empirical Software
Research at the University of New South Wales,
Australia. His research interests include software engi-
neering process and product modeling and improve-
ment, software metrics, software technical and manage-
ment reviews, and software resource modeling. He is on
the editorial board of the IEEE Transactions on Software
Engineering, the Journal of Empirical Software
Engineering, and the editorial board of the Wiley
International Series on information systems. He is also a

founding member of the International Software Engineering Research Network.

Bill Curtis is co-founder and chief scientist of
TeraQuest Metrics in Austin, Texas where he works
with organizations to increase their software develop-
ment capability. He is a former director of the
Software Process Program in the Software Engineering
Institute at Carnegie Mellon University, where he is a
visiting scientist. Prior to joining the SEI, Curtis
worked at MCC, ITT’s Programming Technology
Center, GE’s Space Division, and the University of
Washington. He was a founding faculty member of the
Software Quality Institute at the University of Texas.

He is co-author of the Capability Maturity Model for software and the principal
author of the People CMM. He is on the editorial boards of seven technical
journals and has published more than 100 technical articles on software engi-
neering, user interface, and management.

Barbara Kitchenham is a principal researcher in soft-
ware engineering at Keele University. Her main inter-
est is in software metrics and their support for project
and quality management. She has written more than 40
articles on the topic as well as the book Software Metrics:
Measurement for Software Process Improvement. She spent
10 years working for ICL and STC, followed by two
years at City University and seven years at the UK
National Computing Centre, before joining Keele in
February 1996.

Kitchenham received a PhD from Leeds University.

Address questions about this article to Pfleeger at CREST, Howard University Department of Systems and Computer Science, Washington, DC 20059;
s.pfleeger@ieee.org.

