Application 2 – Annotated Bibliography

As part of your doctoral seminar for this set of weeks, you are participating in a seminar-style discussion about the weekly topics. Recall that you were asked to address 5 of the Required Resources and at least 5 additional resources from the Walden Library and to incorporate them into your posting. As a related exercise, submit an annotated bibliography of the 10 resources you referred to this week. For each entry, be sure to address the following as a minimum:

Include the full APA citation
Discuss the scope of the resource
Discuss the purpose and philosophical approach
Discuss the underlying assumptions
If referring to a research reporting article, present the methodology
Relate the resource to the body of resources you have consulted in this course
Discuss any evident limitations and opportunities for further inquiry

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010

ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

Different Forms of Software Testing Techniques for Finding
Errors

Mohd. Ehmer Khan

Department of Information Technology
Al Musanna College of Technology, Sultanate of Oman

Abstract
Software testing is an activity which is aimed for evaluating an
attribute or capability of a program and ensures that it meets
the required result. There are many approaches to software
testing, but effective testing of complex product is essentially a
process of investigation, not merely a matter of creating and
following route procedure. It is often impossible to find all the
errors in the program. This fundamental problem in testing
thus throws open question, as to what would be the strategy
that we should adopt for testing. Thus, the selection of right
strategy at the right time will make the software testing
efficient and effective. In this paper I have described software
testing techniques which are classified by purpose.
Keywords: Correctness Testing, Performance Testing,
Reliability Testing, Security

Testing

1. Introduction

Software testing is a set of activities conducted with the
intent of finding errors in software. It also verifies and
validate whether the program is working correctly with
no bugs or not. It analyzes the software for finding bugs.
Software testing is not just used for finding and fixing of
bugs but it also ensures that the system is working
according to the specifications. Software testing is a
series of process which is designed to make sure that the
computer code does what it was designed to do.
Software testing is a destructive process of trying to find
the errors. The main purpose of testing can be quality
assurance, reliability estimation, validation or
verification. The other objectives or software testing
includes. [6][7][8]

 The better it works the more efficiently it can
be tested.

 Better the software can be controlled more the

testing can be automated and optimized.

 The fewer the changes, the fewer the disruption
to testing.

 A successful test is the one that uncovers an

undiscovered error.

 Testing is a process to identify the correctness

and completeness of the software.

 The general objective of software testing is to
affirm the quality of software system by
systematically exercising the software in
carefully controlled circumstances.

Classified by purpose software testing can be divided
into [4]

1. Correctness Testing
2. Performance Testing
3. Reliability Testing
4. Security Testing

2. Software Testing Techniques

Software testing is a process which is used to measure
the quality of software developed. It is also a process of
uncovering errors in a program and makes it a feasible
task. It is useful process of executing program with the
intent of finding bugs. The diagram below represents
some of the most prevalent techniques of software
testing which are classified by purpose. [4]

Fig. 1 Represent different software testing techniques which are
classified by purpose

SOFTWARE
TESTING

Correctness
Testing

Security
Testing

Performance
Testing

Reliability
Testing

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

2.1 Correctness Testing

The most essential purpose of testing is correctness
which is also the minimum requirement of software.
Correctness testing tells the right behavior of system
from the wrong one for which it will need some type of
Oracle. Either a white box point of view or black box
point of view can be taken in testing software as a tester
may or may not know the inside detail of the software
module under test. For e.g. Data flow, Control flow etc.
The ideas of white box, black box or grey box testing are
not limited to correctness testing only. [4]

Fig. 2 Represent various form of correctness testing

2.1.1 White Box Testing

White box testing based on an analysis of internal
working and structure of a piece of software. White box
testing is the process of giving the input to the system
and checking how the system processes that input to
generate the required output. It is necessary for a tester
to have the full knowledge of the source code. White
box testing is applicable at integration, unit and system
levels of the software testing process. In white box
testing one can be sure that all parts through the test
objects are properly executed. [2][10]

Fig. 3 Represent working process of White Box Testing

Some synonyms of white box testing are [5]

 Logic Driven Testing

 Design Based Testing
 Open Box Testing
 Transparent Box Testing
 Clear Box Testing
 Glass Box Testing
 Structural Testing

Some important types of white box testing techniques
are:

1. Control Flow Testing
2. Branch Testing
3. Path Testing
4. Data flow Testing
5. Loop Testing

There are some pros & cons of white box testing-

Pros-

1. Side effects are beneficial.

2. Errors in hidden codes are revealed.

3. Approximate the partitioning done by execution

equivalence.

4. Developer carefully gives reason about
implementation.

Cons-

1. It is very expensive.

2. Missed out the cases omitted in the code.

2.1.2 Black Box Testing

Basically Black box testing is an integral part of
‘Correctness testing’ but its ideas are not limited to
correctness testing only. Correctness testing is a method
which is classified by purpose in software testing.

Black box testing is based on the analysis of the
specifications of a piece of software without reference to
its internal working. The goal is to test how well the
component conforms to the published requirement for
the component. Black box testing have little or no regard
to the internal logical structure of the system, it only
examines the fundamental aspect of the system. It makes
sure that input is properly accepted and output is
correctly produced. In black box testing, the integrity of
external information is maintained. The black box
testing methods in which user involvement is not
required are functional testing, stress testing, load
testing, ad-hoc testing, exploratory testing, usability
testing, smoke testing, recovery testing and volume
testing, and the black box testing techniques where user
involvement is required are user acceptance testing,

White Box
Testing

Black Box
Testing

GREY
BOX

TESTING

CORRECTNESS
TETSING

INPUT System PROCESS OUTPUT
Analyze
Internal
Working

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

alpha testing and beta testing. Other types of Black box
testing methods includes graph based testing method,
equivalence partitioning, boundary value analysis,
comparison testing, orthogonal array testing, specialized
testing, fuzz testing, and traceability metrics. [2]

Fig. 4 Represent working process of Black Box Testing

There are various pros and cons of black box testing- [5]

Pros-

1. Black box tester has no “bond” with the code.

2. Tester perception is very simple.

3. Programmer and tester both are independent of

each other.

4. More effective on larger units of code than
clear box testing.

Cons-

1. Test cases are hard to design without clear
specifications.

2. Only small numbers of possible input can

actually be tested.

3. Some parts of the back end are not tested at all.

2.1.3 Grey Box Testing

Grey box testing techniques combined the testing
methodology of white box and black box. Grey box
testing technique is used for testing a piece of software
against its specifications but using some knowledge of
its internal working as well. [2]

Grey box testing may also include reverse engineering to
determine, for instance, boundary values or error
messages. Grey box testing is a process which involves
testing software while already having some knowledge
of its underline code or logic. The understanding of
internals of the program in grey box testing is more than
black box testing, but less than clear box testing. [11]

2.2 Performance Testing

‘Performance Testing’ involve all the phases as the
mainstream testing life cycle as an independent
discipline which involve strategy such as plan, design,

execution, analysis and reporting. This testing is
conducted to evaluate the compliance of a system or
component with specified performance requirement. [2]
Evaluation of a performance of any software system
includes resource usage, throughput and stimulus
response time.

By performance testing we can measure the
characteristics of performance of any applications. One
of the most important objectives of performance testing
is to maintain a low latency of a website, high
throughput and low utilization. [5]

Fig. 5 Represent two types of performance testing

Some of the main goals of performance testing are: [5]

 Measuring response time of end to end

transactions.

 Measurement of the delay of network between

client and server.

 Monitoring of system resources which are

under various loads.

Some of the common mistakes which happen during
performance testing are: [5]

 Ignoring of errors in input.
 Analysis is too complex.
 Erroneous analysis.
 Level of details is inappropriate.
 Ignore significant factors.
 Incorrect Performance matrix.
 Important parameter is overlooked.
 Approach is not systematic.

There are seven different phases in performance testing
process: [5]

 Phase 1 – Requirement Study
 Phase 2 – Test plan
 Phase 3 – Test Design
 Phase 4 –

Scripting

 Phase 5 – Test

Execution

 Phase 6 – Test

Analysis

PERFORMANCE
TESTING

Load
Testing

Stress
Testing

INPUT System PROCESS OUTPUT
Analyze only
fundamental

aspects

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

 Phase 7 – Preparation of Report

Fig. 6 Represent Performance Testing Process

Typically to debug applications, developers would
execute their applications using different execution
stream. Which are completely exercised the applications
in an attempt to find errors. Performance testing is
secondary issue when looking for errors in the
applications but, however, it is still an issue.

There are two kinds of performance testing:

2.2.1 Load Testing

Load Testing is an industry term for the effort of
performance testing. The main feature of the load testing
is to determine whether the given system is able to

handle the anticipated no. of users or not. This can be
done by making the virtual user to exhibit as real user so
that it will be easy to perform load testing. It is carried
only to check whether the system is performing well or
not. The main objective of load testing is to check
whether the system can perform well for specified user
or not. Load testing increases the up time for critical
web applications by helping us to spot the bottle necks
in the system which is under large user stress.

Load testing is also used for checking an application
against heavy load or inputs such as testing of website in
order to find out at what point the website or
applications fails or at what point its performance
degrades. [2][5]

Two ways for implementing load testing are

1. Manual Testing: It is not a very practical option as it

is very iterative in nature and it involves [5]
 Measure response time
 Compare results

2. Automated Testing: As compared to manual load
testing the automated load testing tools provide
more efficient and cost effective solutions. Because
with automated load testing, tools test can easily be
rerun any number of times and decreases the
chances of human error during testing. [5]

2.2.2 Stress Testing

We can define stress testing as performing random
operational sequence, at larger than normal volume, at
faster than normal speed and for longer than normal
periods of time, as a method to accelerate the rate of
finding defects and verify the robustness of our product,
or we can say stress testing is a testing, which is
conducted to evaluate a system or component at or
beyond the limits of its specified requirements to
determine the load under which it fails and how. Stress
testing also determines the behaviour of the system as
user base increases. In stress testing the application is
tested against heavy loads such as large no. of inputs,
large no. of queries, etc. [2] [5]

There are some weak and strong points of stress testing.

Weak Points

1. Not able to test the correctness of a system.
2. Defects are reproducible.
3. Not representing real world situation.

Strong Points

1. No other type of test can find defect as stress
testing.

2. Robustness of application is tested.

Requirement
Collection

Preparation of
Plan

Designing

Scripting

Preparation of
Report

Execution
Analysis

Is Goal
Achieved?

YES

Final Report is

Prepared

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

3. Very helpful in finding deadlocks.

2.3 Reliability Testing

Fig. 7 Represent Reliability testing

‘Reliability Testing’ is very important, as it discover all
the failures of a system and removes them before the
system is deployed. Reliability testing is related to many
aspects of software in which testing process is included;
this testing process is an effective sampling method to
measure software reliability. Estimation model is
prepared in reliability testing which is used to analyze
the data to estimate the present and predict future
reliability of software. [4][2]

Depending on that estimation, the developers can decide
whether to release the software or not and the end user
will decide whether to adopt that software or not.

Based on reliability information, the risk of using
software can also be assessed. Robustness testing and
stress testing are the variances of reliability testing. By
Robustness we mean how software component works
under stressful environmental conditions. Robustness
testing only watches the robustness problem such as
machine crashes, abnormal terminations etc. Robustness
testing is very portable and scalable. [4]

2.4 Security Testing

Security Testing: ‘Security testing’ makes sure that only
the authorized personnel can access the program and
only the authorized personnel can access the functions
available to their security level. Security testing of any
developed system or (system under development) is all
about finding the major loopholes and weaknesses of a
system which can cause major harm to the system by an
authorized user. [1][2]

Security testing is very helpful for the tester for finding
and fixing of problems. It ensures that the system will
run for a ling time without any major problem. It also
ensures that the systems used by any organization are
secured from any unauthorized attack. In this way,
security testing is beneficial for the organization in all
aspects. [1][2]

Five major concepts which are covered by security
testing are

 Confidentiality: By security testing, we will ensure

the confidentiality of the system i.e. no disclosure of
the information to the unknown party other than
intended recipient.

 Integrity: By security testing, we will maintain the

integrity of the system by allowing the receiver to
determine that the information which he is getting is
correct.

 Authentication: Security testing maintains the

authentications of the system and WPA, WPA2,
WEP are several forms of authentication.

 Availability: Information is always kept available

for the authorized personnel whenever they needed
and assures that information services will be ready
for use whenever expected.

 Authorization: Security testing ensures that only the

authorized user can access the information or
particular service. Access control is an example of
authorization.

Fig. 8 Represent various type of security testing

Different types of security testing in any organization
are as follows: [3]

1. Security Auditing and Scanning: Security
Auditing includes direct inspection of the
operating system and of the system on which
it is developed. In Security Scanning the
auditor scan the operating system and then
tries to find out the weaknesses in the
operating and network.

SECURITY
TESTING

Security
Scanning

Security
Auditing

Vulnerability
Scanning

Posture
Assessment
& Security

Testing

Risk
Assessment

Penetration
Testing

Ethical
Hacking

Reliability
Testing

Robustness
Testing

IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

2. Vulnerability Scanning: Various vulnerability

scanning software performs Vulnerability
Scanning, which involves the scanning of the
program for all known vulnerability.

3. Risk Assessment: Risk Assessment is a

method in which the auditors analyze the risk
involved with any system and all the
probability of loss which occurs because of
that risk. It is analyzed through interviews,
discussions, etc.

4. Posture Assessment and Security Testing:

Posture Assessment and Security Testing help
the organization to know where it stands in
context of security by combining the features
of security scanning, risk assessment and
ethical hacking.

5. Penetration Testing: Penetration Testing is an

effective way to find out the potential
loopholes in system and it is done by a tester
which forcibly enters into the application
under test. A tester enters into the system with
the help of combination of loopholes that the
application has kept open unknowingly.

6. Ethical Hacking: Ethical Hacking involves

large no. of penetration test on a system under
test. To stop the forced entry of any external
elements into a system which is under security
testing.

3. Conclusion

Software testing is an important technique for the
improvement and measurement of a software system
quality. But it is really not possible to find out all the
errors in the program. So, the fundamental question
arises, which strategy we would adopt to test. In my
paper, I have described some of the most prevalent and
commonly used strategies of software testing which are
classified by purpose and they are classified into [5]

1. Correctness testing, which is used to test the
right behavior of the system and it is further
divided into black box, white box and grey box
testing techniques (combines the features of
black box and white box testing).

2. Performance testing, which is an independent

discipline and involves all the phases as the
main stream testing life cycle i.e. strategy, plan,
design, execution, analysis and reporting.
Performance testing is further divided into load
testing and stress testing.

3. Reliability testing, which discovers all the failure of

the system and removes them before the system
deployed.

4. Security testing makes sure that only the authorized

personnel can access the system and is further
divided into Security Auditing and Scanning,
Vulnerability Scanning, Risk Assessment, Posture
Assessment and Security Testing, Penetration
Testing and Ethical Hacking.

The successful use of these techniques in industrial
software development will validate the results of the
research and drive future research. [8]

References:

[1] Software testing-Brief introduction to security

testing by Nilesh Parekh published on 14-07-2006
available at http://www.buzzle.com/editorial/7-14-
2006-102344.asp

[2] Software testing glossary available at
http://www.aptest.com/glossary.html#performance
testing

[3] Open source security testing methodology manual of
PETE HERZOG and the institute for security and
open methodology-ISECOM.

[4] Software testing by Jiantao Pan available at
http://www.ece.cmu.edu/~roopman/des-
899/sw_testing/

[5] Software Testing by Cognizant Technology Solution.
[6] Introduction to software testing available at

http://www.onestoptetsing.com/introduction/
[7] Software testing techniques available at

http://pesona.mmu.edu.my/~wruslan/SE3/Readings/
GB1/pdf/ch14-GB1

[8] Paper by Lu Luo available at
http://www.cs.cmu.edu/~luluo/Courses/17939Report

[9] Security testing-wikipedia the free encyclopedia
available at http://en.wikipedia.org/wiki/security-
tetsing.

[10] White box testing from wikipedia, the free
encyclopedia.

[11] Software testing for wikipedia available at
http://en.wikipedia.org/wiki/grey_box_testing#grey_
box_tetsing

Mohd. Ehmer Khan

I completed my B.Sc in 1997 and M.C.A. in 2001 from Aligarh
Muslim University, Aligarh, India, and pursuing Ph.D (Computer
Science) from Singhania University, Jhunjhunu, India. I have
worked as a lecturer at Aligarh College Engineering &
Management, Aligarh, India from 1999 to 2003. From 2003 to
2005 worked as a lecturer at Institute of Foreign Trade &
Management, Moradabad, India. From 2006 to present working
as a lecturer in the Department of Information Technology, Al
Musanna College of Technology, Ministry of Manpower,
Sultanate of Oman. I am recipient of PG Merit Scholarship in
MCA. My research area is software engineering with special
interest in driving and monitoring program executions to find
bugs, using various software testing techniques.

Computer and Information Science www.ccsenet.org/cis

4Ps of Business Requirements Analysis for Software Implementation

Mingtao Shi
FOM Fachhochschule für Oekonomie & Management

University of Applied Science
Bismarckstr. 107, 10625 Berlin, Germany

Tel: 49-171-2881-169 E-mail: Consulting_Shi@yahoo.de

Abstract
Introduction of new software applications to achieve significant improvement of business performance is a
general phenomenon that can be observed in a variety of firms and industries. While carrying out such complex
activities, firms are frequently struggling with quality and time, which, as this paper argues, can be achieved by
basing the implementation upon 4Ps of business requirements analysis: Process, Product, Parameter and Project.
Keywords: Requirements analysis, Software implementation, Process analysis, Product analysis,
Parameterisation, Project management
1. Business Requirements and Software Implementation
Industrial firms today are achieving significant scale and scope advantages by introducing new software
applications tailored to firm-specific value chain activities. The pervasive deployment of software and
burgeoning growth of specialist vendors has fostered the emergence of industrial applications based upon core
systems that can be individualised by parameterisation and customisation. Flexible core systems have become
the general pattern of dominant software products in a variety of industries. This trend is especially favourable
for small and medium-sized firms that necessarily concentrate on software application rather than on software
production. These firms are typically in possession of an own low-budget IT department or are outsourcing their
software-related activities to external system integrators or software consultants.
The internal IT department is organically integrated and may be provided with business knowledge quickly, but
it is in most cases too small to conduct software development from scratch for comprehensive business activities.
The external IT specialists may have remarkable software knowledge, but it is organisationally more difficult for
them to assimilate business information of the potential software user firm. Core software systems that can be
parameterised and customised smoothly fit in such business and technical scenarios that are common in a wide
range of industries, such as wholesales, logistics and banking.
How to implement such core systems in accordance with the firm-specific business requirements therefore has
become an area of common interest for practitioners as well as for academicians because of its huge financial
implication. Most of such software systems are expensive. Software itself must be paid. Hardware must be
purchased to accommodate the software. Personnel are to be trained for configuring and using the applications,
most probably, on a short-term run. Furthermore, unsuccessful implementation would lead to inefficient or even
insufficient business performance, causing further unmanageable costs. Under such circumstances, software
requirements analysis for system implementation undoubtedly is crucial for a firm to further prosper or even
survive in the marketplace.
Classical approaches in the area of requirements engineering are rigorously defined and extensively discussed in
the literature. Probably the techniques advanced by Pressman (2004) and Pressman (2009) are most systematic,
which aver that requirements analysis centres on a few key elements, including scenario-based, flow-based,
behaviour-based, and class-based system views, and the data modelling. Other authors have argued more or less
in a similar manner (see Wiegers, 2003; Robertson & Robertson, 2006; Pohl, 2007). However, these rather
technical methods are beneficial for software development from scratch. On the one hand, most of small and
medium-sized business firms lack the capability or are reluctant to spend much resource to conduct such kind of
highly technical analysis. On the other hand, firms adopting software applications would not be able to touch the
codes and there is no need for them to manipulate the underlying technical design in the core. The essence of the
challenge for them is rather the match between business requirements and software application through
parameterisation and customisation.

Computer and Information Science Vol. 3, No. 2; May 2010

2. Analysis of Business Processes
The depictions of business processes are vital notations to describe, examine, streamline a firm’s value activities.
The business requirements analysis may begin with process mapping. The typical notation in this context is the
activity diagram. Although textual use-cases are also widely used for this purpose, the activity diagram is less
time-consuming and more powerful in terms of ease of use and intelligibility, especially for implementation
projects with stringent time demand.
Professionals dealing with Software Engineering frequently use Unified Modelling Language (UML) as a
unified standard platform to map the business processes for requirements analysis purposes. However, in order to
gain the necessary skill, potential analysts have to take part in formal trainings. Furthermore, commercial UML
tools equipped with regular upgrades and updates are likely to be expensive. Microsoft Excel, on the contrary, is
available to most firms and can also be used for process mapping. Whichever tool is used, a number of essential
aspects must be taken into account to sufficiently decipher important information in business processes.
A role is a group of system users performing the same functions in the business processes. The description in the
activity diagram needs to unambiguously define the role conducting a certain activity (process step). If necessary,
detailed explanations can be given to each activity, in order to map the essential content of the activity. It is
highly recommended that the system-related activities are highlighted. By doing so, the business specialists and
the analysts can subsequently defined the data fields of system inputs and outputs at a particular system process
step. It is beneficial that these inputs and outputs are streamlined at a later stage to make the data structure of
future application more meaningful. System printouts need to be indicated, analysed and carefully defined at
relevant process steps. Although the activity diagram is not supposed to be a system specification, it is
recommended that the analysts define as much detailed information as possible, in order to gain time advantages
for further implementation. In a business environment, where time is always in shortage, successful process
mapping with detailed information might be a possible substitute of a time-consuming specification.
Importantly, the mapped processes then need to be tested in the pre-configured software to be introduced, so that
the business specialists can witness that the mapped processes can be realised by the system. This kind of test is
not the highly stringent software testing in the traditional sense, but a kind of functional and performance tests at
a high level. The result of the test is a brief but written Gap Analysis, documenting the activities, process steps or
step sequences that cannot be exactly realised by the system. In most cases, the solution to bridge the disparities
can be found by providing workarounds and customisation. Workarounds are techniques that utilise and combine
the existing system features, in order to achieve the missed linkages to the business processes. Customisation
means that programming at the application level is necessary to map the system to the processes identified,
however without the necessity to touch the source codes in the core system. While implementing a new or new
generation of software system, changes in the core system (development efforts by the coding firm) is generally
not recommended, because this would imply intensive time and resource capacity for expertise exchange
between the implementing and coding firm, unless it is unavoidable. Painful costs are foreseeable.
3. Analysis of Business Products
Not all important information is contained in the business processes. In a retail system for example, firms must
certainly comply with governmental pricing regulations, which are normally readily available in the system.
However, these firms may also desire to create their own product-related pricing and charging schemes based
upon proprietary calculations. This kind of product-specific information typically resides in documented product
descriptions.
The realisation of a product may involve different activities in different processes. Process and product views
shed light on the business portfolio of a firm from different angles of perspective. While analysing the business
products, analysts need to pay intensive attention to features, interfaces, ancillary products. Features are
functionalities (e.g. proprietary calculations), delineating what the product should perform and how the added
value is created. Interfaces include system-internal communication with other products within the portfolio and
system-external communication with other systems if necessary. Today, standardised interfaces have made the
industrial value chain operate more smoothly. Ancillary products are resultant aspects of an existing business
portfolio. Businesses not only reply upon on activities and values, but also on the reporting, controlling and audit
of these activities and values.
Similar to the process mapping, product analysis within the context of system implementation should include a
Gap Analysis that documents the vacuum between the wished products and features provided by the software.
Workarounds or customisation should be conceived, designed and carried out subsequently if necessary. Again,
core changes should be held in a minimum scale.

Computer and Information Science www.ccsenet.org/cis

4. Determination of Application Parameters
After having analysed the processes and products, the business specialists and system analysts need to focus on
the definition of application parameters. User rights and categorised parameter tables must be discussed in great
detail, before figures, number and ranges can be setup in the system to achieve desired products and processes.
Careful definition of user rights is highly salient for the security and smoothness of the business operations. The
analysis of user rights for each user group must at least deal with access to system masks, level of data inputs
and processing, authorisation of data processing, access to data outputs (e.g. reports), and access to system
administration.
Other parameters can be categorised in individual tables as the basis of discussion. Business specialists from
different product or process background must first understand what the parameters defined by the core software
system mean. System specialists or external consultants familiar with the system are required. The central task
here is to map the system defined parameters to the product parameters used in the businesses, in terms of both
definition and terminology. These discussion sessions are highly important. One major result of the
parameterisation is to enable the purchased system to perform the business content of the purchasing firm,
partially through workarounds. Another major result should be that customisation and external development
tasks are figured out in detail, upon which follow-up resource and budget needs can be based.
Multinational businesses are additionally faced with the difficulty of parameter differences that are necessary for
different national marketplaces. Therefore, system implementation may require the definition of a multinational
parameter mix. Because experts of local parameters are located in the respective local markets, parameterisation
under such circumstances may require intensive communication with subsidiaries and representatives located in
other countries. Analysts, business and software specialists in the central headquarters can use this opportunity to
reside in the multinational sites for information elicitation and analysis, and by doing so, amplify the knowledge
of the local business environment. It is worth mentioning that short stays in countries with lower living standard
may not be highly comfortable, but the learning effect potentially achievable may be comfortably high.
Although parameter difference may exist across the national borders, unified process and product definition may
be beneficial to achieve scales economies in the global strategy of the business firms.
5. Project Management
Tailoring the software system to the firm-specific business requirements is a daunting task of high complexity,
consisting of hundreds or even thousands of work units and packages. Without proper management the whole
project would be a monstrous amount of work without prosperity. Project management is in most cases
inevitable in general. In particular, time, resources and budgets are to be planned and managed delicately.
Time management should include a highly detailed listing of work packages and their interdependencies. The
duration of each work package is estimated carefully. Statistical methods such as beta distribution can be
deployed here. Project professionals use to apply network depictions to illustrate the structure of the project and
identify the critical path, upon which the complete project duration may be estimated. Resource loading diagram
and resource levelling help the project management maintain an overview of deployed resources and avoid
extreme over- or under-occupation. In software implementation projects, human resources must be dealt with
more carefully. Holidays and travel plans must be considered. Furthermore, the costs of workarounds,
parameterisation and customisation mentioned above are certainly also a part of overall costs. Although the
top-down approach is the predominant budgeting policy in most implementation projects, the mangers should
always have an open ear for bottom-up information coming from project members to insure a more realistic
budgeting plan. It is worth mentioning that the project communication with top management and external system
supplier must be honest and transparent. Business requirements analysis and software implementation are not
just about software and hardware, but also about trust and relationship. These soft factors can sometimes even be
decisive for the overall success of the project.
6. Conclusion: 4Ps of business requirements for successful software implementation
Parameterisation-capable and customisable software applications tailored to firm-specific business requirements
have become highly coveted in myriads of industries and firms. This paper argues that 4Ps are most essential for
integrating such systems seamlessly in the firm-individual operational environment:
(P)rocess: An effective process mapping should delineate functional roles, process steps and detailed content

of process steps. It should highlight the system-related activities and, data inputs and outputs at these
activities. It should also define system printouts. Analysts and business specialists must analyse the process a
number of times, thereby moving from less to more detailed levels. By conducting a careful GAP analysis,

Computer and Information Science Vol. 3, No. 2; May 2010

analysts can identify the needs for future workarounds and customisation. External development is to be
avoided as much as possible.

(P)roduct: Products must be defined in a written document, which clarifies the product features, internal and
external interfaces and ancillary products. Similarly, a resultant Gap Analysis is strongly recommended. The
software to be purchase must perform what the defined products need, at least through workarounds and
customisation. Important product features should be realised straightaway by the system. External
development is to be held in a minimum scale.

(P)arameter: User rights, process-related and product-related parameters, parameter ranges to be applied in
the system should be defined and reviewed carefully. Sometimes, it is necessary that business parameters
familiar to the business specialists must be mapped to the system parameters familiar to the system specialists.
The analysts should, accompany and intermediate in, this process. Multinational corporations may have to
adapt the parameters to the local operational sites in a parameter mix.

(P)roject: System implementation tailored to firm-specific business requirements consists of complex
activities that can only be effectively and efficiently managed if a proper project management environment is
in place. Completion time, resource allocation and budgeting are most important aspects. Project
management should also take into account the activities necessary after the business requirements analysis
has been carried out. Typical activities are the setup of defined parameters in the system, customisation
activities, acceptance testing, go-live of the system and, screening of results and communication for further
system improvement.

References
Pohl, K. (2007). Requirements Engineering: Grundlagen, Prinzipien, Techniken. Heidelberg: dpunkt.verlag
GmbH.
Pressman, R. S. (2004). Software engineering: A practitioner’s approach. (6th ed.). New York, NY: McGraw
Hill.
Pressman, R. S. (2009). Software engineering: A practitioner’s approach. (7th ed.). New York, NY: McGraw
Hill.
Robertson, S., & Robertson, J. (2006). Mastering the requirements process. (2nd ed.). Westford, MA: Pearson
Education Inc.
Wiegers, K. E. (2003). Software requirements. (2nd ed.). Redmond, Washington: Microsoft Press.

Full Terms & Conditions of access and use can be found at
http://www.tandfonline.com/action/journalInformation?journalCode=ucis20

Journal of Computer Information Systems

ISSN: 0887-4417 (Print) 2380-2057 (Online) Journal homepage: http://www.tandfonline.com/loi/ucis20

Improving Open Source Software Maintenance

Vishal Midha, Rahul Singh, Prashant Palvia & Nir Kshetri

To cite this article: Vishal Midha, Rahul Singh, Prashant Palvia & Nir Kshetri (2010) Improving
Open Source Software Maintenance, Journal of Computer Information Systems, 50:3, 81-90

To link to this article: https://doi.org/10.1080/08874417.2010.11645410

Published online: 11 Dec 2015.

Submit your article to this journal

Article views: 13

View related articles

Citing articles: 2 View citing articles

http://www.tandfonline.com/action/journalInformation?journalCode=ucis20

http://www.tandfonline.com/loi/ucis20

https://doi.org/10.1080/08874417.2010.11645410

http://www.tandfonline.com/action/authorSubmission?journalCode=ucis20&show=instructions

http://www.tandfonline.com/doi/mlt/10.1080/08874417.2010.11645410

http://www.tandfonline.com/doi/citedby/10.1080/08874417.2010.11645410#tabModule

Spring 2010 Journal of Computer Information Systems 81

ImprovIng open SourCe
Software maIntenanCe

vIShal mIdha praShant palvIa
The University of Texas – Pan American The University of North Carolina at Greensboro
Edinburg, TX 78539 Greensboro, NC 27402

rahul SIngh nIr KShetrI
The University of North Carolina at Greensboro The University of North Carolina at Greensboro
Greensboro, NC 27402 Greensboro, NC 27402

Received: June 22, 2009 Revised: August 17, 2009 Accepted: September 9, 2009

aBStraCt

Maintenance is inevitable for almost any software. Software
maintenance is required to fix bugs, to add new features, to
improve performance, and/or to adapt to a changed environment.
In this article, we examine change in cognitive complexity and its
impacts on maintenance in the context of open source software
(OSS). Relationships of the change in cognitive complexity with
the change in the number of reported bugs, time taken to fix the
bugs, and contributions from new developers are examined and
are all found to be statistically significant. In addition, several
control variables, such as software size, age, development status,
and programmer skills are included in the analyses. The results
have strong implications for OSS project administrators; they
must continually measure software complexity and be actively
involved in managing it in order to have successful and sustainable
OSS products.
Keywords: OSS, Complexity, Software Maintenance

IntroduCtIon

The importance of software maintenance in today’s software
industry cannot be underestimated. Maintenance is inevitable
for almost any software. Software maintenance is required to
fix bugs, to add new features, to improve performance, and/or to
adapt to a changed environment. Pigoski [39] illustrated that the
portion of industry’s expenditures used for maintenance purposes
was 40% in the early 1970s, 55% in the early 1980s, 75% in the
late 1980s, and 90% in the early 1990s. Over 75% of software
professionals perform program maintenance of some sort [24].
Given the numbers, the understanding of software maintenance is
prudent.
It is not unusual that a developer modifying the source code
has not participated in the development of the original program
[31]. As a consequence, a large amount of the developer’s efforts
goes into understanding and comprehending the existing source
code [46]. Comprehending existing source code, which involves
identifying the logic in and between various segments of the
source code and understanding their relationships, is essentially
a mental pattern-recognition by the software developer and
involves filtering and recognizing enormous amount of data
[43]. As software is becoming increasingly complex, the task of
comprehending existing software is becoming increasingly tough

[43]. Fjelstad and Hamlen [17] reported that more than 50% of
all software maintenance effort is devoted to comprehension. The
comprehension of source code, thus, plays a prominent role in
software development.
In this article, we examine software complexity and its impacts
in the context of open source software (OSS). Past efforts have
been piecemeal or based on limited information. For example,
comprehension of the source code has been linked with source
code complexity. The empirical evidence on the magnitude of the
link is relatively weak [29]. However, many such attempts are
based on experiments involving small pieces of code or analysis of
software written by students [2]. In order to remedy this situation,
we analyze real world software written by the OSS developer
community. A number of studies has examined the impact of
complexity on maintainability and made recommendations to
reduce the complexity [30][31]. But, no study, to the best of our
knowledge, has tested if the reduced complexity was actually
beneficial to the developers performing software maintenance.
This study specifically examines the impact of change in software
complexity on maintenance efforts.

open Source Software development

A typical open source project starts when an individual (or
group) feels a need for a new feature or entirely new software, and
someone in that group, eventually writes one. In order to share it
with others who have similar needs, the individual/ group releases
the software under a license that allows the community to use, and
to see and modify the source code to meet local needs and improve
the product by fixing bugs. Making software available widely on
an open network, e.g., the Internet, allows developers around the
world to contribute code, add new features, improve the present
code, report bugs, and submit fixes to the current version. The
developers of the project incorporate the features and fixes into
the main source code and a new version of the software is made
available to the public. This process of code contribution and bug
fixing is continued in an iterative manner as shown in Fig 1.
OSS supporters often claim that OSS has faster software
evolution. The idea is that multiple contributors can be writing,
testing, or debugging the product in parallel. Raymond [42]
mentioned that more people looking at the code will result in more
bugs found, which is likely to accelerate software improvement.
The OSS model claims that the rapid evolution produces better

82 Journal of Computer Information Systems Spring 2010

software than the traditional closed model because in the latter
“only a very few programmers can see the source and everybody
else must blindly use an opaque block of bits” [38].
One interpretation of the OSS development process is that of
a perpetual maintenance task. Developing an OSS system implies
a series of frequent maintenance efforts for bugs reported by
various users. As most of the OSS projects are results of voluntary
work [14][48], it is crucial to ensure that such volunteers are able
to work with minimal effort. The motivation for why developers
contribute to a source code has received a great deal of attention
from researchers [34]. However, the factors that can make the
OSS community to not contribute to a source code have received
limited attention.
In this light, von Hippel and von Krogh [51] noted that
the major concern among developers was the complexity of
the source code and the level of difficulty of the embedded
algorithms. Fitzgerald [15] pointed that increasing complexity
posits a barrier in the OSS development and may trigger the need
for either substantial software reengineering or the entire system
replacement. Therefore, it is vital to understand the complexity of
the source code and its impact on software development, and even
more importantly, on OSS development.

oSS and Complexity

A complex project, in general, demands a large share of
resources to modify and correct. When the source code is easy,
it is easier to maintain it. On the contrary, when a source code
is complex, developers have to expend a large portion of their
limited time and resources to become familiar with the source
code. In OSS, where the developers seek to gain personal
satisfaction and value from peer review and are not bound to
projects by employment relationships, they have the option to
leave the project at any time and join other projects where their
resources could be used more efficiently. Therefore controlling
complexity in OSS projects may have several benefits, including
facilitation of new developers’ learning. Feller and Fitzgerald
[14] pointed that if new contributors are to have any chance at
contributing to OSS projects, they should be able to do so with
minimal effort. Controlled complexity helps achieve that; thus
being indispensable for OSS [14].
Much of what we know about software complexity comes

from analyses of closed source development (e.g., [5]). As noted
by Stewart et al [49], even though the results from those findings
have been applied to OSS (e.g., study of Debian 2.2 development
[21]), there remains a relative scarcity of academic research on
the subject. More importantly, these studies were limited to a
small number of projects.
The remainder of the paper is organized as follows: The next
section draws on relevant literature to develop a theoretical model.
It is followed by a description of the methods and measures used
in the study. The following sections present the evaluation of the
model and discussion of the results. The paper is concluded by
acknowledging its limitations and highlighting its contributions
to both research and practice.

model development

Basili and Hutchens [4] define complexity as a measure of the
resources expended by a system while interacting with a piece
of software to perform a given task. It is important to clearly
understand the term “system” in this definition. If the interacting
system is a computer, then complexity is defined by the execution
time and storage required to perform the computation. For
example, as the number of distinct control paths through a
program increases, the complexity may increase. This kind of
complexity is called “Computational Complexity” [11]. If the
interacting system is a programmer, then complexity is defined by
the difficulty of performing tasks. This complexity comes from
“the organization of program elements within a program” [22], for
example, tasks such as coding, debugging, testing, or modifying
the software. This kind of complexity is known as “Cognitive
Complexity”. Cognitive complexity refers to the characteristics
of the software which make it difficult to understand and work
with [11]. It is our primary concern.
The notion of cognitive complexity is linked with the
limitations of short term memory. According to the cognitive
load theory, all information processed for comprehension must at
some time occupy short-term memory [43]. Short term memory
is described as the capacity of information that brain can hold
in an active, highly available state. Short term memory can be
thought of as a container, where a small finite number of concepts
can be stored. If data are presented in such a way that too many
concepts must be associated in order to make a correct decision,

fIgure 1 — oSS development

Spring 2010 Journal of Computer Information Systems 83

then the risk of error increases. In OSS, a voluntary developer
must retain the existing source code in short term memory in
order to successfully modify the existing code. The capacity
of holding information may vary depending on the individual
and may limit the capability of developers to comprehend and
modify the existing source code. Kearney et al [29] suggested
that the difficulty of understanding depends, in part, on structural
properties of the source code. As we are concerned with the
impact of complexity on source code comprehension, we focus
on properties related to the source code. This argument forms the
basis for theorizing the impact of complexity on various aspects
on OSS development, as described below.

number of Bugs

The main idea behind the relationship between complexity
and number of bugs is that when comparing two different
solutions to the same problem, all other things being equal, the
more complex solution will generate more bugs. This relationship
is one of the most analyzed by software metrics’ researchers and
previous studies and experiments have found this relationship to
be statistically significant [11][27].
In order for a programmer to understand the existing source
code, he needs to understand the flow of logic. And, when a
programmer has to deal with a source code with high cognitive
complexity, he has to frequently search among dispersed pieces
of code to determine the flow of logic [40]. Understanding and
recollecting such dispersed pieces increase the cognitive load on
the programmer making complex code maintenance more liable to
human errors. Complex software, hence, need more maintenance
efforts. Gill and Kemerer [20] reported that the number of bugs
in a program is positively associated with maintenance effort and
recommended further empirical testing with a larger data set.
Therefore OSS projects which experience increase in complexity
over its previous version also would experience an increase in the
number of bugs (over its previous version). Based on above, we
propose:

H1: An increase in the source code’s cognitive complexity
is positively associated with an increase in the number
of bugs in the OSS source code.

Contributions from new developers

Because of the important role of volunteer developers in the
OSS development, attracting new developers and keeping them
motivated is crucial to OSS development. Keeping the developers
motivated is especially important during the early development
stage so that the number of developers can reach a critical mass.
Some of the cited developers’ motivations include intellectual
gratification, career future incentives, learning and enjoyment,
ego-boosting, and peer recognition [6][8][35][37].
Once a new developer is motivated to voluntarily contribute,
he needs to first spend a large amount of time and resources to
understand the existing source code. When the source code is easy
to comprehend, it is easier to modify. However, when the source
code is complex, a developer is required to invest additional
effort and resources to understand it. Devoting such effort and
resources may pose a barrier to the developer’s motivation to
contribute. Such a barrier may lead the potential developer to
not contribute to the project at all, or, in worst case, to leave the
project. Hence,

H2: An increase in the source code’s cognitive complexity
is negatively associated with an increase in the
number of contributions to the OSS source code from
new developers.

time to fix Bugs

More complex source code adds to a programmer’s cognitive
load [12]. High cognitive load requires more time-consuming and
resource-demanding effort to familiarize oneself with the code.
It is even possible that a source code is so complex that it cannot
be comprehended at all. In such a scenario, the programmer may
spend time and resources on other activities, thereby further
lowering the productivity of the project.
In other words, a source code with lesser cognitive complexity
does not need as much effort or resources, thus reducing the
turnaround time required to fix repairs. This leads to the next
hypothesis that OSS projects which experience increase in
cognitive complexity over its previous version require longer time
to fix the bugs. Hence, we hypothesize,

H3: An increase in the source code’s cognitive complexity
is positively associated with an increase in the average
time taken to fix the bugs in OSS source code.

Combining all the preceding conceptual arguments gives
the research model shown in Fig. 2. Note that several control
variables have been included in the model in order to increase
the robustness of our findings. The specific variables will be
described in the next section.

methodS

The following explanation is helpful in understanding the
research design and methods. This study investigates the impact
of change in complexity. To compute the change in complex-
ity, the complexity of two consecutive versions of software
must be looked at. It is important to note that the complexity
of the source code of a software version can only be measured
after it has been released to the OSS community. Only after it
has been used, the discovered bugs are reported and the code
is modified to fix these bugs. Once significant amount of modi-
fications have been made, a new version is released to the pub-
lic. Due to the modifications in the source code, the complexity
of the source code changes. In order to compute the change in
complexity of the current version (say Nth) from its previous
version (N-1th), one needs to measure the complexity of both the
current (Nth) and the previous version (N-1th). As the modifications
and contributions made to the current version (Nth) are available
in the next version (N+1th), one needs to also look at the next
version (N+1th) to find these modifications and contributions.
As a consequence, for each project, we need to study three
releases, referred to as the first (N-1th), the second (Nth), and the
third (N+1th

OSS projects hosted at SourceForge were examined in this
study. SourceForge is the primary hosting place for OSS projects
which houses about 90% of all OSS projects. It has been argued
SourceForge is the most representative of the OSS movement, in
part because of its popularity and the large number of developers
and projects registered [23][54]. Researchers interested in
investigating issues related to the OSS phenomenon have
predominantly used SourceForge data [23][51][54].

84 Journal of Computer Information Systems Spring 2010

Studying all projects hosted on SourceForge was unfeasible
and impractical due to resource limitations. Data selection
was limited to projects that were targeted to either end users
or developers. In order to avoid ambiguity, projects that were
targeted to both end users and developers were excluded.
Further selection was made by controlling for the programming
language and the operating system. Past literature suggests that
programming language has an explicit impact on complexity
[52] and program size [28]. It is also difficult to compare lines
of code between “high” and “low” level programming lan-
guages. Lower level programming languages have more lines
of code and take longer to develop than higher level program-
ming languages. As C family of languages is the most preferred
by the OSS developers [45], only projects written in C/C++
or multiple languages including C/C++ were selected. Sec-
ondly, operating system of the project impacts the complexity
of the software and the development effort required. To en-
compass majority of the projects targeted for developers and end
users, all projects in the data set were designed either for the
Windows or the Linux/Unix operating system.
As the data was collected from three different versions of
software, the sample was further restricted to the projects that
had at least 3 versions. A version released between first 3-months
of the registration date is considered First release, another major
version released between 3 to 6 months of its registration date is
considered Second release, and yet another major version released
within 6 to 12 months of its registration date is considered the
Third release for this study. Therefore, to be able to get the data
for three different versions, we considered all projects that were
registered between SourceForge between January 2003 and
August 2006 so that the third release for the projects that were
registered in August 2006 was released by August 2007. The final
data collection was completed in August 2007. Lastly, projects
were chosen for which the required data were publicly available

(not all projects allow public access to the bug tracking system

Following the above criteria, the final sample size was limited to
450 projects.

meaSureS

Cognitive Complexity

McCabe’s cyclomatic complexity (CC) assesses the diffi-
culty faced by the maintainer in order to follow the flow con-
trol of the program. It is considered an indicator of the effort
needed to understand and test the source code [47]. Kemerer
and Slaughter [30] used McCabe’s cyclomatic metric to eval-
uate decision density, which represents the cognitive burden
on a programmer in understanding the source code. In order
to compute cyclomatic complexity, each source code file was
subjected to a commercial software code analysis tool. To ac-
count for the effects of size, the complexity metric was normalized
by dividing it by the number of lines of code for each software
project. This procedure also reduces collinearity problems when
size is included in the regression models [20]. The Change in
Cognitive Complexity (ChgCC) was calculated by subtracting
cyclomatic complexity measure of the first version from the
cyclomatic complexity measure of the second version, i.e.,
CC

2nd
– CC

1st
.

Change in number of Bugs and time taken to fix Bugs

Various elements of data were extracted from the bug tracking
system and the Concurrent Versioning System (CVS) reports,
including the bugs reported, the date on which the bugs were
reported, the date on which the bugs were fixed, and the version
number. One problem faced was that all the bugs in the current
version were not closed at the time of the study. To overcome

fIgure 2 — the research model

Spring 2010 Journal of Computer Information Systems 85

the problem, earlier versions that had more than 90% of the bugs
closed at the time of study were included. From these extracted
elements, the number of bugs reported and the time taken to fix
them for different software versions were computed. From the
number of bugs and the time to fix these bugs for each version,
the change in the number of bugs (ChgBugsReported) over
previous version and the change in the average time to fix the
bugs (ChgFixTime) were computed (i.e., BugsReported

3rd
–

BugsReported
2nd

Contributions from new developers

Software developers use CVS to manage the software
development process. CVS stores the current version(s) of the
project and its history. A developer can check-out the complete
copy of the code, work on this copy and then check back the
changes. The modifications are peer reviewed ensuring quality.
CVS updates the modified file automatically and registers it as a
commit. CVS keeps track of what change was made, who made
the change, and when the change was made. This information
can be gathered from the log files of the CVS repository of a
project. As CVS commits provide a measure of novel invention
that is internally validated by peers [10][23], the number of
CVS commits is used as a measure of contributions of devel-
opers. A commit is considered as ‘contribution from a new
developer’, when the developer has not contributed to the pre-
vious version. The number of contributions made by new
developers is represented as ChgNewDevs (i.e., ChgNewDevs

3rd

– ChgNewDevs
2nd

Control variables

Age

Brook’s Law [7] states that “adding more programmers to a
late project makes it later”. Based on this, adding new developers
at later stages will increase the average time taken to fix bugs. On
the other hand, age may indicate the legitimacy and popularity
of the software. Popular software attracts more developers and
thus older software will have higher number of contributions
from developers. To control for age, the Age variable is defined
as the number of months till the second release since a project’s
inception at SourceForge.

Size

Size is the oldest measure of software complexity and is
believed to be a major driver of software maintenance effort [53].
Larger software is likely to receive more enhancements and more
repairs than smaller software, ceteris paribus, as larger software
embodies greater amount of functionality subject to change. The
larger the software, the more difficult it is to test and validate its
functionality. This implies that larger software tend to incorporate
more errors. Keeping the above in mind, Size is used as a control
variable and is captured by the number of lines of code of the
second release.

Number of downloads

OSS developers can leverage the law of large numbers to
identify and fix the bugs [41]. Given enough eyeballs, all bugs
are shallow. A huge user base for the software implies that the

software will be tested in numerous different environments, more
bugs will surface, these will be communicated efficiently to more
bug fixers, the fix being obvious to someone, and the fix will
be communicated effectively back and integrated into the core
of the product. To isolate this effect, the number of cumulative
downloads (Downloads) of till second release of the project is
used as a control variable.

New Developer Knowledge and Skills

The literature on performance has identified individual
characteristics such as knowledge and skills as antecedents.
Such characteristics are, however, difficult to measure, and are
frequently measured through the use of surrogate measures like the
level of education and experience. Curtis et al. [11] reported that
in a series of experiments involving professional programmers,
the number of years of experience was not a significant predictor
of comprehension, debugging, or modification time, but that
number of languages known was. They suggest that the breadth
of experience may be a more reliable guide to ability than length
of programming experience. In this work, we also use the breadth
of the experience as a surrogate for developer’s knowledge and
skills. So, to control for the effect of new developers’ skills,
the variable SkillsChg (i.e. Skills

2nd
–Skills

1st
) was used and was

measured by the change in team skills with the addition of new
developers to the team.

Sponsorship

An increasing number of open source projects have opted
to receive monetary donations from organizations and users.
Although some developers and projects choose to allocate part or
all of the incoming donations to SourceForge, most recipients of
the donations rely on monetary support to fund development time
and other key resources that are necessary for the continuation of
the projects. It is expected that developers receiving additional
monetary benefits will devote extra effort and time into
comprehending and fixing the source code. The control variable
AcceptSponsors is used to capture whether a project is accepting
external funds and using monetary compensation as part of its
incentive mechanism. It takes the value of 1 if the project is
accepting donations and 0 otherwise.

Development Status and Maturity

To capture the development stage of a project, which is
typically determined by the developer in charge of the project on
SourceForge, the control variable DevStatus takes values ranging
from 1 to 6 representing development stages of Planning, Pre-
Alpha, Alpha, Beta, Production/Stable, and mature respectively.
DevStatus was also measured at second release. The larger the
value of DevStatus, the more mature the project is.

transformations

Initial investigations indicated that the dependent variable and
many of the independent variables were not normally distributed.
In such case, linear regression analysis might yield biased and non
interpretable parameter estimates [19]. Therefore, as suggested
by Gelman and Hill [19], a logarithmic transformation on the
dependent and the not-normally distributed independent variables
was performed.

86 Journal of Computer Information Systems Spring 2010

reSultS

The Variance Inflation Factor (VIF) was computed for all
variables in order to test for multicollinearity. VIF is one measure
of the effect other independent variables might have on the
variance of a regression coefficient. Large VIF values indicate
high multicollinearity. Studenmund [50] recommends a cut
of 10 for VIF. The VIF values for the different variables in the
regression analyses are reported in Table 1, and in no case exceed
1.2. The low VIF values indicate that multicollinearity is not a
serious problem.
As we are interested in studying the impact of change of
complexity on three dependent variables which are largely
distinct, we formulate three separate regression equations
analyzing each of the dependent variables. For the dependent
measure, ChgBugsReported, the impact of change in complexity
on the number of bugs (Hypothesis H1) was found by estimating
the parameters in the following regression model:

ChgBugsReported = α + β
1
ChgCC + β

2
lnSize + β3lnDownloads

+ β
4
AcceptSponsors + β

5
DevStatus + β

6
lnAge + β

7
lnSkillsChg

A positive and significant estimate of parameter β
1
would

indicate that the probability of having bugs in a source code
increases as the cognitive complexity of software increases. The
results of the regression (Hypothesis 1) are presented in Table 1.
The model shows a good fit with the data (F=33.552, p<0.00). The parameter estimate for ChgCC is positive and significant (β

1
=0.303, p<0.00). The results suggest that projects with unit

increase in cognitive complexity experience 0.303 units increase
in the number of bugs, and H1 is supported. The studied variables
explained 37.5% of the total variance in the change in bugs
reported (R2=0.375).
Tested next is the impact of complexity on the number of
contributions from new developers (hypothesis H2) by estimating
the parameters for the following regression model:

ChgNewDevCommits = α + β
1
ChgCC + β

2
lnSize +β

3
lnDownloads

β
4
Sponsors + β

5
DevStatus + β
6
lnAge + β
7
lnSkillsChg

The results of the regression (Hypothesis 2) are presented
in Table 1. The model shows good fit with the data (F=34.702,
p<0.000). The parameter estimate for ChgCC is significantly

negative (β
1
=-0.359, p<0.000). The results suggest that a unit

increase in cognitive complexity decreases the contributions from
new developers by 0.359 units. Hypothesis H2 is supported. The
studied variables explained 38.5% of the total variance in the
change in new developers’’ commits (R2=0.385).
Finally, examined is the impact of complexity on the time
taken to fix bugs (hypothesis H3) by estimating the parameters
for the following regression model:

Time to fix bugs = α + β
1
ChgCC + β

2
lnSize + β

3
lnDownloads ++

β
4
Sponsors + β
5
DevStatus + β
6
lnAge + β
7
lnSkillsChg

Table 1 shows the results of the regression analysis (Hypothesis
3). The model shows a good fit with the data (F=70.660, p<0.000). The parameter estimate for ChgCC is significant and positive (β

1
=0.720, p<0.000), indicating that projects that experience a

unit increase in cognitive complexity takes 0.720 units additional
time to fix bugs. Thus hypothesis H3 supported. The studied
variables explained 56.1% of the total variance in the change in
time taken to fix the reported bugs (R2=0.561).

dISCuSSIon and ImplICatIonS

main effects

The increase in the cognitive complexity of open software as
it evolves over time is of significant concern, as it will make
software maintenance increasingly difficult. In the extreme,
developers may stop making fixes and refinements rendering
the software error-prone and obsolete. Ultimately the open
software may perish its own death, be replaced by another
software project, or may go a major and laborious overhaul; all
options are expensive. In this section, we discuss our findings on
how complexity and control variables influence different aspects
of software maintenance.
The literature shows mixed support for the negative impact
of complexity on software quality. For example, Harter and
Slaughter [25] found a negative association between complex-
ity and quality. However, Gaffney [18] did not find software
complexity to be associated with error rates. Fitzsimmons
and Love [16] reported that the correlation between cognitive
complexity and the reported number of bugs ranges from 0.75
to 0.81. In our data, the correlation between the number of

taBle 1 — regression results

hypothesis 1 hypothesis 2 hypothesis 3 Collinearity
model Statistics
β Sig. β Sig. β Sig. (VIF)

ChgCC .303 .000 -.359 .000 .720 .000 1.150

Size .228 .000 -.157 .000 .010 .775 1.160

Downloads .173 .000 .099 .016 -.100 .004 1.205

AcceptSponsors -.171 .000 .330 .000 -.082 .012 1.053

DevStatus -.067 .083 .097 .011 -.010 .764 1.042

Age .156 .000 .038 .350 -.040 .245 1.177

SkillsChg -.016 .664 -.105 .006 .069 .030 1.014

Adjusted R-Square 0.375 0.385 0.561

Spring 2010 Journal of Computer Information Systems 87

bugs reported and complexity was 0.43. It is interesting to
note that the correlation found in this study was much smaller
than the correlations reported in earlier studies for non-open
source software; however, it is consistent with the literature
on OSS. In the context of OSS, Schröter et al. [44] reported
the correlation value in the range of 0.40. Furthermore, Kem-
erer and Slaughter [30] found that complex software is more
frequently repaired, which has the effect of increasing the
number of bugs. Therefore, it can be said with confidence that
as the complexity of the software increases, the number of re-
ported bugs, and by implication the actual number of bugs
increases.
Another measure of software quality is the time taken to fix
bugs. In fact, by mining software histories of two projects, Kim
and Whitehead [32] recommended to use time taken to fix bugs
as a measure of software quality. In our analysis, we found that
the complexity of software has a strong positive influence on the
time taken to fix bugs. It is common that when a bug is fixed in
one segment of the source code, it usually causes ripple effects
and adjustments in other segments [36]. The more complex the
software is, the more are the adjustments in other segments. As
a consequence, the developer has to simultaneously understand,
and repair related pieces in dispersed segments. Handling all
segments together has a detrimental effect on the time devoted by
the developer because more time is needed to follow the flow of
logic within the code [3]. This is supported by several empirical
studies that have found that time required to fix bugs increases
as complexity increases [5][20]. This result has another spurious
effect on software maintenance. When a developer becomes
conscious of long time needed to fix a bug, there is tendency for
the developer find “quick and dirty” solutions, thereby making
the code even less maintainable. Such half-baked efforts lead to
a vicious cycle in which the complexity, the number of bugs, and
the time taken to fix those bugs feed on each other until a dead end
is reached with the only option of either reengineering the project
or shutting it down completely.
Another reason for the longer time to fix bugs in complex
code can be found in Dymo’s [13] observations. Dymo noted that
most people prefer to work on software enhancements by adding
features rather than working on fixing bugs. This is especially
true, when the source code is more complex. Debugging and
understanding the existing code, written by someone else, takes
more time and resources. As the majority of the work is done on
voluntary basis in open software and developers are not bound by
contracts, developers tend to work on new versions of the software
rather than continue to work on improving the old ones. Although
this has the potential of bringing them more visibility in the OSS
community, the net effect is further delay in fixing bugs.
Another impact of source code complexity analyzed in the
study is on attracting contributions from new developers. Analysis
shows that cognitive complexity has a strong negative influence
on the number of contributions from new developers. As OSS
thrives upon voluntary contributions, the project managers must
actively control the source code complexity in order to attract
contributions from new developers. In a complex piece of code,
it takes longer for a developer to determine the flow of logic
resulting in slower progress of the project [40]. Cavalier [8]
pointed that the willingness of people to continue to contribute
to a project is related to the progress that is made in the project.
If a large number of activities do not seem to be moving forward,
participants lose interest, leading them to leave the project. This
leads to a higher likelihood of activities not being completed, and

ultimately, the death of the project. Such projects become inactive
over time and fail to attract any contributions.

effects of Control variables

Interesting observations can be made based on the effects of
the control variables. Our analysis found strong effects of size on
the number of bugs and the number of contributions from new
developers. It is often argued that complexity and size are strongly
correlated and that could lead to the problem of multicollinearity,
which tends to inflate regression coefficients. As mentioned
earlier, multicollinearity was tested by computing variance
inflation factors and was found to be within permissible limits.
Accordingly the effects of size are independent of the effects of
complexity.
The number of downloads has strong effects on the number of
bugs, time to fix bugs, and the number of contributions from new
developers. The number of downloads indicates the popularity
of a project; popular projects attract more user and developers
[33]. As the number of users and developer community grow,
the number of eyes watching the source code increases. As Eric
Raymond [41] repeatedly mentions “to many eyes, all bugs are
shallow”. When source code is open and freely visible, users can
readily identify flaws. The probability of finding a bug increases
with the increase in the number of eyes. As a result, the number
of hands working on code also increases leading to increased
contributions from new developers.
The continued development of a project, represented by
its age, gives software legitimacy, reputation and attention of
the community. However, in our study, age did not show any
significant effect. The reason could be because a large number of
OSS projects on SourceForge are in early stages of development
and there was not much variance in the data. This could be
attributed to the ease with which new projects can be started.
Such projects become inactive over time and have almost zero
contributions from the developer community. It could be argued
that age can bring legitimacy, reputation, and attention only if the
project is active. Therefore, a more reliable indicator of continued
development is the development status of a project, which was also
studied and was found to have a significant positive impact on the
number of commits from new developers. In the OSS literature,
development status has been shown to have a positive impact on
project’s popularity. Al Marzouq et al. [1] argue that a project
attracts more developers as the software becomes more stable.
In turn, these new developers bring effort and contribution that
improves the software. A growth cycle begins a network effect
that feeds both the community and development of the software.
Lakhani and Wolf [34] showed that developers receiving
money in any form spend more time working on OSS than their
peers. Similar results are shown by this study. We found that the
projects that have any form of sponsorship have higher number
of contributions from new developers. Such projects also had less
number of bugs and took lesser time to fix the bugs. This clearly
indicates that developers are receptive to external stimuli such
as a monetary reward. Henkel [26] illustrated a similar impact
of external sponsorship on the development of applications for
Linux, one of the most successful OSS project. Henkel noticed
that most contributors in the field of embedded Linux are salaried
or contract developers working for commercial firms.
The change in team skills with the addition of new developers
was found to have significant influence on the number of
contributions from new developers and the time taken to fix

88 Journal of Computer Information Systems Spring 2010

bugs. However, both relationships were in a direction opposite to
what was expected. The expectation was that as new developers
increase, the number of contributions will increase and the time
taken to fix bugs will reduce. The opposite directions of the
relationships indicate that with the increase in number of skills,
the overall time to fix bugs increases and the new contributions
decrease. A logical explanation is that either the developers are
just joining the development team without actually contributing
towards project development or the amount of contributions is
not proportionate to the number of skills they possess. Possibly
the same core group of developers are largely responsible for
the majority of contributions, and new developers do not add
anything substantive. This logic is consistent with the commonly
held belief in OSS that development follows Pareto’s law, where
a small number of developers (~20%) are responsible for the
majority of the work accomplished (~80%).

lImItatIonS and ContrIButIonS

Some limitations of the study need to be pointed out. The first
limitation is the sample frame. While SourceForge has data about
a vast collection of OSS projects, it does not capture all OSS
projects, which is the ultimate population of interest. While the
sample size is by far large enough to ensure statistical validity,
the choice of the sample frame may have some bearing on the
outcomes of the study. Additionally, it can be argued that the
change log only records the committer; whether the developer of
the code is ever acknowledged is uncertain. And, do all bugs get
reported? There could be bugs that are probably fixed but never
reported.
In spite of the limitations, this study makes important
contributions to both the literature and practice. The results are
robust as the hypotheses regarding cognitive complexity were
supported after having controlled for various factors. In other
words, our conclusions cannot be seen as artificial due to possible
correlation with other factors. The most important contribution
is the strong support for the relationships between cognitive
complexity and software quality, and cognitive complexity and
contributions from new developers. Our models indicate that,
on the average, OSS development projects with high cognitive
complexity are significantly associated with increased bugs and
repair time and decreased contributions from new developers.
These findings have at least two immediate implications for
software managers and project administrators. First, they must
measure software complexity on a continual basis, at least once for
each release or at regular intervals. Second, they need to implement
guidelines for upper bounds of complexity and recommend that
software versions at no stage exceed these guidelines. However,
no standard guidelines are probably universally applicable for all
software development projects. Developers and administrators
may want to set their own standards for their specific projects, like
the NSA (National Security Agency) standard, which is derived
from an analysis of 25 million lines of software code written for
NSA.
Furthermore, project administrators for OSS projects need to
learn the importance of controlling complexity. As recommended
by Lehman [35], strategies need to be developed not only to
control complexity, but also to actively reduce it. As a software
project progresses, it becomes increasingly complex making it
difficult to understand and manage [14]. Project administrators
need to be careful about subsequent changes between different
versions. Such changes can have strong debilitating impacts on

projects. If changes are not well monitored, they can lead to a
ripple effect. Ripple effect refers to the phenomenon of changes
made to one part of the software affecting and propagating to
other parts of the software. Lehman’s operating system example
clearly shows the ripple effect since the percentage of modules
changed in Release 15 is 33% while the percentage of modules
changed in Release 19 is 56%. The OSS development, thriving on
voluntary contributions, must keep a close watch on the cognitive
complexity of the software in order to attract contributions from
new developers.
Another important contribution of this research is for
organizations involved in or interested in getting involved in
OSS development. Our results indicate that, contrary to OSS
ideological beliefs, offering a monetary reward for participation
may successfully attract increased contributions from the OSS
community.

referenCeS

[1]. AlMarzouq, M., Zheng, L., Rong, G., and Grover, V.
“Open Source: Concepts, Benefits, and Challenges,” Com-
munications of the AIS, 16, Article-37, 2005, pp.756:784.

[2]. Banker, R. D., Datar, S., Kemerer, C., and Zweig, D.
“Software Complexity and Maintenance Costs,” Com-
munications of the ACM, 36(11), 1993, pp.81-94.

[3]. Banker, R., Davis, G., and Slaughter, S. “Software
development practices, software complexity, and software
maintenance effort: a field study,” Management Science,
44(4), 1998, pp.433-450.

[4]. Basili, V. and Hutchens, D. “An Empirical Study of a
Syntactic Complexity Family,” IEEE Trans. Software
Engineering, 9, 1983, pp.664-672.

[5]. Boehm, B. Software Engineering Economics, Prentice-
Hall, New York, 1981.

[6]. Bonaccorsi, A., and Rossi, C. “Why open source software
can succeed,” Research Policy, 32(7), 2003, pp.1243-1258

[7]. Brooks, F. The Mythical Man-Month, Addison-Wesley,
Reading, Mass., 1975.

[8]. Carillo, K. and Okuli, C. “The Open Source Movement:
A Revolution in Software Development,” Journal of
Computer Information Systems, 49(2), Winter2008/2009,
pp.1-9.

[9]. Cavalier, F. “Some Implications of Bazaar Size,” 1998,
available at http://www.mibsoftware.com/bazdev/ accessed
8 May 2006.

[10]. Crowston, K., Annabi, H, Howison, J. “Defining Open
Source Software Project Success,” Proceedings of ICIS,
Seattle, WA, 2003.

[11]. Curtis, B., Sheppard, S., Milliman, P., Borst, M., and Love,
T. “Measuring the Psychological Complexity of Software
Maintenance Tasks with the Halstead and McCabe
Metrics,” IEEE Transactions on Software Engineering,
5(2), 1979, pp.96-104.

[12]. Darcy, D., Kemerer, C., Slaughter, S., and Tomayko, J.
“The Structural Complexity of Software: An Experimental
Test,” IEEE Transactions on Software Engineering, 31(11),
2005, pp.982-995.

[13]. Dymo, A. “Open Source Software Engineering,” II Open
Source World Conference, Málaga, 2006.

[14]. Feller, J. and Fitzgerald, B. Understanding open source
software development, London: Addison-Wesley, 2002.

[15]. Fitzgerald, B. “Has Open Source Software a Future?,”

Spring 2010 Journal of Computer Information Systems 89

Perspectives on Free and Open Source Software, MIT
Press, 2005, pp.93-106.

[16]. Fitzsimmons, A. and Love, T. “A review and evaluation
of software science,” Computer Survey, 10(1), 1978, pp.3-
18.

[17]. Fjeldstad, R. and Hamlen, W. “Application program
maintenance-report to our respondents,” Tutorial on
Software Maintenance, 1983, pp. 13-27.

[18]. Gaffney, J. “Estimating the Number of Faults in Code,”
IEEE Transactions on Software Engineering, 10(4), 1984,
pp. 13-27.

[19]. Gelman, A., and Hill, J. Data Analysis Using Regression
and Multilevel/Hierarchical Models, Cambridge University
Press, 2007.

[20]. Gill, G. and Kemerer, C. “Cyclomatic complexity density
and software maintenance productivity,” Transactions on
Software Engineering, 17(12), 1991, pp. 1284-1288.

[21]. González-Barahona, J., Miguel A, Pérez, O, Quirós, P.,
González, J., and Olivera, V. “Counting potatoes. The size
of Debian 2.2,” Upgrade, 2(6), 2001, pp. 60-66.

[22]. Gorla, N., and Ramakrishnan, R. “Effect of Software
Structure Attributes Software Development Productivity,”
Journal of Systems and Software, 36(2), 1997, pp. 191-
199.

[23]. Grewal, R., Lilien, G., Mallapragada, G. “Location,
Location, Location: How Network Embeddedness Affects
Project Success in Open Source Systems,” Management
Science 52(7), 2006, pp. 1043-1056.

[24]. Harrison, W. and Cook, C. “Insights on improving the
maintenance process through software measurement,”
Proceedings of Conference on Software Maintenance, San
Diego,CA, 1990, pp. 37-44.

[25]. Harter, D. and Slaughter, S. “Process maturity and
software quality: a field study,” International Conference
on Information Systems, Brisbane, Australia, 2000, pp.
407-411.

[26]. Henkel, J. “Selective Revealing in Open Innovation
Processes: The Case of Embedded Linux,” Research
Policy, 35(7), 2006, pp. 953-969.

[27]. Henry, S., Kafura, D., and Harris, K. “On the Relationship
among Three Software Metrics,” ACM SIGMETRICS:
Performance Evaluation Review, 10(1), 1981, pp. 81-88.

[28]. Jones, T. Programming Productivity, McGraw-Hill, Inc.,
New York, 1986.

[29]. Kearney, J., Sedlmeyer, R., Thompson, W., Gray, M., and
Adler, M. “Software Complexity Measurement,” Com-
munications of the ACM, 29(11), 1986, pp. 1044-1050.

[30]. Kemerer, C. and. Slaughter, S. “Determinants Of Software
Maintenance Profiles: An Empirical Investigation,” Soft-
ware Maintenance: Research And Practice, 9(4), 1997, pp.
235-251.

[31]. Kemerer, C. F. “Software complexity and software main-
tenance: A survey of empirical research,” Annals of
Software Engineering, 1(1), 1995, pp. 1-22.

[32]. Kim, S., Whitehead, E, and Bevan, J. “Analysis of signature
change patterns,” Proceedings of the 2005 international
workshop on Mining software repositories, St.Louis,MO,
2005, pp. 1-5.

[33]. Krishnamurthy, S. “Cave or Community? An Empirical
Examination of 100 Mature Open Source Projects,” First
Monday, 7(6), 2002.

[34]. Lakhani, K., and Wolf, B. “Why Hackers Do What They

Do: Understanding Motivation and Effort in Free/Open
Source Software Projects,” Perspectives on Free and Open
Source Software, MIT Press, Cambridge, 2005.

[35]. Lerner, J., and Tirole, J. “Some Simple Economics of Open
Source,” The Journal of Industrial Economics, 1(2), 2002,
pp. 197-234.

[36]. Loch, C., Mihm, J., and Huchzermeier, A. “Concurrent
Engineering and Design Oscillations in Complex
Engineering Projects,” Concurrent Engineering, 11(3),
2003, pp. 187-199.

[37]. Markus, M., Manville, B., and Agres, C. “What makes a
virtual organization work?,” Sloan Management Review,
42(1), 2000, pp. 13-26.

[38]. Opensource.org, “The Open Source Definition (Version
1.9)”, 2002, at http://www.opensource.org/ docs/definition.
html, accessed 5 May 2006.

[39]. Pigoski, T. Practical Software Maintenance. Wiley com-
puter publishing, 1997.

[40]. Ramanujan, S. and Cooper, R. “A human information
processing approach to software maintenance,” Omega,
22(2), 1994, pp. 85-203.

[41]. Raymond, E. “The Cathedral and the Bazaar,” 1999, at
http://tuxedo.org/~esr/writings/cathedral-bazaar/

[42]. Raymond, E. The cathedral and the bazaar: musings on
Linux and open source by an accidental revolutionary,
Sebastopol, CA, O’Reilly, 2001.

[43]. Rilling, J. and Klemola, T. “Identifying Comprehension
Bottlenecks Using Program Slicing and Cognitive
Complexity Metrics,” Proceedings of the 11th IEEE
International Workshop on Program Comprehension,
2003, pp. 115.

[44]. Schröter, A., Zimmermann, T., Premraj, R., and Zeller, A.
“If Your Bug Database Could Talk . . . ,” Proceedings of
ACM-IEEE 5th International Symposium on Empirical
Software Engineering, Volume II: Short Papers and
Posters, Brazil, 2006.

[45]. Sen, R, Subramaniam, C, and Nelson, M. “Determinants of
the Choice of Open Source Software License,” Journal of
Management Information Systems, 25(3), 2008-9, pp. 207-
240.

[46]. Smith, N., Capiluppi, A., and Ramil, J. “Agent-based
Simulation of Open Source Evolution,” Software Process
Improvement and Practice, 11(4), 2006, pp. 423-434.

[47]. Stamelos, I.; Angelis, L.; Oikonomou, A.; and Bleris,
G. “Code Quality Analysis in Open Source Software
Development,” Information Systems Journal, 12(1), 2002,
pp. 43-60.

[48]. Stewart, K., Ammeter, A., Maruping, L. “A Preliminary
Analysis of the Influences of Licensing and Organizational
Sponsorship on Success in Open Source Projects,”
Proceedings of the 38th Hawaii International Conference
on System Sciences, 2005, pp. 197-203.

[49]. Stewart, K., Darcy, D., Daniel, S. “Observations on Patterns
of Development in Open Source Software Projects, Open
Source Application Spaces,” Fifth Workshop on Open
Source Software Engineering, 2005, St Louis, MO, pp. 1-
5.

[50]. Studenmund, A. Using Econometrics: A Practical Guide,
Harper Collins, New York, NY, 1992.

[51]. vonHippel, E., G. vonKrogh. “Open Source Software and
the “Private-Collective” Innovation Model: Issues for
Organization Science,” Organization Science, 14(2), 2003,

90 Journal of Computer Information Systems Spring 2010

pp. 209-225.
[52]. Weyuker, E. “Evaluating software complexity measures,”

IEEE Transactions on Software Engineering, 14(9), 1988,
pp. 1357-1365.

[53]. Withrow, C. “Error Density and Size in Ada Software,”

IEEE Software, 7(1), 1990, pp. 26-30.
[54]. Xu, J., Gao, Y, S. Christley, G. Madey. “A Topological

Analysis of the Open Source Software Development
Community,” Proceedings of the 38th HICSS, 2005, pp.
198.

Journal of Theoretical and Applied Information Technology

www.jatit.org

MODEL BASED OBJECT-ORIENTED SOFTWARE TESTING

1SANTOSH KUMAR SWAIN, 2SUBHENDU KUMAR PANI, 3DURGA PRASAD MOHAPATRA
1School of Computer Engineering, KIIT University, Bhubaneswar, Orissa, India-751024

2 Department of Computer Application, RCM Autonomous, Bhubaneswar, Orissa, India -751021
3Department of Computer Science & Engineering, NIT, Rourkela, Orissa, India

ABSTRACT

Testing is an important phase of quality control in Software development. Software testing is necessary to
produce highly reliable systems. The use of a model to describe the behavior of a system is a proven and
major advantage to test. In this paper, we focus on model-based testing. The term model-based testing
refers to test case derivation from a model representing software behavior. We discuss model-based
approach to automatic testing of object oriented software which is carried out at the time of software
development. We review the reported research result in this area and also discuss recent trends. Finally, we
close with a discussion of where model-based testing fits in the present and future of software engineering.

Keywords: Testing, Object-oriented Software, UML, Model-based testing.

1. INTRODUCTION

The IEEE definition of testing is “the process of
exercising or evaluating a system or system
component by manual or automated means to verify
that it satisfies specified requirements or to identify
differences between expected and actual results.”
[16]. Software testing is the process of executing a
software system to determine whether it matches its
specification and executes in its intended
environment. A software failure occurs when a
piece of software does not perform as required and
expected. In testing, the software is executed with
input data, or test cases, and the output data is
observed. As the complexity and size of software
grows, the time and effort required to do sufficient
testing grows. Manual testing is time consuming,
labor-intensive and error prone. Therefore it is
pressing to automate the testing effort. The testing
effort can be divided into three parts: test case
generation, test execution, and test evaluation.
However, the problem that has received the highest
attention is test-case selection. A test case is the
triplet [S, I, O] where I is the data input to the
system, S is the state of the system at which the
data is input, and O is the expected output of the
system [17]. The output data produced by the
execution of the software with a particular test case
provides a specification of the actual program
behavior. Test case generation in practice is still
performed manually most of the time, since

automatic test case generation approaches require
formal or semi-formal specification to select test
case to detect faults in the code implementation.

Code based testing not an entirely satisfactory
approach to generate guarantee acceptably thorough
testing of modern software products. Source code is
no longer the single source for selecting test cases,
and nowadays, we can apply testing techniques all
along the development process, by basing test
selection on different pre-code artifacts, such as
requirements, specifications and design models
[2],[3]. Such a model may be generated from a
formal specification [7, 14] or may be designed by
software engineers through diagrammatic tools
[15]. Code based testing has two important
disadvantages. First, certain aspects of behavior of
a system are difficult to extract from code but are
easily obtained from design models. The state
based behavior captured in a state diagram and
message paths are simple examples of this. It is
very difficult to extract the state model of a class
from its code. On the other hand, it is usually
explicitly available in the design model. Similarly,
all different sequences in which messages may be
interchanged among classes during the use of a
software is very difficult to extract from the code,
but is explicitly available in the UML sequence
diagrams. Another prominent disadvantage of code
based testing is very difficult to automate and code
based testing overwhelmingly depends on manual
test case design.

Journal of Theoretical and Applied Information Technology

www.jatit.org

An alternative approach is to generate test cases
from requirements and specifications. These test
cases are derived from analysis and design stage
itself. Test case generation from design
specifications has the added advantage of allowing
test cases to be available early in the software
development cycle, thereby making test planning
more effective. Model based testing (MBT), as
implied by the name itself, is the generation of test
cases and evaluation of test results based on design
and analysis models. This type of testing is in
contrast to the traditional approach that is based
solely on analysis of code and requirements
specification. In traditional approaches to software
testing, there are specific methodologies to select
test cases based on the source code of the program
to be tested. Test case design from the requirements
specification is a black box approach [14], where as
code-based testing is typically referred to as white
box testing. Model based testing, on the other hand
is referred to as the gray box testing approach.

Modern software products are often large and
exhibit very complex behavior. The Object-oriented
(OO) paradigm offers several benefits, such as
encapsulation, abstraction, and reusability to
improve the quality of software. However, at the
same time, OO features also introduce new
challenges for testers: interactions between objects
may give rise to subtle errors that could be hard to
detect. Object-oriented environment for design and
implementation of software brings about new issues
in software testing. This is because the above
important features of an object oriented program
create several testing problems and bug hazards [3].
Last decade has witnessed a very slow but steady
advancement made to the testing of object-oriented
systems. One of the main problems in testing
object-oriented programs is test case selection.
Models being simplified representations of systems
are more easily amenable for use in automated test
case generation. Automation of software
development and testing activities on the basis of
models can result in significant reductions in fault-
removal, development time and the overall cost
overheads.
The concept of model-based testing was originally
derived from hardware testing, mainly in the
telecommunications and avionics industries. Of
late, the use of MBT has spread to a wide variety of
software product domains. The practical
applications of MBT are referred to [18]. A model
is a simplified depiction of a real system. It
describes a system from a certain viewpoint. Two
different models of the same system may appear
entirely different since they describe the system
from different perspectives. For example control

flow, data flow, module dependencies and program
dependency graphs express very different aspects
of the behavior of an implementation. A wide range
of model types using a variety of specification
formats, notations and languages ,such as UML,
state diagrams, data flow diagrams, control flow
diagrams, decision table, decision tree etc, have
been established. We can roughly classify these
models into formal, semiformal and informal
models. Formal models have been constructed
using mathematical techniques such as theory,
calculus, logic, state machines, markov chains,
petrinets etc. Formal models have been successfully
used to automatically generate test cases. However,
at present formal models are very rarely constructed
in industry. Most of the models of software systems
constructed in industry are semiformal in nature. A
possible reason for this may be that the formal
models are very hard to construct. Our focus
therefore in this paper is the use of semiformal
models in testing object-oriented systems.
Pretschner et al. [3] present a detailed discussion
reviewing model based test generators. Barsel et al.
[20] study the relationship between model and
implementation coverage. The studies by
Heimadahl and George[19] indicate that different
test suites with the same coverage may detect
fundamentally different number of errors.
This paper has been organized as follows. The next
section presents an overview of various models
used in object-oriented software testing. The key
activities in an MBT process are discussed in
section 3. Section 4 discusses the key benefits and
pitfall of MBT. Section 5 focuses use of model-
based testing in the present and future of software
engineering. Section 6 concludes the paper.

2. MODELS USED IN SOFTWARE

TESTING
In this section, we briefly review the important
software models that have been used in object-
oriented software testing.

2.1 UML Based Testing

Unified modeling language (UML) has over the last
decade turned out to be immensely popular in both
industry and academics and has been very widely
used for model based testing. Since being reported
in 1997, UML has undergone successive
refinements. UML 2.0, the latest release of UML
allows a designer to model a system using a set of
nine diagrams to capture five views of the system.
The use case model is the user’s view of the
system. A static /structural view (i.e. class diagram)

Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.

www.jatit.org

is used to model the structural aspects of the
system. The behavioral views depict various types
of behavior of a system. For example, the state
charts are used to describe the state based behavior
of a system. The sequence and collaboration
diagrams are used to describe the interactions that
occur among various objects of a system during the
operation of the system. The activity diagram
represents the sequence, concurrency, and
synchronization of various activities performed by
the system. Behavioral models are very important
in test case design, since most of the testing detect
bugs that manifest during specific run of the
software i.e. during a specific behavior of the
software. Besides the behavioral models, it is
possible to construct the implementation and
environmental views of the system. The object
constraint language (OCL) makes it possible to
have precise models.
The work reported in [1-3, 5, 8] discuss various
aspects of UML-based model testing. A vast
majority of work examining MBT of object –
oriented systems focuses on the use of either class
or state diagrams. Both these categories of work
overwhelmingly address unit testing. Class
diagrams provide information about public
interfaces of classes, method signatures, and the
various types of relationships among classes. The
state diagram-based testing focuses on making the
objects all possible states and undertake all possible
transitions. Several work reported recently address
use of sequence diagrams, activity diagrams and
collaboration diagrams in testing [9].

2.2 Finite State Machines

FSM (Finite State machines) have been used since
long to capture the state –based behavior of
systems. Finite state machines (also known as finite
automata) have been around even before the
inception of software engineering. There is a stable
and mature theory of computing at the center of
which are finite state machines and other variations.
Using finite state models in the design and testing
of computer hardware components has been long
established and is considered a standard practice
today. [13] was one of the earliest, generally
available articles addressing the use of finite state
models to design and test software components.
Finite state models are an obvious fit with software
testing where testers deal with the chore of
constructing input sequences to supply as test data;
state machines (directed graphs) are ideal models
for describing sequences of inputs. This, combined
with a wealth of graph traversal algorithms, makes

generating tests less of a burden than manual
testing. On the downside, complex software implies
large state machines, which are nontrivial to
construct and maintain. However, FSMs being flat
representations are handicapped by the state
explosion problem. State charts are an extension of
FSMs that has been proposed specifically to
address the shortcomings of FSMs [13].State charts
are hierarchical models. Each state of a state chart
may consist of lower-level state machines.
Moreover they support specifications of state-level
concurrency. Testing using state charts has been
discussed in[21].

2.2 Markov Chains

Markov chains are stochastic models [24]. A
specific class of Markov chains, the discrete-
parameter, finite-state, time-homogenous,
irreducible Markov chain, has been used to model
the usage of software. They are structurally similar
to finite state machines and can be thought of as
probabilistic automata. Their primary worth has
been, not only in generating tests, but also in
gathering and analyzing failure data to estimate
such measures as reliability and mean time to
failure. The body of literature on Markov chains in
testing is substantial and not always easy reading.
Work on testing particular systems can be found in
[22] and [23].

2.2 Grammars

Grammars have mostly been used to describe the
syntax of programming and other input languages.
Functionally speaking, different classes of
grammars are equivalent to different forms of state
machines. Sometimes, they are much easier and
more compact representation for modeling certain
systems such as parsers. Although they require
some training, they are, thereafter, generally easy to
write, review, and maintain. However, they may
present some concerns when it comes to generating
tests and defining coverage criteria, areas where not
many articles have been published.

3. A TYPICAL MODEL-BASED TESTING

PROCESS
In this section, we discuss the different activities
constituting a typical MBT process.Fig.1 displays
the main activities in a life cycle of a MBT process
.the rectangles in Fig. 1 represent specific artifacts
developed used during MBT. The ovals represent
activities processes during MBT.

Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.

www.jatit.org

Figure 1. A Typical Model Based Testing Process

3.1 Construction of intermediate model

Several strategies have been reported to generate
test cases using a variety of models. However in
many cases the test cases based on more than one
model type. In such cases ,it becomes necessary to
first construct an integrated model based on the
information present in different models.

3.2 Generation of test scenarios

The test cases generated from models are in form of
sequences of test scenarios. Test scenarios specify a
high level test case rather than the exact data to be
input to the system. For example, in the case of
FSMs, it can be the sequence in which specifies
states and transitions must be undertaken to test the
system-called a transition path. The sequences of
different transition labels along the generated paths
form the required test scenarios. Similarly from
the sequence diagrams the message paths can be
generated. The exact sequence messages in which
the classes must interact for testing the system is
shown.

3.3 Test Generation

The difficulty of generating tests from a model
depends on the nature of the model. Models that are
useful for testing usually possess properties that
make test generation effortless and, frequently,

automatable. For some models, all that is required
is to go through combinations of conditions
described in the model, requiring simple knowledge
of combinatory. There are a variety of constraints
on what constitutes a path to meet the criteria for

tests. It includes having the path start and end in the
starting state, restricting the number of loops or
cycles in a path, and restricting the states that a path
can visit.

3.4 Automatic test case execution

In certain cases the tests can even be performed
manually. Manual testing is labor-intensive and
time consuming. However, the generated test suite
is usually too large for a manual execution.
Moreover, a key point in MBT is the frequent
regeneration and re-running of the test suite
whenever the underlying model is changed.
Accordingly achieving the full potential of MBT
requires automated test execution. Usually, using
the available testing interface for the software, the
abstract test suite is translated into an executable
test script. Automatic test case execution also
involves test coverage analysis. Based on the test
coverage analysis, the tests generation step may be
fine tuned or different strategies may be tried out.

Software
Model(s)

Test
Scenarios

Transform
Intermediate

Testing
Representation

Test Case
Generator

Coverage
Criteria
Analysis

Test Results
Test
Execution

Test Data Test Cases

Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.

www.jatit.org

3.5 Test Coverage Analysis

Each test generation method targets certain specific
features of the system to be tested. The extent to
which the targetted features are tested can be
determined using test coverage analysis[10,12]. The
important coverage analysis based on a model can
be the following: all model parts(or test
scenarios)coverage is achieved when the test
reaches every part in the model at least once.
Important test coverage required based on UML
models can be the following: path coverage,
message path coverage, transition path coverage,
scenario coverage, dataflow coverage, polymorphic
coverage, inheritance coverage. Scenarios coverage
is achieved when the test executes every scenario
identifiable in the model at least once.

4. A CRITIQUE OF MBT

Some important MBT advantages can be
summarized in the following points. It allows
achieving higher test coverage. This is especially
true of certain behavioral aspects which are difficult
to identify in the code. Another important
advantage of model–based testing is that when a
code change occurs to fix a coding error, the test
cases generated from the model need not change.
As an example, changing the behavior of a single
control in the user interface of the software makes
all the test cases using that control outdated. In
traditional testing scenarios, the tester has to
manually search the affected test cases and update
them. As even when code changes, the changed
code still confirms to the model. Model based test
suite generation often overcomes this problem.
However MBT does have certain restrictions and
limitations. Needless to say, as with several other
approaches, to reap the most benefit from MBT,
substantial investment needs to be made. Skills,
time, and other resources need to be allocated for
making preparations, overcoming common
difficulties, and working around the major
drawbacks. Therefore, before embarking on a MBT
endeavor, this overhead needs to be weighed
against potential rewards in order to determine
whether a model-based technique is sensible to the
task at hand.
MBT demands certain skills of testers. They need
to be familiar with the model and its underlying and
supporting mathematics and theories. In the case of
finite state models, this means a working
knowledge of the various forms of finite state
machines and a basic familiarity with formal
languages, automata theory, and perhaps graph
theory and elementary statistics. They need to

possess expertise in tools, scripts, and programming
languages necessary for various tasks. For example,
in order to simulate human user input, testers need
to write simulation scripts in a specialized
language.
In order to save resources at various stages of the
testing process, MBT requires sizeable initial effort.
Selecting the type of model, partitioning system
functionality into multiple parts of a model, and
finally building the model are all labor-intensive
tasks that can become prohibitive in magnitude
without a combination of careful planning, good
tools, and expert support. Finally, there are
drawbacks of models that cannot be completely
avoided, and workarounds need to be devised. The
most prominent problem for state models (and most
other similar models) is state space explosion.
Briefly, models of almost any non-trivial software
functionality can grow beyond management even
with tool support. State explosion propagates into
almost all other model-based tasks such as model
maintenance, checking and review, non-random test
case generation, and achieving coverage criteria.
The generated test cases may in many cases get
irrevalent due to the disparity between a model
and its corresponding code.MBT can never
displace code based testing, since models
constructed during the development process lack
several details of implementation that are required
to generate test cases.
Fortunately, many of these problems can be
resolved one way or the other with some basic skill
and organization. Alternative styles of testing need
to be considered where insurmountable problems
that prevent productivity are encountered.

5. MBT IN SOFTWARE ENGINEERING:

TODAY AND TOMORROW

Good software testers cannot avoid models. MBT
calls for explicit definition of the testing endeavor.
However, software testers of today have a difficult
time planning such a modeling effort. They are
victims of the ad hoc model, either in advance or
throughout the nature of the development process
where requirements change drastically and the rule
of the day is constant ship mode. Today, the scene
seems to be changing. Modeling in general seems
to be gaining favor; particularly in domains where
quality is essential and less-than-adequate software
is not an option. When modeling occurs as a part of
the specification and design process, these models
can be leveraged to form the basis of MBT.
There is promising future for MBT as software
becomes even more ubiquitous and quality

Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.

www.jatit.org

becomes the only distinguishing factor between
brands. When all vendors have the same features,
the same ship schedules and the same
interoperability, the only reason to buy one product
over another is quality. MBT, of course, cannot and
will not guarantee or even assure quality. However,
its very nature, thinking through uses and test
scenarios in advance while still allowing for the
addition of new insights, makes it a natural choice
for testers concerned about completeness,
effectiveness and efficiency.
The real work that remains for the near future is
fitting specific models (finite state machines,
grammars or language-based models) to specific
application domains. Perhaps, special purpose
models will be made to satisfy very specific testing
requirements and models that are more general will
be composed from any number of pre-built special-
purpose models. However, to achieve these goals,
models must evolve from mental understanding to
artifacts formatted to achieve readability and
reusability. We must form an understanding of how
we are testing and be able to sufficiently
communicate that understanding so that testing
insight can be encapsulated as a model for any and
all to benefit from.
6. CONCLUSION
Good software testers cannot avoid models. MBT
has emerged as a useful and efficient testing
method for realizing adequate test coverage of
systems. The usage of MBT reveals substantial
benefit in terms of increase productivity and
reduced development time and costs. On the other
hand MBT can’t replace code based testing since
models are abstract higher level representations and
lack of several details present in the code. It is
expected that in future models shall be constructed
by extracting relevant information both from the
design which can automate the test case design
process to a great deal.
Not surprisingly, there are no software models
today that fit all intents and purposes.
Consequently, for each situation decisions need to
be made as to what model (or collection of models)
are most suitable. There are some guidelines to be
considered that are derived from earlier
experiences. The choice of a model also depends on
aspects of the system under test and skills of user.
However, there is little or no data published that
conclusively suggests that one model outstands
others when more than one model is intuitively
appropriate.

REFRENCES:

[1]. W. Prenninger, A. Pretschner, Abstractions for

Model-Based Testing, ENTCS 116 (2005) 59–
71.

[2]. A. Pretschner, J. Philipps, Methodological
Issuesin Model-Based Testing, in: [29], 2005,
pp. 281–291.

[3]. J. Philipps, A. Pretschner, O. Slotosch,E.
Aiglstorfer, S. Kriebel, K. Scholl, Model based
test case generation for smart cards, in:Proc.
8th Intl. Workshop on Formal Meth. For
Industrial Critical Syst., 2003, pp. 168–192.

[4]. G. Walton, J. Poore, Generating transition
probabilities to support model-based software
testing,Software: Practice and Experience 30
(10) (2000) 1095–1106.

[5]. A. Pretschner, O. Slotosch, E. Aiglstorfer,S.
Kriebel, Model based testing for real–the
inhouse card case study, J. Software Tools for
Technology Transfer 5 (2-3) (2004) 140–157.

[6]. A. Pretschner, W. Prenninger, S. Wagner, C.
K¨uhnel, M. Baumgartner, B. Sostawa, R.
Z¨olch, T. Stauner, One evaluation of model
based testing and its automation, in: Proc.
ICSE’05, 2005, pp. 392–401.

[7]. E. Bernard, B. Legeard, X. Luck, F. Peureux,
Generation of test sequences from formal
specifications:GSM 11.11 standard case-study,
SW Practice and Experience 34 (10) (2004)
915 – 948.

[8]. E. Farchi, A. Hartman, S. S. Pinter, Using a
model-based test generator to test for standard
conformance, IBM Systems Journal 41 (1)
(2002) 89–110.

[9]. D. Lee, M. Yannakakis, Principles and
methods of testing finite state machines — A
survey, Proceedings of the IEEE 84 (2) (1996)
1090–1126.

[10]. H. Zhu, P. Hall, J. May, Software Unit Test
Coverage and Adequacy, ACM Computing
Surveys 29 (4) (1997) 366–427.

[11]. B. Beizer, Black-Box Testing : Techniques for
Functional Testing of Software and Systems,
Wiley, 1995.

[12]. C. Gaston, D. Seifert, Evaluating Coverage-
Based Testing, in: [29], 2005, pp. 293–322.

[13]. A. Offutt, S. Liu, A. Abdurazik, P. Ammann,
Generating test data from state-based
specifications,J. Software Testing, verification
and Reliability 13 (1) (2003) 25–53.

[14]. A. Pretschner, Model-Based Testing in
Practice,in: Proc. Formal Methods, Vol. 3582
of SpringerLNCS, 2005, pp. 537–541.

Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.

www.jatit.org

[15]. R. V. Binder, Testing Object-Oriented
Systems:Models, Patterns, and Tools,
Addison-Wesley,1999.

[16]. R. Helm, I. M. Holland, and D.
Gangopadhyay.Contracts: specifying
behavioral compositions in object-oriented
systems. In Proceedings of the 5th Annual
Conference on Object-OrientedProgramming
Systems, Languages, and Applications
(OOPSLA ’ 90), ACM SIGPLAN Notices,
25(10):169–180, 1990.

[17]. R. Mall, Fundamentals of Software
Engineering, Second ed., Prentice-Hall,
Englewood Cliffs, NJ, 2003.

[18]. Ilan Gronau.Alan Hartman.Andrei
Kirshin.Kenneth Nagin and Sergey
Olvovsky.A methodology and architecture for
automated software testing.Haifa technical
report IBM Research Laborotory.MATAM
,Advanced Technology Center, Haifa
31905,Israel.2000.

[19]. M.Heimadahl and D.George, “Test suite
Reduction for Model Based Tests:Effects on
Test Quality and Implecations for testing” In
proceedings of the 19th International
Conference on Automated Software
Engineering pp.176-185,2004.

[20]. A.Baresel,M.Conrad, S.Sadeghipour and
j.wegener.” the interplay between model
coverage and code coverage” in Eurocast,
DEC2003.

[21]. D.harel”Statecharts:A visual formalism for
complex systems science of computer
programming,8(3):231-274,1987.

[22]. K. Agrawal and James A. Whittaker.
Experiences in applying statistical testing to a
real-time, embedded.software system.
Proceedings of the Pacific Northwest Software
Quality Conference, October 1993.

[23]. Alberto Avritzer and Brian Larson. Load
testing software using deterministic state
testing.” Proceedings of the 1993 International
Symposium on Software Testing and Analysis
(ISSTA 1993), pp. 82-88, ACM, Cambridge,
MA, USA, 1993.

[24]. J. G. Kemeny and J. L. Snell. Finite Markov
chains. Springer-Verlag, New York 1976.

BIOGRAPHY: (Optional)

Santosh Kumar Swain is
presently working as
teaching faculty in School of
Computer Engineering,
KIIT University, KIIT,
Bhubaneswar, Orissa, India.
He has acquired his M.Tech
degree from Utkal
University, Bhubaneswar. He

has contributed more than four papers to Journals
and Proceedings. He has written one book on
“Fundamentals of Computer and Programming in
C”. He is a research student of KIIT University,
Bhubaneswar. His interests are in Software
Engineering, Object Oriented Systems, Sensor
Network and Compiler Design etc.

Dr. Durga Prasad Mohapatra
studied his M.Tech at National
Institute of Technology,
Rourkela, India. He has
received his Ph. D from Indian
Institute of Technology,
Kharagpur,India. Currently, he
is working as Associate
Professor at National Institute

of Technology, Rourkela. His special fields of
interest include Software Engineering, Discrete
Mathematical Structure, slicing Object-Oriented
Programming. Real-time Systems and distributed
computing.

Journal of Biomedical Informatics 43 (2010) 782–790

Contents lists available at ScienceDirect

Journal of Biomedical Informatics

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / y j b i n

Complementary methods of system usability evaluation: Surveys and observations
during software design and development cycles

Jan Horsky a,b,c,*, Kerry McColgan a, Justine E. Pang a, Andrea J. Melnikas a, Jeffrey A. Linder a,b,c,
Jeffrey L. Schnipper a,b,c, Blackford Middleton a,b,c

a Clinical Informatics Research and Development, Partners HealthCare, Boston, USA
b Division of General Medicine and Primary Care, Brigham and Women’s Hospital, Boston, USA
c Harvard Medical School, Boston, USA

a r t i c l e i n f o a b s t r a c t

Article history:
Received 11 December 2009
Available online 26 May 2010

Keywords:
Health information technology
Clinical information systems
Usability evaluations
Design and development
Adoption of HIT

1532-0464/$ – see front matter � 2010 Elsevier Inc. A
doi:10.1016/j.jbi.2010.05.010

* Corresponding author at: Clinical Informatics
Partners Healthcare, 93 Worcester St., Wellesley, MA
8771.

E-mail address: jhorsky@partners.org (J. Horsky).

Poor usability of clinical information systems delays their adoption by clinicians and limits potential
improvements to the efficiency and safety of care. Recurring usability evaluations are therefore, integral
to the system design process. We compared four methods employed during the development of outpa-
tient clinical documentation software: clinician email response, online survey, observations and inter-
views. Results suggest that no single method identifies all or most problems. Rather, each approach is
optimal for evaluations at a different stage of design and characterizes different usability aspect. Email
responses elicited from clinicians and surveys report mostly technical, biomedical, terminology and con-
trol problems and are most effective when a working prototype has been completed. Observations of clin-
ical work and interviews inform conceptual and workflow-related problems and are best performed early
in the cycle. Appropriate use of these methods consistently during development may significantly
improve system usability and contribute to higher adoption rates among clinicians and to improved qual-
ity of care.

1. Introduction

There is a broad consensus among healthcare researchers, prac-
titioners and administrators that although health information
technology has the potential to reduce the risk of serious injury
to patients in hospitals, significant differences remain among the
multitude of electronic health record (EHR) systems with respect
to their ability to achieve high safety, quality and effectiveness
benchmarks [1–4]. In many instances, the intrinsic potential of
EHRs for preventing and mitigating errors continues to be only par-
tially realized and some implementations may, paradoxically, ex-
pose clinicians to new risks or add extra time to many routine
interactions [5,6].

Research evidence and published reports on the successes, fail-
ures, best-practices, lessons learned and barriers overcome during
implementation efforts have had only limited effect so far on accel-
erating the adoption of electronic information systems [7]. Accord-
ing to conservative estimates, at least 40% of systems either are
abandoned or fail to meet business requirements, and fewer than

ll rights reserved.

Research and Development,
02481, USA. Fax: +1 781 416

40% of large vendor systems meet their stated goals [8]. A recent
national study reported that only four percent of physicians used
a fully functional, advanced system and that 13% used systems
with only basic functions [9].

Transition from paper records to electronic means of informa-
tion management is an arduous process at large institutions and
private practices alike. It introduces new standards and reshapes
familiar practices often in ways unintended or unanticipated by
the stakeholders. Clinicians object to forced changes in established
workflows and familiar practices, long training times, and exces-
sive time spent serving the computer rather than providing care
[10,11].

Although the initial decline in efficiency generally improves
with increased skills and sufficient time to adjust to new routines
[12], systems themselves rarely evolve to better meet the demands
and requirements of the clinical processes they need to support. A
recent survey found an increase in the availability of EHRs over two
years in one state, but the researchers also reported that routine
use of ten core functions remained relatively low, with more than
one out of five physicians not using each available function regu-
larly [13]. An observational study of 88 primary care physicians
identified key information management goals, strategies, and tasks
in ambulatory practice and found that nearly half were not fully
supported by available information technology [14].

http://dx.doi.org/10.1016/j.jbi.2010.05.010

mailto:jhorsky@partners.org

http://dx.doi.org/10.1016/j.jbi.2010.05.010

http://www.sciencedirect.com/science/journal/15320464

http://www.elsevier.com/locate/yjbin

J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790 783

Developing highly functional, versatile clinical information sys-
tems that can be efficiently and conveniently used without exten-
sive training periods is predicated on incorporating rigorous and
frequent usability evaluations into the design process. Iterative
development methodology for graphical interfaces suggests evalu-
ating and revising successive prototypes in a cyclical fashion until
the product attains required characteristics. There are several com-
mon techniques that can be used to perform the evaluations that
are either carried out entirely by usability experts or involve the in-
put of intended users. Equally important is to see usability evalua-
tion as situated within the context of challenges imposed by
complex socio-technical systems [15] and within broader concep-
tual frameworks for design and evaluation such as those based on
the theory of distributed cognition and work-centered research
[16].

The broad objective of this study was to compare data gathered
by four usability evaluation methods and discuss their respective
utility at different stages of the software development process.
We hypothesized that no single method would be equally effective
in characterizing every aspect of the interface and human interac-
tion. Rather, an approach that employs a set of complementary
methods would increase their cumulative explanatory value by
applying them selectively for specific purposes. Our narrower goal
was to formulate recommendations for designers and evaluators of
health information systems on the effective use of common usabil-
ity inspection methods during the design and development cycle.

This report expands a brief discussion of methods used in the
design, pilot testing, and evaluation of the Smart Form in a previ-
ous publication [17].

2. Background

The reasons why one system may be preferred over another by
clinicians and perform closer to expectations are often complex,
vary with local conditions and almost always include financing,
leadership, prior experience and training. Among the core predic-
tors of quick adoption and successful implementation are the de-
sign quality of the graphical user interface and functionality,
along with socio-technical factors [7]. Usability has a strong, often
direct relationship with clinical productivity, error rate, user fati-
gue and user satisfaction that are critical for adoption. The system
must be fast and easy to use, and the user interface must behave
consistently in all situations [18]. At the same time, the system
must support well all relevant clinical tasks so that a clinician
working with the computer can achieve higher quality of care.
The Healthcare Information and Management Systems Society
(HIMSS) considers poor usability characteristics of current infor-
mation technology as one of the major factors, and ‘‘possibly the
most important factor” hindering its widespread adoption [19].

Historically, developers and designers have failed to tap the
experiential expertise of practicing clinicians [20]. The lack of a
systematic consideration of how clinical and computing tasks are
performed in the situational context of different clinical environ-
ments often results in designs that are off the intended mark and
fail to deliver improvements in safety and efficiency. For example,
in an experiment that examined the interactive behavior of clini-
cians entering a visit note, researchers compared the sequence
and flow of items on an electronic note form that was implied by
the designed structure to actual mouse movements and entry se-
quences recorded by a tracking software and found substantial dif-
ference between the observed behavior and prior assumptions by
the designers [21].

Existing usability studies mainly employ research designs such
as expert inspection, simulated experiments, and self-reported
user satisfaction surveys. Unfortunately, a large body of research

indicates that self-reports can be a highly unreliable source of data,
often context-dependent, and even minor changes in question
wording, format or order can profoundly affect the obtained results
[22].

While analyses that rely predominantly on a single method may
produce incomplete or unreliable results, there is considerable evi-
dence of the effectiveness of comprehensive approaches that com-
bine two or more methods, as important redesign ideas rarely
emerge as sudden insights but may evolve throughout the work
process [23,24]. For example, during the development of a decision
support system, designers employed field observations, structured
interviews, and document analyses to collect and analyze users’
workflow patterns, decision support goals, and preferences regard-
ing interactions with the system, performed think-aloud analyses
and used the technology acceptance model to direct evaluation
of users’ perceptions of the prototype [25]. A careful workflow
analysis could lead to the identification of potential breakdown
points, such as vulnerabilities in hand-offs, and communication
tasks deemed critical could be required to have a traceable elec-
tronic receipt acknowledgment [26]. The advantage of informing
the design from its conception with close insights into local needs
and actual practices the software will support is reflected in the
fact that ‘‘home-grown” systems show a higher relative risk reduc-
tion than commercial systems [1].

Iterative development of user interfaces involves the steady
refinement of the design based on user testing and other evalua-
tion methods [27]. The complexity and variability of clinical work
requires correspondingly complex information systems that are
virtually impossible to design without usability problems in a sin-
gle attempt. Experts need to create a situation in which clinicians
can instill their knowledge and concern into the design process
from the very beginning [28]. Changing or redesigning a software
system as complex as an EHR after it has been developed (or imple-
mented) is enormously difficult, error-prone, and expensive
[29,30]. Iterative evaluations early in the process allow larger con-
ceptual revisions and refinements to be done without excessive ef-
fort and resources [31].

The software developed, tested and deployed in a pilot program
in this study, the Coronary Artery Disease (CAD) and Diabetes Mel-
litus (DM) Smart Form (Fig. 1), was a prototype of an application
intended to assist clinicians with documenting and managing the
care of patients with chronic diseases [17]. Integrated within an
outpatient electronic record, it allowed direct access to laboratory
and other coded data for expedient entry into new visit notes. The
Smart Form also aggregated reviewing of prior notes and labora-
tory results to create disease-relevant context for the planning of
care, and provided actionable decision support and best-practices
recommendations. The anticipated benefit to clinicians includes
savings in time required to look up, collect, interpret and record
clinical data into a note, and an increase in the quality and com-
pleteness of documentation that may contribute to improved pa-
tient care.

In the planning stage of the development, two experts, includ-
ing a physician, conducted focus groups with approximately 25
physicians who described their usual workflows, methods for
acute and chronic disease management, attitudes towards decision
support, and their wants and needs, and summarized emerging
themes [17].

3. Methods

We have conducted four different studies of usability and hu-
man–computer interaction that were intended to collect two types
of data: comments elicited directly from clinicians working with
the Smart Form, and findings derived from formal evaluations by

Fig. 1. Screenshot of Smart Form.

784 J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790

usability experts. We rigorously maintained distinctions between
direct, free-style comments made by clinicians and objective find-
ings by usability experts. Comments were always direct expres-
sions of clinicians that originated either spontaneously or in
response to a question, written or verbal. Findings, on the other
hand, were expert opinions and recommendations based on field
notes, interviews, focus groups and on direct observation of clini-
cians interacting with the Smart Form.

The reason why we chose to count and compare comments and
findings instead of actual problems is the uncertainty in determin-
ing whether any two or more user reports describe identical prob-
lems, as comments may sometimes be vague, too general or
without the proper context to match them to unique problems.
Since we could not differentiate all problems in a consistent man-
ner, we decided to report the comments and findings themselves
as approximations to actual problems.

In the first study, clinicians sent their comments by email dur-
ing a 3-month pilot period in which they used the module for the
documentation of actual visits. Another set of comments, in the
second study, were entered in an online survey at the end of
the pilot. We also extracted direct quotes of clinicians from tran-
scripts of interviews and think-aloud protocols that were com-
pleted as parts of usability evaluation in the remaining two
studies. The findings, in contrast, were formulated entirely by
usability experts as the result of a series of evaluation studies
(third and fourth) and published in technical reports.

Each comment and a finding were assigned to a usability heu-
ristic category independently by two researchers. The classification
scheme was specific to the healthcare domain and its development
is described in detail in a section below. The number of comments
and findings in each category was compared to assess the descrip-
tive power of each data collection method for specific usability
characteristic. For example, we would contrast the different pro-
portion of comments from each source that contributed to the total
number of observations in each category.

The four data collection methods are described in detail below.
Think-aloud studies were conducted by a usability expert at our

institution and walkthroughs and evaluations by independent pro-
fessional evaluators on contract basis.

3.1. Email via an embedded link

The Smart Form was integrated within the outpatient clinical
records system and used by 18 clinicians for 3-months (March to
May, 2006) in the course of their regular clinical work to write visit
notes for patients with coronary artery disease and diabetes. They
had the option of opening a free-text window on their desktops at
any time by clicking on a link embedded in the application and
typing in their comments. The messages were collected in a data-
base and logged with a timestamp and the sender’s name.

3.2. Online survey

Fifteen participants received an email with a link to an online
survey in May 2006. Questions about satisfaction, frequency of
use and problems had multiple-choice responses and were accom-
panied by two open-ended questions, ‘‘What changes could be
made to the Smart Form that would make you more likely to use
it?” and ‘‘What improvements can be made to the Smart Form be-
fore you would recommend it to other clinicians?” Completion was
voluntary and rewarded with a $20 gift certificate.

3.3. Think-aloud study and observations

We recruited six primary care physicians and specialists (four
women) to participate in usability and interaction studies. Evalua-
tions were conducted in the clinicians’ offices at six different clinics
and lasted 30–45 min. Subjects were asked to complete a series of
interactive tasks described in a previously developed clinical sce-
nario. A researcher played the role of a patient during each session
to provide a realistic representation of an office visit. Medical his-
tory, current medications and the presence of diabetes and CAD
were included in a narrative paragraph that was accompanied by

J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790 785

supporting electronic documentation of prior visits, lab results, vi-
tals and demographic information in a simulated patient record.

Subjects were instructed to verbalize their thoughts (to think-
aloud) as they were completing the tasks and interacting with
the Smart Form. Video and audio recordings of each session were
made with Morae [32] usability evaluation software installed on
portable computers. The verbal content was transcribed for analy-
sis to be used together with the resulting screen captures. In a
debriefing period after completion, subjects were asked follow-
up questions to elaborate or elucidate their actions and reasoning.
The results of this study were compiled in a technical report.

3.4. Walkthroughs, expert evaluations and interviews

A team of professional health informatics consultants carried
out independently usability assessment and walkthroughs and
conducted interviews with six primary care physicians and special-
ists (two women) whose experience with the application ranged
from novice to expert. The results of the evaluation were presented
in a technical report.

3.5. The development of heuristic usability assessment scheme

Four sets of usability heuristics with a substantial theoretical
overlap have been generally accepted and are widely used in pro-
fessional evaluations: Nielsen’s 10 usability heuristics [33] (de-
rived from the results of a factor analysis of about 250
problems), Shneiderman’s Eight Golden Rules of Interface Design
[34], Tognazzini’s First Principles of Interaction Design [35], and
a set of principles based on Edward Tufte’s visual display work
[36]. These approaches were recently integrated into a single Mul-
tiple Heuristics Evaluation Table by identifying overlaps and com-
bining conceptually related items [37].

These general heuristics sets have been used to evaluate health-
care-related applications [38–41] and consumer-health websites
[42]. A set of aggregated Nielsen’s and Schneiderman’s heuristics
was proposed by Zhang and colleagues [43] for HIT and applied
to the evaluation of an infusion pump [44] and a clinical web appli-
cation [45]. However, the categories and guidelines do not specifi-
cally address biomedical or clinical concepts. Our goal was to
formulate additional categories to increase their cumulative
explanatory power.

To this end we analyzed all 155 statements about usability
problems collected during the study to identify emergent themes
following the grounded theory principles [46]. Two researchers
then independently assigned the statements into heuristic catego-
ries, either general or modified according to newly identified
themes. Several iterative coding sessions and discussions ensued,
and as a result of extensive comparison and refinement, 12 heuris-
tic categories were formulated (Table 3).

Table 1
Comments by heuristic category and source.

Heuristic
category

Email N
(%)

Survey N
(%)

Evaluation
N (%)

Interview
N (%)

Totals N
(%)

Biomedical 21 (81) 0 1 (4) 4 (15) 26 (17)
Cognition 12 (46) 3 (12) 4 (15) 7 (27) 26 (17)
Control 17 (61) 4 (14) 5 (18) 2 (7) 28 (18)
Customization 7 (29) 5 (28) 1 (6) 5 (28) 18 (12)

3.6. Participants

All data were collected from 45 clinicians within Partners
Healthcare practice network who participated in either part of
the study (with a small overlap). Most were primary care physi-
cians (73%), about half were female (53%), and the mean age of
the group was 48 years.

Fault 16 (94) 1 (6) 0 0 17 (11)
Speed 3 (43) 3 (43) 1 (14) 0 7 (5)
Terminology 4

(100)

0 0 0 4 (3)
Transparency 4 (36) 1 (9) 6 (55) 0 11 (7)
Workflow 1 (6) 3 (17) 8 (44) 6 (33) 18 (12)
Totals 85 (55) 20 (13) 26 (17) 24 (15) 155

(100)

4. Results

Analyses were performed separately on comments by clinicians
and on findings by usability experts. Results are presented in the
following sections and contrasted.

4.1. Comments by clinicians

Results for comments are summarized in Table 1. There were
155 comments from 36 clinicians obtained either in the form of
written communication (email and survey) or transcribed from di-
rect verbal quotes (interview and evaluation). We received 85
emails from nine clinicians (reflecting a 50% response rate), and
20 free-text comments were entered in the online survey by 15 cli-
nicians (54% response). Six clinicians who participated in usability
evaluations made 26 comments and another six clinicians made 24
distinct comments during interviews.

Over a half of all responses (55%) were emails, and about equal
numbers were obtained from the survey, evaluations and inter-
views (15%, 13% and 17%, respectively). The most common form
of a response that constituted about a third of collected data
(N = 54) was an email classified as either a Biomedical, Control or
Fault category. Comments from the other three sources were most
likely to be classified in the following categories: Customization
and Control for survey (N = 9, 45%), Transparency and Workflow
for evaluations (N = 14, 54%), and Cognition and Workflow for
interviews (N = 13, 54%). Overall, the Control, Cognition and Bio-
medical categories described about a half of all data (52%), and
about a third (35%) was classified in the Customization, Workflow
and Technical categories. There were no Consistency or Context
comments.

Although email was the most prevalent form of communication
in the set, its proportion was different within each heuristic cate-
gory (Fig. 1). For example, it added up to 80% or more in three cat-
egories (Terminology, Fault and Biomedical) and to a majority
(61%) in the Control category, but only one was classified as related
to Workflow. Written response was more likely to be used for the
reporting of technical, biomedical and interaction problems (e.g.,
Fault, Biomedical, Terminology, Control), while verbal comments
often related to Workflow or Transparency difficulties. For exam-
ple, almost 90% of comments made during evaluations were clus-
tered in just four categories and similar distribution was found in
data from interviews.

4.2. Findings by usability evaluators

The results are summarized in Table 2. There were 47 findings
extracted from expert reports. Over two thirds were classified into
just three categories: Cognition, Customization and Workflow. In
contrast, none were in the Fault, Speed or Terminology categories
and only one was classified as Biomedical. Technical and biomed-
ical concepts were generally not represented in the evaluations.

4.3. Comments and findings comparison

We contrasted all 47 findings with a subset of 105 comments
that included only email and survey. Findings were derived from

Table 3
Description of Heuristic Evaluation Categories.

Category Description

Consistency Hierarchy, grouping, dependencies and levels of significance
are visually conveyed by systematically applied appearance
characteristics, perceptual cues, spatial layout, text
formatting and pre-defined color sets. Behavior of controls is
predictable. Language in commands, labels and warnings is
standardized

Transparency The current state is apparent and possible future states are
predictable. Action effects, their closure and failure are
indicated

Control The interruption, resumption and non-linear or parallel task
completion is possible. Direct access to data across levels of
hierarchy, backtracking, recovery from unwanted states and
reversal of actions are possible

Cognition Content avoids extraneous information and excessive
density. Representational formats allow perceptual
judgment and unambiguous interpretation. Cognitive effort
is reduced by minimalistic design, formatting and use of
color, allowing fast visual searches. Recognition is preferred
over recall. Conceptual model corresponds to work context
and environment

Context Terms, labels, symbols and icons are meaningful and
unambiguous in different system states. Alerts and
reminders perceptually distinguish between general
(disease, procedure, guidelines) and patient-specific content

Terminology Medical language is meaningful to users in all contexts of
work, compatible with local variations and established
terms

Biomedical Biomedical knowledge used in rules and decision support is
current and accurate, reflecting guidelines and standards. It
is evident how suggestions are derived from data and what
decision logic is followed

Safety Complex combinations of medication doses, frequencies,
units and time durations are disambiguated by appropriate
representational formats and language, entries are audited
for allowed value limits. Omissions are mitigated by goal
and task completion summary views. Errors are prevented
from accumulating and propagating through the system

Customization Preferred data views, organization, sorting, filtering,
defaults, basic screen layout and behavior are persistent
over use sessions and can be defined individually or
according to role

Fault Software failures and functional errors are minimal, do not
compromise safety and prevent the loss of data

Speed Minimal latency of screen loads and high perceived speed of
task completion

Workflow Navigation, data entry and retrieval does not impede clinical
task completion and the flow of events in the environment

Table 2
Findings by Heuristic Category and Source.

Heuristic category Evaluation N Interview N Total findings N (%)

Biomedical 1 0 1 (2)
Cognition 10 6 16 (34)
Control 2 4 6 (13)
Customization 2 7 9 (19)
Consistency 0 1 1 (2)
Context 1 1 2 (4)
Transparency 5 0 5 (11)
Workflow 7 0 7 (15)
Totals 28 19 47 (100)

786 J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790

reports of evaluation and interviews that already contained rein-
terpreted verbal comments of the subjects. We therefore excluded
comments made during evaluations from the comparison.

Comments and findings showed divergent trends in character-
izing usability aspects of the Smart Form (Fig. 3). Comments were
more likely to describe discrete, clearly manifested and highly spe-
cific problems and events, such as software failures or concerns
about medical logic or language (e.g., Control, Biomedical, Fault,

Terminology). Findings derived from usability evaluation, on the
other hand, tended to explain conceptual problems related to over-
all design and the suitability of the electronic tool to clinical work
(e.g., Consistency, Context, Workflow). Both methods contributed
about equally to the description of problems with human interac-
tion (e.g., Cognition, Customization).

4.4. Implementation of design changes to a revised prototype

Individual comments and findings most often referred to single,
discrete problems. Some problems were reported by several clini-
cians or were identified by multiple methods. The 155 analyzed
comments and findings reported 120 unique problems (77% ratio),
and 12 problems were simultaneously described by more than one
method (10% ratio). We have iteratively implemented design
changes into the prototype on the basis of 56 reported problems
(47%). Most of the problems that led to subsequent changes (34)
were reported by email.

5. Discussion

Our data analysis has identified the relative strengths and
weaknesses of the four evaluation approaches, their distinct utility
and appropriateness for characterizing different usability concepts,
and their cumulative explanatory power as a set of complementary
methods used at specific points of the development lifecycle. The
large number of comments that clinicians provided were a rich
source of reports on software failures, slow performance and po-
tential conflicts and inconsistencies in biomedical content, while
usability experts generally gave comprehensive assessments of
problems related to human interaction and workflow, including
characterizations of problems with interface design and layout that
negatively affect cognitive and perceptual saliency of displayed
information. The core principles, attributes and expected results
for each method are summarized in Table 4 and discussed in depth
in the following sections.

5.1. Email

An email link embedded in the application is available to every-
one and at all times, allowing almost instantaneous reporting of
problems as they occur. Informaticians and computer technology
specialists can learn from these comments how the software per-
forms in authentic work conditions and how well it supports clini-
cians in complex scenarios that commonly arise from the
combination of personal workflows and preferences, unexpected
events, and unusual, idiosyncratic, unplanned or non-standard
interaction patterns. The wide range of conditions that affect per-
formance and contribute to errors and failures would not be possi-
ble to anticipate and simulate in the laboratory. Performance
measures in actual settings also give evidence of the technical
and conceptual strengths of the design. Insights from these reports
give designers a unique opportunity to make the application more
robust and tolerant of atypical interaction, more effective in man-
aging and preventing errors, and more appropriate for the clinical
task it supports.

The large number and variety of email reports and their often
fragmentary content make them often hard to interpret. For exam-
ple, it is difficult for clinicians to recall accurately the relevant and
descriptive details of errors that were made or problems that were
encountered during complex interactions with multi-step or inter-
leaving tasks, and to convey a meaningful description of the event.
However, informaticians may need details about the system state,
work context or preceding actions that are often lacking in sponta-
neous and short messages to evaluate how a problem originated

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

C
om

m
en

ts
(

%
)

Evaluation Heuristic

Email Survey Evaluation Interview

Fig. 2. Proportions of comments by heuristic and source.

0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
C
om
m
en
ts
(

N
)

Evaluation Heuristic

Findings Comments

Fig. 3. Proportion of comments and findings by heuristic.

J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790 787

and its potential consequences. The usually large volume of emails
accumulated over time also contains repetitive, idiosyncratic and
inaccurate reports that may be of little value and need to be ex-
cluded. A self-selection bias among respondents (e.g., novice users
may be underrepresented) may accentuate marginal problems or
conceal more serious ones. Difficulties of more conceptual charac-
ter may be only rarely reported through comment messages, as
was evident from the analysis of our data (e.g., the distribution
of comments in heuristic categories, Fig. 2).

Among the most significant advantages of embedded email re-
sponse links are their inexpensive implementation, network-wide
availability, real-time response and continuous, active data collec-
tion. These characteristics make email an excellent data collection
method during pilot testing of release candidate versions and after
the release of full versions. There is a high probability of quickly

discovering technical problems, an opportunity to review medical
logic for decision support tools that may not have been tested in
complex scenarios (e.g., a patient with multiple comorbidities
and drug prescriptions), and a likelihood of finding inconsistencies
in terminology or ambiguities in language and expressions. For
example, of the 56 changes and corrections we implemented in
the prototype, 36 (64%) such problems were reported and identi-
fied in emails.

This method requires the software to be in the stage of a fully
functional prototype or in its final release form. It may therefore
be too laborious or expensive to make significant conceptual
changes in design at that point. However, our data suggest (e.g.,
the proportions of comments to specific concepts in Fig. 2) that
most of email-reported problems concern specific biomedical con-
tent, terminology and technical glitches that may be relatively easy

Table 4
Comparison of clinician response and formal usability evaluation results.

Descriptions Email Survey Usability Studies

Heuristic focus Biomedical, Cognition, Control, Customization,
Fault, Speed, Terminology

Control, Customization, Speed Cognition, Context, Consistency, Control,
Customization, Safety Transparency, Workflow

Evaluated aspects Software problems, medical logic, decision
support, use of terms, perceived speed,
interaction difficulties, desired functions

Satisfaction, perceived speed of completion,
qualitative assessments, desired functions,
personal preferences, use context

Design concepts, actual and anticipated errors,
cognitive load, workflow fit, cognitive model,
skilled and novice performance

When to perform Pilot release, shortly after full release After pilot release, after full release,
periodically

Early in design cycle, iteratively during
prototyping, planning stage before new design

Advantages Can identify rare and complex use situations,
immediate response when problem occurs,
everyone can comment

Allows comparison over time, broad reach,
can be web-based, ongoing

Describes human error, mental models, strategy,
structured and reliable, rich detail, insights into
workflow integration

Limitations Often missing context, may not be intelligible,
does not capture human error, self-selection bias

Relatively low reliability, reflective,
subjective, may be hard to interpret

Laborious, expertise is required, describes only
few use cases, needs expensive physician time

Source of data Clinicians Clinicians Usability experts

Timeframe Continuous Periodic Episodic

Sample quotes ‘‘I saved the note, then tried to Sign, and the
system just froze”
‘‘There appears to be a problem with logic when
the creatinine is too low”
‘‘I found it challenging to find signature location.
Spent extra time just looking at the screens”

‘‘Allow for easier management of insulin and
titration”
‘‘Needs a faster medication entry format”
‘‘I find it cumbersome to my workflow”

She did not notice the Save icon and searched for
a Save button at the bottom of the window.
He knew where to look for vitals, but had to
enter new values manually.
Hide non-essential icons. Create a dynamic right-
click context menu

788 J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790

to correct without large-scale changes in the code and screen
layout.

5.2. Survey

Survey is another form of direct clinician response that we used
in this study and it shares several characteristics with email com-
munication, such as a potentially wide reach, economy of adminis-
tration, a tendency for self-selection bias, relatively low response
rate and the brevity of its form. Unlike email links, surveys are
structured and contain a pre-determined set of questions to elicit
responses and opinions on narrow topics of interest. They do not
allow reporting problems in real time, however, and require
respondents to recall and interpret past events at the time the sur-
vey is completed. This may be difficult, as our data suggest that
free-text answers to open-ended questions did not contain refer-
ences to specific and detailed biomedical and technical problems,
the most frequent categories represented in emails (see Table 1).
Rather, clinicians tended to describe more broadly defined difficul-
ties with screen control, navigation and customization.

The content in surveys, as in other direct forms of communica-
tion, is often subjective, reflecting personal opinion, and therefore,
of lower descriptive value and accuracy than data gathered in pro-
fessional evaluations [22]. A substantial period of time needs to be
allowed for potential survey respondents to work with a fully
working prototype or the completed application before they can
form meaningful opinions and gain a measure of proficiency.

Surveys can be administered periodically for comparisons over
time and can be timed to coincide with important events such as
technology or procedures updates that may affect the way the sys-
tem is used. They can also be targeted to specific groups, such as
primary care physicians, pediatricians and other specialists.

5.3. Usability evaluations and interviews

The most telling indicators of conceptual flaws in the design
come from the observation of human interaction errors [47]. They
can provide insights into discrepancies between expected and ac-
tual behavior and identify inappropriate and ambiguous represen-
tational formats of information on the screen that impairs its

accurate interpretation [48]. Errors are rarely reported directly in
emails or in surveys, as the responders are not often aware of their
own mistakes. For example, observation experts in our study re-
ported that a clinician during a simulated task ‘‘could not tell
whether the patient was taking Aspirin, assumed that urinalysis
could only be ordered on paper and did not notice the save button,”
an insight that would not be gained by introspection and recall.

Usability inspection methods in which experts alone evaluate
the interface, such as the cognitive walkthrough and heuristic eval-
uation, provide predominantly normative assessments. In other
words, they report how well the interface supports the completion
of a standardized task that can be reasonably expected to be per-
formed routinely, and measure the extent to which the design ad-
heres to general usability standards. These methods produce
reference models of interaction that can be compared to evidence
from field observations.

Ethnographic and observational methods such as think-aloud
studies, on the other hand, derive data from analyzing unscripted
and natural interactions with the software by non-experts with
various levels of computer and task-domain skills. They are there-
fore inherently descriptive and analytic and allow researchers to
make inferences about the clarity and suitability of the design to
the task from observed competencies and errors. Usability experts
can integrate findings about interaction errors with interface eval-
uations, cognitive walkthroughs and heuristic evaluations into a
comprehensive analysis and formulate optimal strategies for mak-
ing modifications to the interface. Normative and descriptive
methods together constitute a comprehensive evaluation of design
in progress that can be repeated iteratively early in the process to
refine data representation and interaction concepts in each succes-
sive version.

Findings from experts in this study have been clearly focused on
conceptual and interaction-related aspects of the Smart Form
(Table 2). The structured format of think-aloud studies follows
pre-defined clinical scenarios that generally contain validated bio-
medical data and unambiguous terminology that do not represent
potential problems to be reported in evaluations. Comments from
clinicians working with the software in real settings, however, are
more descriptive of specific factual, technical and biomedical er-
rors that observational studies frequently do not capture. The

J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790 789

relative proportions of expert findings and clinicians’ comments in
each heuristic category and their respective tendency to describe
different aspects of the software are clearly evident in Fig. 3.

Experts can also capture more easily positive aspects of the de-
sign and confirm successful trends. For example, an evaluator re-
ported that ‘‘the subject seemed comfortable navigating around
and understands how to update medications in the system.” Email
responses are often initiated at the time of a failure or when an er-
ror is encountered, but rarely when the system is working well. In
effect, successful performance is characterized by uneventful and
well-progressing work which is apparent to observers but not of-
ten reported back to designers by clinicians.

Interviews with clinicians are usually done in conjunction with
observations to elucidate aspects of collected data that require
proper context for interpretation, and also as ‘‘debriefings” at the
end of after think-aloud studies. The results of expert evaluations
commonly incorporate insights and findings from interviews into
comprehensive reports.

Expert evaluations are indispensable during the initial design
stages when even significant corrections and reconceptualizations
are still possible without incurring steep penalties in time and
development effort.

6. Conclusion

This study has been conducted to characterize and compare
four usability evaluation methods that were employed by the re-
search team during the design and pilot testing of new clinical doc-
umentation software. We have also formulated a classification
scheme of heuristic usability concepts that incorporates estab-
lished principles and extends them for evaluations specific to the
clinical software domain.

Our results suggest that no single method describes better than
others all or most usability problems, but rather that each is opti-
mally suited for evaluations at different points of the design and
deployment process, and that they all characterize different as-
pects of the interface and human interaction. The studies and
assessments we have performed were embedded in the design pro-
cess and spanned the entire development cycle.

Heuristic evaluations and ethnographic observations of actual
clinical work by usability experts inform and guide conceptual
and workflow-related changes and need to be performed itera-
tively early in the design cycle so that they can be incorporated
without excessive effort and time. Responses elicited directly from
clinicians and other users through email links and surveys report
mostly technical, biomedical, terminology and control problems
that may occur in a wide variety of workflows and idiosyncratic
use patterns.

The evaluations were conducted on the relatively small scale of
a pilot study. However, the smaller size may be typical of many
software development efforts at large academic and healthcare
centers. The findings and lessons learned in this study may, there-
fore, be of interest to information system designers, developers and
research and development centers affiliated with hospitals and di-
rectly related to their experiences with the design and improve-
ment of clinical information systems. We have outlined a
methodological approach that is applicable to most development
processes of software intended for healthcare information systems.

We plan to formally validate and possibly revise the set of heu-
ristics we formulated and apply it to the evaluation of an informa-
tion system in its entirety that will also include judgments about
safety that were not performed in this pilot study.

Health information technology is still in its nascent state today.
Order entry systems, for example, still represented only a second
generation technology in 2006 and had many limitations that pre-

cluded their meaningful integration into the process of care [49].
Applications not appropriately matched to clinical tasks tend to
be chronically underused and may be eventually abandoned [21].

Acknowledgments

The Smart Form research was supported by Grant
5R01HS015169-03 from the Agency For Healthcare Research And
Quality. We wish to thank Alan Rose, Ruslana Tsurikova, Lynn Volk
and Svetlana Turovsky for their contribution and expertise in data
collection and initial interpretation, and to all clinicians who par-
ticipated in the four studies as subjects.

References

[1] Ammenwerth E, Schnell-Inderst P, Machan C, Siebert U. The effect of electronic
prescribing on medication errors and adverse drug events: a systematic
review. J Am Med Inform Assoc 2008;15:585–600.

[2] Linder JA, Ma J, Bates DW, Middleton B, Stafford RS. Electronic health record
use and the quality of ambulatory care in the United States. Arch Intern Med
2007;167:1400–5.

[3] Chaudhry B, Wang J, Wu S, Maglione M, Mojica W, Roth E, et al. Systematic
review: impact of health information technology on quality, efficiency, and
costs of medical care. Ann Intern Med 2006;144:742–52 [see comment].

[4] Kaushal R, Shojania KG, Bates DW. Effects of computerized physician order
entry and clinical decision support systems on medication safety: a systematic
review. Arch Intern Med 2003;163:1409–16.

[5] Koppel R, Metlay JP, Cohen A, Abaluck B, Localio AR, Kimmel SE, et al. Role of
computerized physician order entry systems in facilitating medication errors.
JAMA 2005;293:1197–203.

[6] Horsky J, Kuperman GJ, Patel VL. Comprehensive analysis of a medication
dosing error related to CPOE. J Am Med Inform Assoc 2005;12:377–82.

[7] Ludwick DA, Doucette J. Adopting electronic medical records in primary care:
lessons learned from health information systems implementation experience
in seven countries. Int J Med Inform 2009;78:22–31.

[8] Kaplan B, Harris-Salamone KD. White paper: Health IT project success and
failure: recommendations from literature and an AMIA workshop. J Am Med
Inform Assoc 2009;16:291–9.

[9] DesRoches CM, Campbell EG, Rao SR, Donelan K, Ferris TG, Jha AK, et al.
Electronic health records in ambulatory care: a national survey of physicians.
N Engl J Med 2008;359:50–60.

[10] Smelcer JB, Miller-Jacobs H, Kantrovich L. Usability of electronic medical
records. J Usability Stud 2009;4:70–84.

[11] Harrison MI, Koppel R, Bar-Lev S. Unintended consequences of information
technologies in health care: an interactive sociotechnical analysis. J Am Med
Inform Assoc 2007;14:542–9.

[12] Pizziferri L, Kittler AF, Volk LA, Honour MM, Gupta S, Wang SJ, et al. Primary
care physician time utilization before and after implementation of an
electronic health record: a time-motion study. J Biomed Inform
2005;38:176–88.

[13] Simon SR, Soran CS, Kaushal R, Jenter CA, Volk LA, Burdick E, et al. Physicians’
use of key functions in electronic health records from 2005 to 2007: a
statewide survey. J Am Med Inform Assoc 2009;16:465–70.

[14] Weir CR, Nebeker JJR, Hicken BL, Campo R, Drews F, LeBar B. A cognitive task
analysis of information management strategies in a computerized provider
order entry environment. J Am Med Inform Assoc 2007;14:65–75.

[15] Vicente KJ. Work domain analysis and task analysis: a difference that matters.
In: Schraagen JM, Chipman SF, editors. Cognitive task analysis. Mahwah,
NJ: Lawrence Erlbaum Associates, Inc.; 2000. p. 101–18.

[16] Zhang J, Butler K. UFuRT: A work-centered framework and process for design
and evaluation of information systems. HCI International Proceedings; 2007.

[17] Schnipper JL, Linder JA, Palchuk MB, Einbinder JS, Li Q, Postilnik A, et al. ‘‘Smart
Forms” in an electronic medical record: documentation-based clinical decision
support to improve disease management. J Am Med Inform Assoc
2008;15:513–23.

[18] Sittig DF, Stead WW. Computer-based physician order entry: the state of the
art. J Am Med Inform Assoc 1994;1:108–23.

[19] HIMSS EHR usability task force. Defining and testing EMR usability: principles
and proposed methods of EMR usability evaluation and rating. HIMSS; 2009.

[20] Ball MJ, Silva JS, Bierstock S, Douglas JV, Norcio AF, Chakraborty J, et al. Failure
to provide clinicians useful IT systems: opportunities to leapfrog current
technologies. Methods Inf Med 2008;47:4–7.

[21] Zheng K, Padman R, Johnson MP, Diamond HS. An interface-driven analysis of
user interactions with an electronic health records system. J Am Med Inform
Assoc 2009;16:228–37.

[22] Schwarz N, Oyserman D. Asking questions about behavior: cognition,
communication, and questionnaire construction. Am J Eval 2001;22:127.

[23] Jaspers MWM. A comparison of usability methods for testing interactive
health technologies: methodological aspects and empirical evidence. Int J Med
Inform 2009;78:340–53.

790 J. Horsky et al. / Journal of Biomedical Informatics 43 (2010) 782–790

[24] Uldall-Espersen T, Frokjaer E, Hornbaek K. Tracing impact in a usability
improvement process. Interact Comput 2008;20:48–63.

[25] Peleg M, Shachak A, Wang D, Karnieli E. Using multi-perspective
methodologies to study users’ interactions with the prototype front end of a
guideline-based decision support system for diabetic foot care. Int J Med
Inform 2009;78:482–93.

[26] Sittig DF, Singh H. Eight rights of safe electronic health record use. JAMA
2009;302:1111–3.

[27] Nielsen J. Iterative user interface design. IEEE Comput 1993;26:32–41.
[28] Gould JD, Lewis C. Designing for usability: key principles and what designers

think. Commun. ACM 1985;28:300–11.
[29] Walker JM, Carayon P, Leveson N, Paulus RA, Tooker J, Chin H, et al. EHR safety:

the way forward to safe and effective systems. J Am Med Inform Assoc
2008;15:272–7.

[30] Leveson NG. Intent specifications: an approach to building human-centered
specifications. IEEE Trans Software Eng 2000;26:15–35.

[31] Wachter SB, Agutter J, Syroid N, Drews F, Weinger MB, Westenskow D. The
employment of an iterative design process to develop a pulmonary graphical
display. J Am Med Inform Assoc 2003;10:363–72.

[32] Morae. 3.1 ed., Okemos, MI: TechSmith Corporation; 2009.
[33] Nielsen J, Mack RL. Usability inspection methods. New York: John Wiley &

Sons; 1994.
[34] Shneiderman B. Designing the user interface. Strategies for effective human–

computer-interaction. 4th ed. Reading, MA: Addison Wesley Longman; 2004.
[35] Tognazzini B. Tog on interface. Reading, Mass.: Addison-Wesley; 1992.
[36] Tufte ER. The visual display of quantitative information. 2nd ed. Cheshire,

Conn.: Graphics Press; 2001.
[37] Atkinson BF, Bennet TO, Bahr GS, Nelson MM. Development of a multiple

heuristics evaluation table (MHET) to support software development and
usability analysis. In: Universal access in human–computer interaction: coping
with diversity. Berlin/Heidelberg: Springer; 2007.

[38] Thyvalikakath TP, Schleyer TK, Monaco V. Heuristic evaluation of clinical
functions in four practice management systems: a pilot study. J Am Dent Assoc
2007;138:209–10.

[39] Scandurra I, Hagglund M, Engstrom M, Koch S. Heuristic evaluation performed
by usability-educated clinicians: education and attitudes. Stud Health Technol
Inform 2007:205–16.

[40] Lai TY. Iterative refinement of a tailored system for self-care management of
depressive symptoms in people living with HIV/AIDS through heuristic
evaluation and end user testing. Int J Med Inform 2007;76:S317–24.

[41] Tang Z, Johnson TR, Tindall RD, Zhang J. Applying heuristic evaluation to improve
the usability of a telemedicine system. Telemed J E Health 2006;12:24–34.

[42] Choi J, Bakken S. Heuristic evaluation of a web-based educational resource for
low literacy NICU parents. Stud Health Technol Inform 2006:194–9.

[43] Zhang J, Johnson TR, Patel VL, Paige DL, Kubose TK. Using usability heuristics to
evaluate patient safety of medical devices. J Biomed Inform 2003;36:23–30.

[44] Graham MJ, Kubose TK, Jordan DA, Zhang J, Johnson TR, Patel VL. Heuristic
evaluation of infusion pumps: implications for patient safety in intensive care
units. Int J Med Inform 2004;73:771–9.

[45] Allen M, Currie LM, Bakken S, Patel VL, Cimino JJ. Heuristic evaluation of paper-
based web pages: a simplified inspection usability methodology. J Biomed
Inform 2006;39:412–23.

[46] Corbin JM, Strauss AL. Basics of qualitative research: techniques and
procedures for developing grounded theory. 3rd ed. Los Angeles, Calif.: Sage
Publications, Inc.; 2008.

[47] Hall JG, Silva A. A conceptual model for the analysis of mishaps in human-
operated safety-critical systems. Saf Sci 2008;46:22–37.

[48] Johnson CM, Turley JP. The significance of cognitive modeling in building
healthcare interfaces. Int J Med Inform 2006;75:163–72.

[49] Ford EW, McAlearney AS, Phillips MT, Menachemi N, Rudolph B. Predicting
computerized physician order entry system adoption in US hospitals: can the
federal mandate be met? Int J Med Inform 2008;77:539–45.

Complementary methods of system usability evaluation: Surveys and observations during software design and development cycles

Introduction
Background
Methods
Email via an embedded link
Online survey
Think-aloud study and observations
Walkthroughs, expert evaluations and interviews
The development of heuristic usability assessment scheme
Participants
Results
Comments by clinicians
Findings by usability evaluators
Comments and findings comparison
Implementation of design changes to a revised prototype
Discussion
Email
Survey
Usability evaluations and interviews
Conclusion
Acknowledgments
References

Page 1

InternationalJournal of Performability Engineering Vol. 6, No. 6, November 2010, pp. 531-546.

Printed in India

*
Corresponding author’s email: nschneid@nps.navy.mil 53

Successful Application of Software Reliability: A Case Study

NORMAN F. SCHNEIDEWIND

Fellow of the IEEE

2822 Raccoon Trail

Pebble Beach, California 93953 USA

(Received on July 30, 2009, revised on May 3, 2010

)

Abstract: The purpose of this case study is to help readers implement or improve a

software reliability program in their organizations, using a step-by-step approach based on

the Institute of Electrical and Electronic Engineers (IEEE) and the American Institute of

Aeronautics and Astronautics Recommended (AIAA) Practice for Software Reliability,

released in June 2008, supported by a case study from the NASA Space Shuttle.

This case study covers the major phases that the software engineering practitioner

needs in planning and executing a software reliability-engineering program. These phases

require a number of steps for their implementation. These steps provide a structured

approach to the software reliability process. Each step will be discussed to provide a good

understanding of the entire software reliability process. Major topics covered are: data

collection, reliability risk assessment, reliability prediction, reliability prediction

interpretation, testing, reliability decisions, and lessons learned from the NASA Space

Shuttle software reliability engineering program.

Keywords: software reliability program, Institute of Electrical and Electronic Engineers

and the American Institute of Aeronautics and Astronautics Recommended Practice for

Software Reliability, NASA Space Shuttle application

1. Introduction

The IEEE\AIAA recommended practice provides a foundation on which

practitioners and researchers can build consistent methods [1]. This case study will

describe the SRE process and show that it is important for an organization to have a

disciplined process if it is to produce high reliability software. To accomplish this purpose,

an overview is presented of existing practice in software reliability, as represented by the

recommended practice [1]. This will provide the reader with the foundation to understand

the basic process of Software Reliability engineering (SRE). The Space Shuttle Primary

Avionics Software Subsystem will be used to illustrate the SRE

process.

The reliability prediction models that will be used are based on some key definitions

and assumptions,

follows:

Definitions

Interval: an integer time unit t of constant or variable length defined by t-1

t>0; failures are counted in intervals.

Number of Intervals: the number of contiguous integer time units t of constant or variable

length represented by a positive real number.

Norman F. Schneidewind

Operational Increment (OI): a software system comprised of modules and configured from

a series of builds to meet Shuttle mission functional requirements.

Time: continuous CPU execution time over an interval range.

Assumptions

1. Faults that cause failures are removed.

2. As more failures occur and more faults are corrected, remaining failures will be

reduced.

3. The remaining failures are “zero” for those OI’s that were executed for extremely

long times (years) with no additional failure reports; correspondingly, for these

OI’s, maximum failures equals total observed failures.

1.1 Space Shuttle Flight Software Application

The Shuttle software represents a successful integration of many of the computer

industry’s most advanced software engineering practices and approaches. Beginning in the

late 1970’s, this software development and maintenance project has evolved one of the

world’s most mature software processes applying the principles of the highest levels of the

Software Engineering Institute’s (SEI) Capability Maturity Model (the software is rated

Level 5 on the SEI scale) and ISO 9001 Standards [2]. This software process includes

state-of-the-practice software reliability engineering (SRE) methodologies.

The goals of the recommended practice are to: interpret software reliability

predictions, support verification and validation of the software, assess the risk of

deploying the software, predict the reliability of the software, develop test strategies to

bring the software into conformance with reliability specifications, and make reliability

decisions regarding deployment of the software.

Reliability predictions are used by the developer to add confidence to a formal

software certification process comprised of requirements risk analysis, design and code

inspections, testing, and independent verification and validation. This case study uses the

experience obtained from the application of SRE on the Shuttle project, because this

application is judged by NASA and the developer to be a successful application of SRE

[6]. These SRE techniques and concepts should be of value for other software systems

1.2 Reliability Measurements and Predictions

There are a number of measurements and predictions that can be made of reliability

to verify and validate the software. Among these are remaining failures, maximum

failures, total test time required to attain a given fraction of remaining failures, and time to

next failure. These have been shown to be useful measurements and predictions for: 1)

providing confidence that the software has achieved reliability goals; 2) rationalizing how

long to test a software component (e.g., testing sufficiently long to verify that the measured

reliability conforms to design specifications); and 3) analyzing the risk of not achieving

remaining failures and time to next failure goals [6]. Having predictions of the extent to

which the software is not fault free (remaining failures) and whether a failure it is likely to

occur during a mission (time to next failure) provide criteria for assessing the risk of

deploying the software. Furthermore, fraction of remaining failures can be used as both an

Successful Application of Software Reliability: Case Study

operational quality goal in predicting total test time requirements and, conversely, as an

indicator of operational quality as a function of total test time expended [6].

The various software reliability measurements and predictions can be divided into the

following two categories to use in combination to assist in assuring the desired level of

reliability of the software in mission critical systems like the Shuttle. The two categories

are: 1) measurements and predictions that are associated with residual software faults and

failures, and 2) measurements and predictions that are associated with the ability of the

software to complete a mission without experiencing a failure of a specified severity. In

the first category are: remaining failures, maximum failures, fraction of remaining failures,

and total test time required to attain a given number of fraction of remaining failures. In

the second category are: time to next failure and total test time required to attain a given

time to next failure. In addition, there is the risk associated with not attaining the required

remaining failures and time to next failure goals. Lastly, there is operational quality that is

derived from fraction of remaining failures. With this type of information, a software

manager can determine whether more testing is warranted or whether the software is

sufficiently tested to allow its release or unrestricted use. These predictions provide a

quantitative basis for achieving reliability goals [2].

1.3 Interpretations and Credibility

The two most critical factors in establishing credibility in software reliability

predictions are the validation method and the way the predictions are interpreted. For

example, a “conservative” prediction can be interpreted as providing an “additional margin

of confidence” in the software reliability, if that predicted reliability already exceeds an

established “acceptable level” or requirement. It may not be possible to validate

predictions of the reliability of software precisely, but it is possible with “high confidence”

to predict a lower bound on the reliability of that software within a specified environment.

If there historical failure data were available for a series of previous dates (and there

is actual data for the failure history following those dates), it would be possible to compare

the predictions to the actual reliability and evaluate the performance of the model. Taking

this approach will significantly enhance the credibility of predictions among those who

must make software deployment decisions based on the predictions [9].

1.4 Verification and Validation

Software reliability measurement and prediction are useful approaches to verify and

validate software. Measurement refers to collecting and analyzing data about the observed

reliability of software, for example the occurrence of failures during test. Prediction refers

to using a model to forecast future software reliability, for example failure rate during

operation. Measurement also provides the failure data that is used to estimate the

parameters of reliability models (i.e., make the best fit of the model to the observed failure

data). Once the parameters have been estimated, the model is used to predict the future

reliability of the software. Verification ensures that the software product, as it exists in a

given project phase, satisfies the conditions imposed in the preceding phase (e.g.,

reliability measurements of mission critical software components obtained during test

conform to reliability specifications made during design) [5]. Validation ensures that the

software product, as it exists in a given project phase, which could be the end of the

project, satisfies requirements (e.g., software reliability predictions obtained during test

correspond to the reliability specified in the requirements) [5].

534 Norman F. Schneidewind

Another way to interpret verification and validation is that it builds confidence that

software is ready to be released for operational use. The release decision is crucial for

systems in which software failures could endanger the safety of the mission and crew (i.e.,

mission critical software). To assist in making an informed decision, software risk analysis

and reliability prediction are integrated and provide stopping rules for testing. This

approach is applicable to all mission critical software. Improvements in the reliability of

software, where the reliability measurements and predictions are directly related to mission

and safety, contribute to system safety.

2. Implementing a Software Reliability Engineering Program

In broad terms, implementing a software reliability program is a two-phased

process. It consists of (1) identifying the reliability goals and (2) testing the software to see

whether it conforms to the goals. The reliability goals can be ideal (e.g., zero defects) but

should have some basis in reality based on tradeoffs between reliability and cost. The

testing phase is more complex because it involves collecting raw defect data and using it

for assessment and prediction.

The following are major SRE steps in the recommended practice, keyed to the phases

of the software development life cycle (not necessarily in chronological order):

2.1 State the Reliability Criteria (requirements analysis phase)

This might be stated, for example, as “no failure that would result in loss of life or

mission”.

2.2 Collect Fault and Failure Data (testing and operations phase)

For each system, there should be a brief description of its purpose and functions and

the fault and failure data, as shown below. Days # could be hours, minutes, as appropriate.

Code the Problem Report Identification to indicate Software (S) failure, Hardware (H)

failure, or People (P) failure.

• System Identification

• Purpose

• Functions

• Days # (since start of test)

• Problem Report Identification

• Problem Severity

• Failure Date

• Module with Fault

• Description of Problem

2.3 Establish Problem Severity Levels (requirements analysis phase)

Use a problem severity classification, such as the following:

1. Loss of life, loss of mission, abort mission.

2. Degradation in performance.

3. Operator annoyance.

4. System ok, but documentation in error.

5. Error in classifying a problem (i.e., no problem existed in the first place).

Note: Not all problems result in failures.

Successful Application of Software Reliability: Case Study

2.4 Develop Reliability Assurance Criteria(requirements analysis phase)

Two criteria for software reliability levels will be defined. Then these criteria will

be applied to the risk analysis of mission critical software. In the case of the Shuttle

example, the “risk” represents the degree to which the occurrence of failures does not meet

required reliability levels, regardless of how insignificant the failures may be. Although it

may be counterintuitive to include minor failures in reliability assessments, in reality,

doing so provides a conservative lower bound on assessment. That is, the actual reliability

is highly unlikely to be lower than the assessment.

Next, a variety of equations that are used in reliability prediction and risk analysis

will be defined and derived, including the relationship between time to next failure and

reduction in remaining failures. Then it is shown how the prediction equations can be used

to integrate testing with reliability and quality. An example is shown of how the risk

analysis and reliability predictions can be used to make decisions about whether the

software is ready to deploy. Note that these equation are based on the model in [9] because

this model is used on the Shuttle and is one of the models recommended in the

recommended practice [1]. Other models could be used, such as those in [9].

If the reliability goal is the reduction of failures of a specified severity to an

acceptable level of risk [7], then for software to be ready to deploy, after having been

tested for time t, it must satisfy the following criteria:

1) Predicted mean number of remaining failures r(t) < rc, (1)

where rc is a specified critical value , and

2) predicted mean time to next failure TF(t) > tm, (2)

where tm is mission duration.

For systems that are tested and operated continuously like the Shuttle, tt, TF (t), and

are measured in execution time. Note that, as with any methodology for assuring software

reliability, there is no guarantee that the expected level will be achieved. Rather, with these

criteria, the objective is to reduce the risk of deploying the software to a “desired” level.

2.5 Apply the Remaining Failures Criterion (testing phase)

Criterion (1) sets the threshold on remaining failures that must be satisfied in order to

deploy the software (i.e., no more than a specified number of failures).

If it is predicted that r(t) ≥ rc, then the process is to continue to test for a time t’ > t

that is predicted to achieve r(t’)

be experienced and more faults will be corrected so that the remaining failures will be

reduced by the quantity r(t) – r(t’). If the developer does not have the resources to satisfy

the criterion or is unable to satisfy the criterion through additional testing, the risk of

deploying the software prematurely should be assessed. It is known that it is impossible to

demonstrate the absence of faults [3]; however, the risk of failures occurring can be

reduced to an acceptable level, as represented by rc. This scenario is shown in Figure 1. In

case A, r (t)

and the mission would be postponed until the software is tested for time t’ when r (t’)

predicted. In both cases criterion 2) would also be required for the mission to begin.

536 Norman F. Schneidewind

Figure 1: Remaining Failures Criterion Scenario

2.6 Apply the Time to Next Failure Criterion (testing phase)

Criterion 2 specifies that the software must survive for a time greater than the

duration of the mission. If TF (t) ≤ tm, is predicted, the software is tested for a time t’ that

is predicted to achieve TF (t’) > tm, using assumptions 1and 2 that more failures will be

experienced and faults corrected, so that the mean time to next failure will be increased by

the quantity TF (t’) -TF (t). Again, if it is infeasible for the developer to satisfy the criterion

for lack of resources or failure to achieve test objectives, the risk of deploying the software

prematurely should be assessed. This scenario is shown in Figure 2.

Figure 2: Time to Next Failure Criterion Scenario

Start Test End Test, Begin Mission End

Mission

End

Mission

r(tt)

tt tt
’

Start Test

Continue

Test

End Test

Begin

Mission

r(tt)

TF (tt
’’
)

tt tt
’’

tm
tm

Start Test End Test, Begin Mission End Mission

Start Test
End Test

Begin

Mission
End

Mission
Continue

Test

TF (tt)

Successful Application of Software Reliability: Case Study

In case A, TF (t) > tm is predicted and the mission begins at t. In case B, TF (t) ≤ tm is

predicted, and in this case the mission would be postponed until the software is tested for

time tt’ when TF (t’) > tm is predicted. In both cases criterion 1) would also be required for

the mission to begin. If neither criterion is satisfied, the software is subjected to additional

inspection and testing, to remove more faults, until the desired level of risk is achieved.

2.7 Make a Risk Assessment (pre deployment or launch phase)

Reliability Risk pertains to executing the software of a mission critical system where

there is the chance of injury (e.g., astronaut injury or fatality), damage (e.g., destruction of

the Shuttle), or loss (e.g., loss of the mission) if a serious software failure occurs during a

mission. In the case of the Shuttle, where the occurrence of even trivial failures is rare, the

fraction of those failures that pose any reliability risk is too small to be statistically

significant. As a result, in order to have an adequate sample size for analysis, all failures

(of any severity) over the entire 20-year life of the project have been included in the failure

history database for this analysis. Therefore, the risk criterion metrics to be discussed for

the Shuttle quantify the degree of risk associated with the occurrence of any software

failure, no matter how insignificant it may be. As mentioned previously, this approach

provides a conservative lower bound to reliability predictions.

As an example, the Schneidewind Software Reliability Model (other software

reliability models could be used as well) is used to compute a parameter: fraction of

remaining failures as a function of the archived failure history during test and operation

[6]. The prediction methodology uses this parameter and other reliability quantities to

provide bounds on total test time, remaining failures, operational quality, and time to next

failure that are necessary to meet defined Shuttle software reliability levels.

The test time t can be considered a measure of the degree to which software

reliability goals have been achieved. This is particularly the case for systems like the

Shuttle where the software is subjected to continuous and rigorous testing for several years

in multiple facilities, using a variety of operational and training scenarios (e.g., by the

contractor in Houston, by NASA in Houston for astronaut training, and by NASA at Cape

Canaveral). In Figure 3, t is interpreted as an input to a risk reduction process, and r (t)

and TF (t) as the outputs, with rc and tm as risk thresholds of reliability that control the

process.

Figure 3: Risk Reduction Process

Reliability

Measure

Risk

Reduction

rc tm

r(tt)

TF(tt)

Total Test Time

Risk Criteria Levels

538 Norman F. Schneidewind

While it must be recognized that test time is not the only consideration in developing

test strategies and that there are other important factors, such as the consequences for

reliability and cost in selecting test cases [11], nevertheless, for the foregoing reasons, test

time has been found to be strongly positively correlated with reliability growth for the

Shuttle [9].

2.8 Evaluate Remaining Failures Risk (pre deployment or launch phase)

To obtain the mean value of the risk criterion metric (RCM) in equation (4), first,

the mean remaining failures must be predicted in equation (3).

( )
α

r(t )= exp -β(t -(s-1))
β
 
 

(3)

Then, the mean value of the risk criterion metric (RCM) for criterion 1 is formulated

as follows:

RCM r(t)= (r(t) – rc) / rc = (r(t) / rc) – 1 (4)

Equation (3) is plotted in Figure 4 as a function of t for rc = 1, for the Shuttle software

release OID, a software system comprised of modules and configured from a series of

builds to meet Shuttle mission functional requirements, where positive, zero, and negative

values correspond to r (t) > rc, r (t) = rc, and r (t) < rc, respectively.

Figure 4: RCM for Remaining Failures, (rc = 1), OID

In Figure 4, these values correspond to the following regions: above the X-axis

predicted remaining failures are greater than the specified value; on the X-axis predicted

remaining failures are equal to the specified value; and below the X-axis predicted

remaining failures are less than the specified value, which could represent a “safe”

threshold or in the Shuttle example, an “error-free” condition boundary. In the example it

can be seen that at t = 80 the risk transitions from the high risk region to the low risk

region.

-0.7

33.5 49 64.5 8

1.3

3.3

5.3

7.3

DESIRED

CRITICAL

r(tt)>rc

r(tt) = rc

Total Test Time (30 Day Intervals)

r(tt) < rc

Successful Application of Software Reliability: Case Study

539

2.9 Evaluate Time to Next Failure Risk (pre deployment or launch phase)

The mean value of the risk criterion metric (RCM) for criterion 2 is formulated as

follows:

RCM TF (t) = (tm – TF (t)) / tm=1 – (TF (t)) / tm (4)

Equation (4) is plotted in Figure 5 as a function of test time t for tm = 8 thirty day

intervals, for OID, where there is high risk for TF(tt) < tm. Once TF(tt) > tm, the risk is low.

Figure 5: RCM for Time to Next Failure (tm = 8 days) OIC

3. Make Reliability Predictions (test and operations phases)

In order to support the reliability goal and to assess the risk of deploying the

software, various reliability and quality predictions are made during the test phase to

validate that the software meets requirements. For example, suppose the software

reliability requirements state the following: 1) ideally, after testing the software for time t,

the mean predicted remaining failures shall be less than one; 2) if the ideal of 1) cannot be

achieved due to cost and schedule constraints, mean time to next failure, predicted after

testing for time t, shall exceed the mission duration; and 3) the risk of not meeting 1) and

2) shall be assessed.

3.1 Additional Risk Evaluation (test and operations phases)

In addition to remaining failures and time to failure risk, which have already been

discussed, various other predictions are made in order to provide a comprehensive

assessment of risk. These predictions are based on the Schneidewind Software Reliability

Model [1, 8, 9, 10]. Again, other models recommended in the Recommended Practice for

Software Reliability

[1] could be used. The Statistical Modeling and Estimation of

Reliability Functions for Software (SMERFS) [4] tool is used to support predictions.

In the following equations, parameter α is the failure rate at the beginning of
interval s; parameter β is the negative of the derivative of failure rate divided by failure

rate (i.e., relative failure rate); t is test time or the last interval of observed failure data; s is

the starting interval for using observed failure data in parameter estimation that provides

-73

24 28 32 4

-53

-33

-13

DESIRED TF(tt)>Tm

Tm = 8 days

CRITICAL TF(tt) < Tm

Total Test Time (30 Day Intervals)

TF(tt) =Tm

540 Norman F. Schneidewind

the best estimates of α and β and the most accurate predictions [8]; Xs-1 is the observed
failure count in the range [1,s-1]; Xs, t is the observed failure count in the range [s,t];
and Xt=Xs-1+Xs,t. Failures are counted against operational increments (OIs).

Cumulative Failures: When estimates are obtained for the parameters α and β, with s as

the starting interval for using observed failure data, the predicted failure count in the range

[1,t] is obtained (i.e., cumulative failures) [6]:

F (t)=(α/β)[1-exp (-β ((t-s+1)))]+Xs-1 (6)

Figure 6 provides risk reduction in the sense that the predicted cumulative failures

provide an upper bound on the actual failures (i.e., there is assurance that the actual

failures will ne exceed the predicted values). In addition, risk is mitigated by the fact that

the predictions increase at an increasing rate. Also shown in this figure is the mean relative

error (MRE) between actual and predicted values. The MRE is high due to the fact that

predictions are consistently higher that actual values.

Figure 6: Total Test Time and Remaining Failures vs. Fraction Remaining Failures, OIA

Maximum Failures: Let t→∞ in equation (6) and obtain the predicted failure count

in the range [1,∞] (i.e., maximum failures over the life of the software):

F (∞) = α/β+Xs-1 (7)

Applying equation (7), the predicted maximum failures = 18.4706. Thus, we would

have low risk that the actual cumulative failures will not exceed the value.

Fraction of Remaining Failures: If equation (3) is divided by equation (7), fraction of

remaining failures, predicted at time t is obtained:

p(t)= r(t) /F(∞) (8)

According to the manager of Shuttle software development, equation (8) is an

excellent management tool for providing confidence that the software is ready to deploy,

T
o
ta

l
T

e
st

T
im

e
t

t
(

a
y
I

n
te

rv
a
ls

)
0
0

0.1 0.2 0.3 0.4

120

160

Total Test Time (30 Day Intervals)

0.5

3
4
5

+
+
+

r(tt)

N
u

m
b
e
r

o
f

R
e
m

a
in

in
g

F
a
il

u
re

s
r
(

t t
)

Successful Application of Software Reliability: Case Study

541

as the fraction remaining failures becomes miniscule, with increasing testing, as Figure 7

attests [5].

Figure 7: Operational Quality (Fraction Fault Removal) vs. Total Test Time, OIA

Operational Quality: The operational quality of software is the complement of p(t). It is

the degree to which software is free of remaining faults (failures), using the assumption 1

that the faults that cause failures are removed. It is predicted at time t as follows:

Q (t) = 1-p (t) (9)

This risk metric is useful because some software engineers and managers would

prefer to see things in a positive light — quality growth. Figure 7 demonstrates that after t =

100 the improvement in quality becomes miniscule, and the cost to remove additional

faults would be significant. Thus this figure metrics for risk assessment and a sopping rule

for when to terminate testing.

Total Test Time to Achieve Specified Remaining Failures. The predicted test time

required to achieve a specified number of remaining failures at t, r (t), is obtained from

equation (3) by solving for t:

t =
1 r(t)β

(β(s-1)-log( )
β α

(10)

Equation (10) is another risk reduction metric based on the concept that the

predicted test time to achieve a specified number of remaining failures reveals how much

test time and effort would be required to achieve various levels of risk, as represented by

specified remaining failures, as shown in Figure 8, where, naturally, the test time and cost

becomes significantly high in order to achieve significant reductions in risk.

3.2 Interpret Software Reliability Predictions (pre deployment or launch phase)

Total Test Time (30 Day Intervals)
0

0.67

40 80 120

0.75

0.84

0.92

1.0

160

0.59

542 Norman F. Schneidewind

Successful use of statistical modeling in predicting the reliability of a software

system requires a thorough understanding of precisely how the resulting predictions are to

be interpreted and applied [9]. The Shuttle software (430 KLOC) is frequently modified,

at the request of NASA, to add or change capabilities using a constantly improving

process.

Figure 8: Launch Decision: Remaining Failures vs. Total Test Time, OIA

Each of these successive versions constitutes an upgrade to the preceding software

version. Each new version of the software (designated as an Operational Increment, OI)

contains software code that has been carried forward from each of the previous versions

(“previous-version subset”) as well as new code generated for that new version (“new-

version subset”). We have found that by applying a reliability model independently to the

code subsets we can obtain satisfactory composite predictions for the total version [9].

It is essential to recognize that this approach requires a very accurate code change

history so that every failure can be uniquely attributed to the version in which the defective

line(s) of code were first introduced. In this way, it is possible to build a separate failure

history for the new code in each release. To apply SRE to a software system, it should be

broken your down into smaller elements to which a reliability model can be more

accurately applied. This approach has been successfully applied to predict the reliability of

the Shuttle software for NASA [9].

3.3 Use Software Reliability Tools (test and operations phases)

It is infeasible to do large-scale reliability prediction by hand. Therefore, there are

software reliability tools available to make the model predictions easier to achieve. The

Statistical Modeling and Estimation of Reliability Functions for Software (SMERFS) is a

software package available for this purpose [4]. However, it is important for the user to

understand the capabilities, applicability, and limitations of such tools.

0
1
40 80 120
2
3
4
5
Total Test Time (30 Day Intervals)
160
0

r = Remaining Failures

tt = Total Test Time Until

Launch

EXAMPLE:

(r = 0.6, tt = 52)

Successful Application of Software Reliability: Case Study

543

4. Lessons Learned

Several important lessons have been learned from the experience of twenty years in

developing and maintaining the Shuttle software, which you could consider for adoption in

your SRE process:

1) No one SRE process method is the “silver bullet” for achieving high reliability.

Various methods, including formal inspections, failure modes analysis, verification

and validation, testing, statistical process control, risk analysis, and reliability

modeling and prediction must be integrated and applied.

2) The process must be continually improved and upgraded. For example, recent

experiments with software metrics have demonstrated the potential of using metrics as

early indicators of future reliability problems. This approach, combined with

inspections, allows many reliability problems to be identified and resolved before

testing.

3) The process must have feedback loops so that information about reliability

problems discovered during inspection and testing is fed back not only to

requirements analysis and design for the purpose of improving the reliability of future

products but also to the requirements analysis, design, inspection and testing

processes themselves. In other words, the feedback is designed to improve not only

the product but also the processes that produce the product.

4) Given the current state-of-the-practice in software reliability modeling and

prediction, practitioners should not view reliability models as having the ability to

make highly accurate predictions of future software reliability. Rather, software

managers should interpret these predictions in two significant ways: a) providing

increased confidence, when used as part of an integrated SRE process, that the

software is safe to deploy; and b) providing bounds on the reliability of the deployed

software (e.g., high confidence that in operation the time to next failure will exceed

the predicted value and the predicted value will safely exceed the mission duration).

5. Conclusions

We showed how software reliability predictions can increase confidence in the

reliability of mission critical software such as the NASA Space Shuttle Primary Avionics

Software System. These results are applicable to other mission critical software.

Remaining failures, maximum failures, total test time required to attain a given fraction of

remaining failures, and time to next failure were shown to be useful reliability

measurements and predictions for: 1) providing confidence that the software has achieved

reliability goals; 2) rationalizing how long to test a piece of software; and 3) analyzing the

risk of not achieving remaining failure and time to next failure goals. Having predictions

of the extent that the software is not fault free (remaining failures) and whether it is

likely to survive a mission (time to next failure) provide criteria for assessing the risk of

deploying the software. Furthermore, fraction of remaining failures can be used as both an

operational quality goal in predicting total test time requirements and, conversely, as an

indicator of operational quality as a function of total test time expended.

Software reliability engineering is a tool that software managers can use to provide

confidence that the software meets reliability goals.

544 Norman F. Schneidewind

References

[1]. IEEE/AIAA P1633™, Recommended Practice on Software Reliability, June 2008.

[2]. Billings C., J. Clifton, B. Kolkhorst, E. Lee, and W.B. Wingert. Journey to a Mature

Software Process. IBM Systems Journal 1994; 33 (1): 46-61.

[3]. Dijkstra E. Structured Programming, Software Engineering Techniques. eds. J. N.

Buxton and B. Randell, NATO Scientific Affairs Division, Brussels 39, Belgium April

1970 : 84-88.

[4]. Farr W. and O. Smith. Statistical Modeling and Estimation of Reliability Functions for

Software (SMERFS) Users Guide. NAVSWC TR-84-373, Revision 3, Naval Surface

Weapons Center, Revised September 1993.

[5]. IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12.

1990.

The Institute of Electrical and Electronics Engineers, New York, New York, March 30,

1990.

[6]. Keller T., N. Schneidewind, and P. Thornton. Predictions for Increasing Confidence in

the Reliability of the Space Shuttle Flight Software. Proceedings of the AIAA

Computing in Aerospace 10, San Antonio, TX, March 28, 1995: 1-8.

[7]. Schneidewind N. Reliability Modeling for Safety Critical Software, IEEE Transactions

on Reliability March 1997; 46(1):88-98.

[8]. Schneidewind N. Software Reliability Model with Optimal Selection of Failure Data.

IEEE Transactions on Software Engineering November 1993;19(11):1095-1104.

[9]. Schneidewind N. and T. Keller. Application of Reliability Models to the Space Shuttle.

IEEE Software July 1992; 9(4)28-33.

[10]. Schneidewind N. Analysis of Error Processes in Computer Software. Proceedings of the

International Conference on Reliable Software, IEEE Computer Society, 21-23 April

1975:337-346.

[11]. Weyuker E. Using the Consequences of Failures for Testing and Reliability Assessment,

Proceedings of the Third ACM SIGSOFT Symposium on the Foundations of Software

Engineering, Washington, D.C., October 10-13, 1995:81-91.

Bibliography

1. Boehm B. Software Risk Management: Principles and Practices. IEEE Software

January 1991; 8(1): 32-41.

2. Dalal S. and A. McIntosh. When to Stop Testing for Large Software Systems with

Changing Code. IEEE Transactions on Software Engineering April 1994; 20(4):

318-323.

3. Dalal S. and A. McIntosh. Some Graphical Aids for Deciding When to Stop

Testing. IEEE Journal on Selected Areas in Communications February 1990;

8(2):169-175.

4. Ehrlich W., B. Prasanna, John Stampfel, and Jar Wu. Determining the Cost of a

Stop-Test Decision. IEEE Software, March 1993:10(2) 33-42.

5. Keller T. and N. Schneidewind. A Successful Application of Software Reliability

Engineering for the NASA Space Shuttle. Software Reliability Engineering Case

Studies. International Symposium on Software Reliability Engineering, ,

Albuquerque, New Mexico, November 4, 1997: 71-82.

6. Leveson N. Software Safety: What, Why, and How. ACM Computing Surveys

June 1986; 18(2):125-163.

Successful Application of Software Reliability: Case Study

545

7. Lyu M. (Editor-in-Chief), Handbook of Software Reliability Engineering.

Computer Society Press, Los Alamitos, CA and McGraw-Hill, New York, NY,

1995.

8. Musa J. and A. Ackerman. Quantifying Software Validation: When to Stop

Testing? IEEE Software May 1989; 6(3):19-27.

9. Musa John D., Anthony Iannino, and Kazuhira Okumoto. Software Reliability:

Measurement, Prediction, and Applications. McGraw-Hill, New York 1987.

10. Nikora A., N. Schneidewind, and J. Munson. Practical Issues In Estimating Fault

Content And Location In Software Systems. Proceedings of the AIAA Space

Technology Conference and Exposition, Albuquerque, NM, Sep 29-30, 1999.

11. Nikora A., N. Schneidewind, and J. Munson. IV&V Issues in Achieving High

Reliability and Safety in Critical Control Software. Final Report, Volume 1 –

Measuring and Evaluating the Software Maintenance Process and Metrics-Based

Software Quality Control, Volume 2 – Measuring Defect Insertion Rates and

Risk of Exposure to Residual Defects in Evolving Software Systems, and Volume

3 – Appendices, Jet Propulsion Laboratory, National Aeronautics and Space

Administration, Pasadena, California, January 19, 1998.

12. A. Nikora, N. Schneidewind, and J. Munson. IV&V Issues in Achieving High

Reliability and Safety in Critical Control System Software. Proceedings of the

Third International Society of Science and Applied Technologies Conference on

Quality in Design, Anaheim, California, March 12-14, 1997: 25-30.

13. Schneidewind N. Measuring and Evaluating Maintenance Process Using

Reliability, Risk, and Test Metrics. IEEE Transactions on Software Engineering

November/December 1999; 25(6): 768-781.

14. Schneidewind N. Software Validation for Reliability. Wiley Encyclopedia of

Electrical and Electronics Engineering, John G. Webster, editor, John Wiley &

Sons, Inc., 1999;19: 607-618.

15. Schneidewind N. Reliability Modeling for Safety Critical Software. IEEE

Transactions on Reliability March 1997; 46(1):88-98.

16. Singpurwalla N. Determining an Optimal Time Interval for Testing and

Debugging Software. IEEE Transactions on Software Engineering April 1991;

17(4): 313-319.

17. Voas J. and K. Miller. Software Testability: The New Verification. IEEE

Software May 1995; 12(3):17-28.

Norman F. Schneidewind, Ph.D., is Professor Emeritus of Information Sciences in the

Department of Information Sciences and the Software Engineering Group at the Naval

Postgraduate School. He is now doing research and publishing in software reliability and

metrics with his consulting company Computer Research. Dr. Schneidewind is a Fellow of

the IEEE, elected in 1992 “for contributions to software measurement models in reliability

and metrics, and for leadership in advancing the field of software maintenance”. In

2001, he received the IEEE Reliability Engineer of the Year award from the IEEE

Reliability Society. In 1993 and 1999, he received awards for Outstanding Research

Achievement by the Naval Postgraduate School.

Dr. Schneidewind was selected for an IEEE USA Congressional Fellowship for

2005 and worked with the Committee on Homeland Security and Government Affairs,

United States Senate, focusing on homeland security, cyber security, and privacy. In

March, 2006, he received the IEEE Computer Society Outstanding Contribution Award

546 Norman F. Schneidewind

for “outstanding technical and leadership contributions as the Chair of the Working Group

revising IEEE Standard 982.1”.

He is the developer of the Schneidewind software reliability model that was used by

NASA to assist in the prediction of software reliability of the Space Shuttle, by the Naval

Surface Warfare Center for Tomahawk cruise missile launch and Trident software

reliability prediction, and by the Marine Corps Tactical Systems Support Activity for

distributed system software reliability assessment and prediction. This model is

recommended by the IEEE and the American Institute of Aeronautics and Astronautics

Recommended Practice for Software Reliability. In addition, the model is implemented in

the Statistical Modeling and Estimation of Reliability Functions for Software (SMERFS),

software reliability-modeling tool.

54 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1

practice

I
M

E
B

Y
V

I
T

E
Z

S
L

A
V

V
A

L
K

T H E H E T E R O G E N E I T Y, C O M P L E X I T Y, and scale of cloud
applications make verification of their fault tolerance
properties challenging. Companies are moving away
from formal methods and toward large-scale testing
in which components are deliberately compromised
to identify weaknesses in the software. For example,
techniques such as Jepsen apply fault-injection testing
to distributed data stores, and Chaos Engineering
performs fault injection experiments on production
systems, often on live traffic. Both approaches have
captured the attention of industry and academia alike.

Unfortunately, the search space of distinct fault
combinations that an infrastructure can test is
intractable. Existing failure-testing solutions require
skilled and intelligent users who can supply the faults
to inject. These superusers, known as Chaos Engineers

and Jepsen experts, must study the sys-
tems under test, observe system execu-
tions, and then formulate hypotheses
about which faults are most likely to
expose real system-design flaws. This
approach is fundamentally unscal-
able and unprincipled. It relies on the
superuser’s ability to interpret how
a distributed system employs redun-
dancy to mask or ameliorate faults
and, moreover, the ability to recognize
the insufficiencies in those redundan-
cies—in other words, human genius.

This article presents a call to arms
for the distributed systems research
community to improve the state of
the art in fault tolerance testing.
Ordinary users need tools that au-
tomate the selection of custom-tai-
lored faults to inject. We conjecture
that the process by which superusers
select experiments—observing execu-
tions, constructing models of system
redundancy, and identifying weak-
nesses in the models—can be effec-
tively modeled in software. The ar-
ticle describes a prototype validating
this conjecture, presents early results
from the lab and the field, and identi-
fies new research directions that can
make this vision a reality.

The Future Is Disorder
Providing an “always-on” experience
for users and customers means that
distributed software must be fault tol-
erant—that is to say, it must be writ-
ten to anticipate, detect, and either
mask or gracefully handle the effects
of fault events such as hardware fail-
ures and network partitions. Writing
fault-tolerant software—whether for
distributed data management systems
involving the interaction of a handful
of physical machines, or for Web ap-
plications involving the cooperation of
tens of thousands—remains extremely
difficult. While the state of the art in
verification and program analysis con-
tinues to evolve in the academic world,
the industry is moving very much in
the opposite direction: away from for-
mal methods (however, with some
noteworthy exceptions,41) and toward

Abstracting
the Geniuses
Away from
Failure Testing

D O I : 1 0 . 1 1 4 5 / 3 1 5 2 4 8 3

Article development led by
queue.acm.org

Ordinary users need tools that automate the
selection of custom-tailored faults to inject.

BY PETER ALVARO AND SEVERINE TYMON

http://dx.doi.org/10.1145/3152483

J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 55

56 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1

practice

up the stack and frustrate any attempts
at abstraction.

The Old Guard. The modern myth:
Formally verified distributed compo-
nents. If we cannot rely on geniuses to
hide the specter of partial failure, the
next best hope is to face it head on,
armed with tools. Until quite recently,
many of us (academics in particular)
looked to formal methods such as
model checking16,20,29,39,40,53,54 to assist
“mere mortal” programmers in writ-
ing distributed code that upholds its
guarantees despite pervasive uncer-
tainty in distributed executions. It is
not reasonable to exhaustively search
the state space of large-scale systems
(one cannot, for example, model
check Netflix), but the hope is that
modularity and composition (the next
best tools for conquering complexity)
can be brought to bear. If individual
distributed components could be
formally verified and combined into
systems in a way that preserved their
guarantees, then global fault toler-
ance could be obtained via composi-
tion of local fault tolerance.

Unfortunately, this, too, is a pipe
dream. Most model checkers require
a formal specification; most real-world
systems have none (or have not had one
since the design phase, many versions
ago). Software model checkers and oth-
er program-analysis tools require the
source code of the system under study.
The accessibility of source code is also
an increasingly tenuous assumption.
Many of the data stores targeted by
tools such as Jepsen are closed source;
large-scale architectures, while typical-
ly built from open source components,
are increasingly polyglot (written in a
wide variety of languages).

Finally, even if you assume that spec-
ifications or source code are available,
techniques such as model checking are
not a viable strategy for ensuring that
applications are fault tolerant because,
as mentioned, in the context of time-
outs, fault tolerance itself is an end-to-
end property that does not necessarily
hold under composition. Even if you
are lucky enough to build a system out
of individually verified components, it
does not follow the system is fault toler-
ant—you may have made a critical error
in the glue that binds them.

The Vanguard. The emerging ethos:
YOLO. Modern distributed systems

approaches that combine testing with
fault injection.

Here, we describe the underlying
causes of this trend, why it has been
successful so far, and why it is doomed
to fail in its current practice.

The Old Gods. The ancient myth:
Leave it to the experts. Once upon a
time, distributed systems researchers
and practitioners were confident that
the responsibility for addressing the
problem of fault tolerance could be
relegated to a small priesthood of ex-
perts. Protocols for failure detection,
recovery, reliable communication,
consensus, and replication could be
implemented once and hidden away
in libraries, ready for use by the layfolk.

This has been a reasonable dream.
After all, abstraction is the best tool
for overcoming complexity in com-
puter science, and composing reliable
systems from unreliable components
is fundamental to classical system
design.33 Reliability techniques such
as process pairs18 and RAID45 dem-
onstrate that partial failure can, in
certain cases, be handled at the low-
est levels of a system and successfully
masked from applications.

Unfortunately, these approaches
rely on failure detection. Perfect failure
detectors are impossible to implement
in a distributed system,9,15 in which it
is impossible to distinguish between
delay and failure. Attempts to mask
the fundamental uncertainty arising
from partial failure in a distributed
system—for example, RPC (remote
procedure calls8) and NFS (network file
system49)—have met (famously) with
difficulties. Despite the broad consen-
sus that these attempts are failed ab-
stractions,28 in the absence of better
abstractions, people continue to rely
on them to the consternation of devel-
opers, operators, and users.

In a distributed system—that is, a
system of loosely coupled components
interacting via messages—the failure
of a component is only ever manifested
as the absence of a message. The only
way to detect the absence of a message
is via a timeout, an ambiguous signal
that means either the message will nev-
er come or that it merely has not come
yet. Timeouts are an end-to-end con-
cern28,48 that must ultimately be man-
aged by the application. Hence, partial
failures in distributed systems bubble

While the state
of the art in
verification and
program analysis
continues to evolve
in the academic
world, the industry
is moving in the
opposite direction:
away from formal
methods and
toward approaches
that combine
testing with fault
injection.

J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 57

practice

are simply too large, too heteroge-
neous, and too dynamic for these
classic approaches to software qual-
ity to take root. In reaction, practitio-
ners increasingly rely on resiliency
techniques based on testing and fault
injection.6,14,19,23,27,35 These “black box”
approaches (which perturb and ob-
serve the complete system, rather
than its components) are (arguably)
better suited for testing an end-to-
end property such as fault tolerance.
Instead of deriving guarantees from
understanding how a system works
on the inside, testers of the system
observe its behavior from the outside,
building confidence that it functions
correctly under stress.

Two giants have recently emerged
in this space: Chaos Engineering6 and
Jepsen testing.24 Chaos Engineering,
the practice of actively perturbing pro-
duction systems to increase overall site
resiliency, was pioneered by Netflix,6
but since then LinkedIn,52 Microsoft,38
Uber,47 and PagerDuty5 have developed
Chaos-based infrastructures. Jepsen
performs black box testing and fault
injection on unmodified distributed
data management systems, in search
of correctness violations (for example,
counterexamples that show an execu-
tion was not linearizable).

Both approaches are pragmatic and
empirical. Each builds an understand-
ing of how a system operates under
faults by running the system and observ-
ing its behavior. Both approaches offer
a pay-as-you-go method to resiliency:
the initial cost of integration is low,
and the more experiments that are
performed, the higher the confidence
that the system under test is robust.
Because these approaches represent
a straightforward enrichment of exist-
ing best practices in testing with well-
understood fault injection techniques,
they are easy to adopt. Finally, and
perhaps most importantly, both ap-
proaches have been shown to be effec-
tive at identifying bugs.

Unfortunately, both techniques
also have a fatal flaw: they are manual
processes that require an extremely
sophisticated operator. Chaos Engi-
neers are a highly specialized subclass
of site reliability engineers. To devise
a custom fault injection strategy, a
Chaos Engineer typically meets with
different service teams to build an

understanding of the idiosyncrasies
of various components and their in-
teractions. The Chaos Engineer then
targets those services and interactions
that seem likely to have latent fault tol-
erance weaknesses. Not only is this ap-
proach difficult to scale since it must
be repeated for every new composition
of services, but its critical currency—
a mental model of the system under
study—is hidden away in a person’s
brain. These points are reminiscent
of a bigger (and more worrying) trend
in industry toward reliability priest-
hoods,7 complete with icons (dash-
boards) and rituals (playbooks).

Jepsen is in principle a framework
that anyone can use, but to the best of
our knowledge all of the reported bugs
discovered by Jepsen to date were dis-
covered by its inventor, Kyle Kingsbury,
who currently operates a “distributed
systems safety research” consultancy.24
Applying Jepsen to a storage system
requires the superuser carefully read
the system documentation, generate
workloads, and observe the externally
visible behaviors of the system under
test. It is then up to the operator to
choose—from the massive combina-
torial space of “nemeses,” including
machine crashes and network parti-
tions—those fault schedules that are
likely to drive the system into returning
incorrect responses.

A human in the loop is the kiss of
death for systems that need to keep up
with software evolution. Human atten-
tion should always be targeted at tasks
that computers cannot do! Moreover,
the specialists that Chaos and Jepsen
testing require are expensive and rare.
Here, we show how geniuses can be ab-
stracted away from the process of fail-
ure testing.

We Don’t Need Another Hero
Rapidly changing assumptions about
our visibility into distributed system
internals have made obsolete many
if not all of the classic approaches to
software quality, while emerging “cha-
os-based” approaches are fragile and
unscalable because of their genius-in-
the-loop requirement.

We present our vision of automated
failure testing by looking at how the
same changing environments that has-
tened the demise of time-tested resil-
iency techniques can enable new ones.

We argue the best way to automate the
experts out of the failure-testing loop is
to imitate their best practices in soft-
ware and show how the emergence of
sophisticated observability infrastruc-
ture makes this possible.

The order is rapidly fadin.’ For large-
scale distributed systems, the three
fundamental assumptions of tradi-
tional approaches to software quality
are quickly fading in the rearview mir-
ror. The first to go was the belief that
you could rely on experts to solve the
hardest problems in the domain. Sec-
ond was the assumption that a formal
specification of the system is available.
Finally, any program analysis (broadly
defined) that requires that source code
is available must be taken off the ta-
ble. The erosion of these assumptions
helps explain the move away from clas-
sic academic approaches to resiliency
in favor of the black box approaches
described earlier.

What hope is there of understand-
ing the behavior of complex systems
in this new reality? Luckily, the fact
that it is more difficult than ever to
understand distributed systems from
the inside has led to the rapid evolu-
tion of tools that allow us to under-
stand them from the outside. Call-
graph logging was first described by
Google;51 similar systems are in use
at Twitter,4 Netflix,1 and Uber,50 and
the technique has since been stan-
dardized.43 It is reasonable to assume
that a modern microservice-based
Internet enterprise will already have
instrumented its systems to collect
call-graph traces. A number of start-
ups that focus on observability have
recently emerged.21,34 Meanwhile,
provenance collection techniques
for data processing systems11,22,42 are
becoming mature, as are operating
system-level provenance tools.44 Re-
cent work12,55 has attempted to infer
causal and communication structure
of distributed computations from
raw logs, bringing high-level explana-
tions of outcomes within reach even
for uninstrumented systems.

Regarding testing distributed systems.
Chaos Monkey, like they mention, is awe-
some, and I also highly recommend get-
ting Kyle to run Jepsen tests.

—Commentator on HackerRumor

58 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1

practice

of properties that are either maintained
throughout the system’s execution (for
example, system invariants or safety
properties) or established during execu-
tion (for example, liveness properties).
Most distributed systems with which
we interact, though their executions
may be unbounded, nevertheless pro-
vide finite, bounded interactions that
have outcomes. For example, a broad-
cast protocol may run “forever” in a re-
active system, but each broadcast deliv-
ered to all group members constitutes
a successful execution.

By viewing distributed systems in
this way, we can revise the definition:
A system is fault tolerant if it provides
sufficient mechanisms to achieve its
successful outcomes despite the given
class of faults.

Step 3: Formulate experiments that
target weaknesses in the façade. If we
could understand all of the ways in
which a system can obtain its good
outcomes, we could understand which
faults it can tolerate (or which faults it
could be sensitive to). We assert that
(whether they realize it or not!) the
process by which Chaos Engineers
and Jepsen superusers determine, on
a system-by-system basis, which faults
to inject uses precisely this kind of rea-
soning. A target experiment should
exercise a combination of faults that
knocks out all of the supports for an ex-
pected outcome.

Carrying out the experiments turns
out to be the easy part. Fault injection
infrastructure, much like observability
infrastructure, has evolved rapidly in
recent years. In contrast to random,
coarse-grained approaches to distrib-
uted fault injection such as Chaos
Monkey,23 approaches such as FIT
(failure injection testing)17 and Grem-
lin32 allow faults to be injected at the
granularity of individual requests with
high precision.

Step 4. Profit! This process can be ef-
fectively automated. The emergence of
sophisticated tracing tools described
earlier makes it easier than ever to
build redundancy models even from
the executions of black box systems.
The rapid evolution of fault injection
infrastructure makes it easier than
ever to test fault hypotheses on large-
scale systems. Figure 1 illustrates how
the automation described in this here
fits neatly between existing observ-

Away from the experts. While this
quote is anecdotal, it is difficult to
imagine a better example of the fun-
damental unscalability of the current
state of the art. A single person can-
not possibly keep pace with the ex-
plosion of distributed system imple-
mentations. If we can take the human
out of this critical loop, we must; if we
cannot, we should probably throw in
the towel.

The first step to understanding how
to automate any process is to compre-
hend the human component that we
would like to abstract away. How do
Chaos Engineers and Jepsen superus-
ers apply their unique genius in prac-
tice? Here is the three-step recipe com-
mon to both approaches.

Step 1: Observe the system in action.
The human element of the Chaos and
Jepsen processes begins with princi-
pled observation, broadly defined.

A Chaos Engineer will, after study-
ing the external API of services rel-
evant to a given class of interactions,
meet with the engineering teams to
better understand the details of the
implementations of the individual

services.25 To understand the high-
level interactions among services, the
engineer will then peruse call-graph
traces in a trace repository.3

A Jepsen superuser typically begins
by reviewing the product documenta-
tion, both to determine the guarantees
that the system should uphold and to
learn something about the mecha-
nisms by which it does so. From there,
the superuser builds a model of the
behavior of the system based on inter-
action with the system’s external API.
Since the systems under study are typ-
ically data management and storage,
these interactions involve generating
histories of reads and writes.31

The first step to understanding what
can go wrong in a distributed system is
watching things go right: observing the
system in the common case.

Step 2. Build a mental model of how
the system tolerates faults. The com-
mon next step in both approaches is
the most subtle and subjective. Once
there is a mental model of how a dis-
tributed system behaves (at least in the
common case), how is it used to help
choose the appropriate faults to inject?
At this point we are forced to dabble in
conjecture: bear with us.

Fault tolerance is redundancy. Giv-
en some fixed set of faults, we say that
a system is “fault tolerant” exactly if it
operates correctly in all executions in
which those faults occur. What does it
mean to “operate correctly”? Correct-
ness is a system-specific notion, but,
broadly speaking, is expressed in terms

Figure 1. Our vision of automated failure
testing.

explanations
models

of
redundancy

fault

injection

Figure 2. Fault injection and fault-tolerant code.

APP1 APP1 APP2 APP2
caller

fault

callee

API API API API API

J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 59

practice

ability infrastructure and fault injec-
tion infrastructure, consuming the
former, maintaining a model of system
redundancy, and using it to param-
eterize the latter. Explanations of sys-
tem outcomes and fault injection in-
frastructures are already available. In
the current state of the art, the puzzle
piece that fits them together (models of
redundancy) is a manual process. LDFI
(as we will explain) shows that automa-
tion of this component is possible.

A Blast from the Past
In previous work, we introduced a bug-
finding tool called LDFI (lineage-driven
fault injection).2 LDFI uses data prove-
nance collected during simulations of
distributed executions to build deriva-
tion graphs for system outcomes. These
graphs function much like the models
of system redundancy described ear-
lier. LDFI then converts the derivation
graphs into a Boolean formula whose
satisfying assignments correspond to
combinations of faults that invalidate
all derivations of the outcome. An ex-
periment targeting those faults will
then either expose a bug (that is, the ex-
pected outcome fails to occur) or reveal
additional derivations (for example, af-
ter a timeout, the system fails over to a
backup) that can be used to enrich the
model and constrain future solutions.

At its heart, LDFI reapplies well-
understood techniques from data
management systems, treating fault
tolerance as a materialized view main-
tenance problem.2,13 It models a dis-
tributed system as a query, its expect-
ed outcomes as query outcomes, and
critical facts such as “replica A is up at
time t” and “there is connectivity be-
tween nodes X and Y during the inter-
val i . . . j” as base facts. It can then ask
a how-to query:37 What changes to base
data will cause changes to the derived
data in the view? The answers to this
query are the faults that could, accord-
ing to the current model, invalidate the
expected outcomes.

The idea seems far-fetched, but the
LDFI approach shows a great deal of
promise. The initial prototype demon-
strated the efficacy of the approach at
the level of protocols, identifying bugs
in replication, broadcast, and commit
protocols.2,46 Notably, LDFI reproduced
a bug in the replication protocol used by
the Kafka distributed log26 that was first

(manually) identified by Kingsbury.30
A later iteration of LDFI is deployed at
Netflix,1 where (much like the illustra-
tion in Figure 1) it was implemented
as a microservice that consumes traces
from a call-graph repository service and
provides inputs for a fault injection ser-
vice. Since its deployment, LDFI has
identified 11 critical bugs in user-fac-
ing applications at Netflix.1

Rumors from the Future
The prior research presented earlier is
only the tip of the iceberg. Much work
still needs to be undertaken to realize
the vision of fully automated failure
testing for distributed systems. Here,
we highlight nascent research that
shows promise and identifies new di-
rections that will help realize our vision.

Don’t overthink fault injection. In the
context of resiliency testing for distribut-
ed systems, attempting to enumerate
and faithfully simulate every possible
kind of fault is a tempting but dis-
tracting path. The problem of under-
standing all the causes of faults is not
directly relevant to the target, which
is to ensure that code (along with its
configuration) intended to detect and
mitigate faults performs as expected.

Consider Figure 2: The diagram on
the left shows a microservice-based
architecture; arrows represent calls
generated by a client request. The
right-hand side zooms in on a pair of
interacting services. The shaded box
in the caller service represents the
fault tolerance logic that is intended
to detect and handle faults of the cal-
lee. Failure testing targets bugs in this
logic. The fault tolerance logic targeted
in this bug search is represented as the
shaded box in the caller service, while
the injected faults affect the callee.

The common effect of all faults, from
the perspective of the caller, is explicit
error returns, corrupted responses,
and (possibly infinite) delay. Of these
manifestations, the first two can be ad-
equately tested with unit tests. The last
is difficult to test, leading to branches
of code that are infrequently executed.
If we inject only delay, and only at com-
ponent boundaries, we conjecture that
we can address the majority of bugs re-
lated to fault tolerance.

Explanations everywhere. If we can
provide better explanations of system
outcomes, we can build better models

The rapid evolution
of fault injection
infrastructure
makes it easier
than ever to test
fault hypotheses
on large-scale
systems.

60 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1

practice

to embrace (rather than abstracting
away) this uncertainty.

Distributed systems are probabi-
listic by nature and are arguably bet-
ter modeled probabilistically. Future
directions of work include the proba-
bilistic representation of system re-
dundancy and an exploration of how
this representation can be exploited to
guide the search of fault experiments.
We encourage the research community
to join in exploring alternative internal
representations of system redundancy.

Turning the explanations inside
out. Most of the classic work on data
provenance in database research has
focused on aspects related to human-
computer interaction. Explanations of
why a query returned a particular result
can be used to debug both the query
and the initial database—given an un-
expected result, what changes could be
made to the query or the database to fix
it? By contrast, in the class of systems
we envision (and for LDFI concretely),
explanations are part of the internal
language of the reasoner, used to con-
struct models of redundancy in order
to drive the search through faults.

Ideally, explanations should play a
role in both worlds. After all, when a
bug-finding tool such as LDFI identi-
fies a counterexample to a correctness
property, the job of the programmers
has only just begun—now they must un-
dertake the onerous job of distributed
debugging. Tooling around debugging
has not kept up with the explosive pace
of distributed systems development.
We continue to use tools that were de-
signed for a single site, a uniform mem-
ory, and a single clock. While we are not
certain what an ideal distributed debug-
ger should look like, we are quite certain
that it does not look like GDB (GNU Proj-
ect debugger).36 The derivation graphs
used by LDFI show how provenance can
also serve a role in debugging by provid-
ing a concise, visual explanation of how
the system reached a bad state.

This line of research can be pushed
further. To understand the root causes
of a bug in LDFI, a human operator
must review the provenance graphs of
the good and bad executions and then
examine the ways in which they differ.
Intuitively, if you could abstractly
subtract the (incomplete by assump-
tion) explanations of the bad outcomes
from the explanations of the good out-

of redundancy. Unfortunately, a bar-
rier to entry for systems such as LDFI
is the unwillingness of software de-
velopers and operators to instrument
their systems for tracing or provenance
collection. Fortunately, operating sys-
tem-level provenance-collection tech-
niques are mature and can be applied
to uninstrumented systems.

Moreover, the container revolution
makes simulating distributed execu-
tions of black box software within a
single hypervisor easier than ever. We
are actively exploring the collection
of system call-level provenance from
unmodified distributed software in
order to select a custom-tailored fault
injection schedule. Doing so requires
extrapolating application-level causal
structure from low-level traces, iden-
tifying appropriate cut points in an
observed execution, and finally syn-
chronizing the execution with fault
injection actions.

We are also interested in the pos-
sibility of inferring high-level explana-
tions from even noisier signals, such as
raw logs. This would allow us to relax
the assumption that the systems un-
der study have been instrumented to
collect execution traces. While this is
a difficult problem, work such as the
Mystery Machine12 developed at Face-
book shows great promise.

Toward better models. The LDFI
system represents system redundancy
using derivation graphs and treats the
task of identifying possible bugs as a
materialized-view maintenance prob-
lem. LDFI was hence able to exploit
well-understood theory and mecha-
nisms from the history of data man-
agement systems research. But this is
just one of many ways to represent how
a system provides alternative computa-
tions to achieve its expected outcomes.

A shortcoming of the LDFI approach
is its reliance on assumptions of de-
terminism. In particular, it assumes
that if it has witnessed a computation
that, under a particular contingency
(that is, given certain inputs and in the
presence of certain faults), produces
a successful outcome, then any future
computation under that contingency
will produce the same outcome. That
is to say, it ignores the uncertainty in
timing that is fundamental to distrib-
uted systems. A more appropriate way
to model system redundancy would be

The container
revolution makes
simulating
distributed
executions of
black-box software
within a single
hypervisor easier
than ever.

J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 61

practice

36. Matloff, N., Salzman, P.J. The Art of Debugging with
GDB, DDD, and Eclipse. No Starch Press, 2008.

37. Meliou, A., Suciu, D. Tiresias: The database oracle for
how-to queries. Proceedings of the ACM SIGMOD
International Conference on the Management of Data
(2012), 337-348.

38. Microsoft Azure Documentation. Introduction to the
fault analysis service, 2016; https://azure.microsoft.
com/en-us/documentation/articles/ service-fabric-
testability-overview/.

39. Musuvathi, M. et al. CMC: A pragmatic approach to
model checking real code. ACM SIGOPS Operating
Systems Review. In Proceedings of the 5th Symposium
on Operating Systems Design and Implementation 36
(2002), 75–88.

40. Musuvathi, M. et al. Finding and reproducing
Heisenbugs in concurrent programs. In Proceedings
of the 8th Usenix Conference on Operating Systems
Design and Implementation (2008), 267–280.

41. Newcombe, C. et al. Use of formal methods at
Amazon Web Services. Technical Report, 2014; http://
lamport.azurewebsites.net/tla/formal-methods-
amazon .

42. Olston, C., Reed, B. Inspector Gadget: A framework
for custom monitoring and debugging of distributed
data flows. In Proceedings of the ACM SIGMOD
International Conference on the Management of Data
(2011), 1221–1224.

43. OpenTracing. 2016; http://opentracing.io/.
44. Pasquier, T.F. J.-M., Singh, J., Eyers, D.M., Bacon, J.

CamFlow: Managed data-sharing for cloud services,
2015; https://arxiv.org/pdf/1506.04391 .

45. Patterson, D.A., Gibson, G., Katz, R.H. A case for
redundant arrays of inexpensive disks (RAID). In
Proceedings of the 1988 ACM SIGMOD International
Conference on Management of Data, 109–116;
http://web.mit.edu/6.033/2015/wwwdocs/papers/
Patterson88 .

46. Ramasubramanian, K. et al. Growing a protocol. In
Proceedings of the 9th Usenix Workshop on Hot Topics
in Cloud Computing (2017).

47. Reinhold, E. Rewriting Uber engineering: The
opportunities microservices provide. Uber Engineering,
2016; https: //eng.uber.com/building-tincup/.

48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end
arguments in system design. ACM Trans. Computing
Systems 2, 4 (1984): 277–288.

49. Sandberg, R. The Sun network file system: design,
implementation and experience. Technical report, Sun
Microsystems. In Proceedings of the Summer 1986
Usenix Technical Conference and Exhibition.

50. Shkuro, Y. Jaeger: Uber’s distributed tracing system.
Uber Engineering, 2017; https://uber.github.io/jaeger/.

51. Sigelman, B.H. et al. Dapper, a large-scale distributed
systems tracing infrastructure. Technical report.
Research at Google, 2010; https://research.google.
com/pubs/pub36356.html.

52. Shenoy, A. A deep dive into Simoorg: Our open source
failure induction framework. Linkedin Engineering,
2016; https://engineering.linkedin.com/blog/2016/03/
deep-dive-Simoorg-open-source-failure-induction-
framework.

53. Yang, J. et al.L., Zhou, L. MODIST: Transparent
model checking of unmodifed distributed systems.
In Proceedings of the 6th Usenix Symposium on
Networked Systems Design and Implementation
(2009), 213–228.

54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+
specifications. In Proceedings of the 10th IFIP WG
10.5 Advanced Research Working Conference on
Correct Hardware Design and Verification Methods
(1999), 54–66.

55. Zhao, X. et al. Lprof: A non-intrusive request flow
profiler for distributed systems. In Proceedings of the
11th Usenix Conference on Operating Systems Design
and Implementation (2014), 629–644.

Peter Alvaro is an assistant professor of computer
science at the University of California Santa Cruz,
where he leads the Disorderly Labs research group
(disorderlylabs.github.io).

Severine Tymon is a technical writer who has written
documentation for both internal and external users
of enterprise and open source software, including for
Microsoft, CNET, VMware, and Oracle.

comes,10 then the root cause of the dis-
crepancy would be likely to be near the
“frontier” of the difference.

Conclusion
A sea change is occurring in the tech-
niques used to determine whether
distributed systems are fault tolerant.
The emergence of fault injection ap-
proaches such as Chaos Engineering
and Jepsen is a reaction to the erosion
of the availability of expert program-
mers, formal specifications, and uni-
form source code. For all of their prom-
ise, these new approaches are crippled
by their reliance on superusers who
decide which faults to inject.

To address this critical shortcom-
ing, we propose a way of modeling and
ultimately automating the process
carried out by these superusers. The
enabling technologies for this vision
are the rapidly improving observabil-
ity and fault injection infrastructures
that are becoming commonplace in
the industry. While LDFI provides con-
structive proof that this approach is
possible and profitable, it is only the
beginning. Much work remains to be
done in targeting faults at a finer grain,
constructing more accurate models of
system redundancy, and providing bet-
ter explanations to end users of exactly
what went wrong when bugs are identi-
fied. The distributed systems research
community is invited to join in explor-
ing this new and promising domain.