As part of your doctoral seminar for this set of weeks, you are participating in a seminar-style discussion about the weekly topics. Recall that you were asked to address 5 of the Required Resources and at least 5 additional resources from the Walden Library and to incorporate them into your posting. As a related exercise, submit an annotated bibliography of the 10 resources you referred to this week. For each entry, be sure to address the following as a minimum:
- Include the full APA citation
- Discuss the scope of the resource
- Discuss the purpose and philosophical approach
- Discuss the underlying assumptions
- If referring to a research reporting article, present the methodology
- Relate the resource to the body of resources you have consulted in this course
- Discuss any evident limitations and opportunities for further inquiry
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 1, May 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814
Different Forms of Software Testing Techniques for Finding Errors
Mohd. Ehmer Khan
Department of Information Technology
Al Musanna College of Technology, Sultanate of Oman
Abstract
Software testing is an activity aimed at evaluating an attribute or capability of a program and ensuring that it meets the required result. There are many approaches to software testing, but effective testing of a complex product is essentially a process of investigation, not merely a matter of creating and following rote procedures. It is often impossible to find all the errors in a program. This fundamental problem in testing raises an open question: what strategy should we adopt for testing? Selecting the right strategy at the right time makes software testing efficient and effective. In this paper I describe software testing techniques classified by purpose.
Keywords: Correctness Testing, Performance Testing, Reliability Testing, Security Testing
1. Introduction
Software testing is a set of activities conducted with the intent of finding errors in software. It also verifies and validates whether the program is working correctly, with no bugs. It analyzes the software to find bugs. Software testing is not just used for finding and fixing bugs; it also ensures that the system works according to its specifications. Software testing is a series of processes designed to make sure that computer code does what it was designed to do. Software testing is a destructive process of trying to find errors. The main purpose of testing can be quality assurance, reliability estimation, validation or verification. Other objectives of software testing include: [6][7][8]
- The better the software works, the more efficiently it can be tested.
- The better the software can be controlled, the more the testing can be automated and optimized.
- The fewer the changes, the fewer the disruptions to testing.
- A successful test is one that uncovers an undiscovered error.
- Testing is a process to identify the correctness and completeness of the software.
- The general objective of software testing is to affirm the quality of the software system by systematically exercising it in carefully controlled circumstances.
Classified by purpose, software testing can be divided into [4]
1. Correctness Testing
2. Performance Testing
3. Reliability Testing
4. Security Testing
2. Software Testing Techniques
Software testing is a process used to measure the quality of the developed software. It is also a process of uncovering errors in a program and making error detection a feasible task. It is the process of executing a program with the intent of finding bugs. The diagram below represents some of the most prevalent software testing techniques, classified by purpose. [4]
Fig. 1 Different software testing techniques classified by purpose: correctness, performance, reliability and security testing
2.1 Correctness Testing
The most essential purpose of testing is correctness
which is also the minimum requirement of software.
Correctness testing tells the right behavior of system
from the wrong one for which it will need some type of
Oracle. Either a white box point of view or black box
point of view can be taken in testing software as a tester
may or may not know the inside detail of the software
module under test. For e.g. Data flow, Control flow etc.
The ideas of white box, black box or grey box testing are
not limited to correctness testing only. [4]
Fig. 2 Various forms of correctness testing: white box, black box and grey box testing
2.1.1 White Box Testing
White box testing is based on an analysis of the internal workings and structure of a piece of software. It is the process of giving input to the system and checking how the system processes that input to generate the required output. It requires the tester to have full knowledge of the source code. White box testing is applicable at the unit, integration and system levels of the software testing process. In white box testing one can be sure that all paths through the test object are properly executed. [2][10]
Fig. 3 Working process of white box testing: input passes through the system process to produce output, with the internal workings analyzed
Some synonyms of white box testing are [5]
- Logic Driven Testing
- Design Based Testing
- Open Box Testing
- Transparent Box Testing
- Clear Box Testing
- Glass Box Testing
- Structural Testing
Some important types of white box testing techniques
are:
1. Control Flow Testing
2. Branch Testing
3. Path Testing
4. Data Flow Testing
5. Loop Testing
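To make the control flow and branch testing ideas concrete, here is a small illustrative sketch (not from the paper): white-box test cases are chosen from the source code so that every branch of a hypothetical Python function executes at least once.

```python
def classify_triangle(a, b, c):
    """Classify a triangle by its three side lengths."""
    if a + b <= c or b + c <= a or a + c <= b:
        return "invalid"       # branch 1: sides cannot form a triangle
    if a == b == c:
        return "equilateral"   # branch 2: all sides equal
    if a == b or b == c or a == c:
        return "isosceles"     # branch 3: exactly two sides equal
    return "scalene"           # branch 4: all sides differ

# White-box test suite: one case per branch, derived from the source code.
assert classify_triangle(1, 2, 3) == "invalid"
assert classify_triangle(2, 2, 2) == "equilateral"
assert classify_triangle(2, 2, 3) == "isosceles"
assert classify_triangle(3, 4, 5) == "scalene"
```

A coverage tool could then confirm that these four cases exercise every branch, which is exactly the kind of assurance white box testing provides.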
There are some pros and cons of white box testing:
Pros:
1. Beneficial side effects are revealed.
2. Errors in hidden code are revealed.
3. It approximates the partitioning done by execution equivalence.
4. The developer reasons carefully about the implementation.
Cons:
1. It is very expensive.
2. Cases omitted from the code are missed.
2.1.2 Black Box Testing
Basically, black box testing is an integral part of correctness testing, but its ideas are not limited to correctness testing only. Correctness testing is a method classified by purpose in software testing.
Black box testing is based on an analysis of the specifications of a piece of software without reference to its internal workings. The goal is to test how well the component conforms to the published requirements for the component. Black box testing has little or no regard for the internal logical structure of the system; it only examines the fundamental aspects of the system. It makes sure that input is properly accepted and output is correctly produced. In black box testing, the integrity of external information is maintained. The black box testing methods in which user involvement is not required are functional testing, stress testing, load testing, ad-hoc testing, exploratory testing, usability testing, smoke testing, recovery testing and volume testing; the black box testing techniques where user involvement is required are user acceptance testing,
alpha testing and beta testing. Other types of black box testing methods include graph-based testing, equivalence partitioning, boundary value analysis, comparison testing, orthogonal array testing, specialized testing, fuzz testing, and traceability metrics. [2]
Fig. 4 Working process of black box testing: input passes through the system process to produce output, with only the fundamental aspects analyzed
There are various pros and cons of black box testing: [5]
Pros:
1. The black box tester has no “bond” with the code.
2. The tester's perception is very simple.
3. Programmer and tester are independent of each other.
4. It is more effective on larger units of code than clear box testing.
Cons:
1. Test cases are hard to design without clear specifications.
2. Only a small number of the possible inputs can actually be tested.
3. Some parts of the back end are not tested at all.
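Two of the black box methods listed above, equivalence partitioning and boundary value analysis, can be sketched against a hypothetical specification (an order quantity must be an integer from 1 to 100); the function below is an illustrative stand-in, not taken from the paper.

```python
def accept_quantity(q):
    """Hypothetical specification: accept integer quantities from 1 to 100."""
    return isinstance(q, int) and 1 <= q <= 100

# Equivalence partitioning: one representative value per input class.
assert not accept_quantity(-5)    # class: below the valid range
assert accept_quantity(50)        # class: inside the valid range
assert not accept_quantity(500)   # class: above the valid range

# Boundary value analysis: probe each edge of the range and its neighbours.
for q, expected in [(0, False), (1, True), (100, True), (101, False)]:
    assert accept_quantity(q) == expected
```

Note that these cases are derived purely from the specification, never from the code, which is what makes the technique black box.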
2.1.3 Grey Box Testing
Grey box testing combines the testing methodologies of white box and black box testing. It is used for testing a piece of software against its specification, but using some knowledge of its internal workings as well. [2]
Grey box testing may also include reverse engineering to determine, for instance, boundary values or error messages. It is a process which involves testing software while already having some knowledge of its underlying code or logic. The understanding of the internals of the program in grey box testing is greater than in black box testing, but less than in clear box testing. [11]
2.2 Performance Testing
‘Performance testing’ involves all the phases of the mainstream testing life cycle as an independent discipline, including strategy, plan, design, execution, analysis and reporting. This testing is conducted to evaluate the compliance of a system or component with specified performance requirements. [2]
Evaluating the performance of any software system includes resource usage, throughput and stimulus-response time.
By performance testing we can measure the performance characteristics of any application. One of the most important objectives of performance testing is to maintain low latency of a website, high throughput and low utilization. [5]
Fig. 5 Two types of performance testing: load testing and stress testing
Some of the main goals of performance testing are: [5]
- Measuring the response time of end-to-end transactions.
- Measuring the network delay between client and server.
- Monitoring system resources under various loads.
Some of the common mistakes made during performance testing are: [5]
- Ignoring errors in the input.
- Overly complex analysis.
- Erroneous analysis.
- An inappropriate level of detail.
- Ignoring significant factors.
- Incorrect performance metrics.
- Overlooking important parameters.
- An unsystematic approach.
There are seven different phases in the performance testing process: [5]
Phase 1 – Requirement Study
Phase 2 – Test Plan
Phase 3 – Test Design
Phase 4 – Scripting
Phase 5 – Test Execution
Phase 6 – Test Analysis
Phase 7 – Preparation of Report
Fig. 6 Performance testing process: requirement collection, plan preparation, design, scripting, execution and analysis; if the goal is not achieved the cycle repeats, otherwise the final report is prepared
Typically, to debug applications, developers execute them using different execution streams that completely exercise the application in an attempt to find errors. Performance is a secondary issue when looking for errors in an application; however, it is still an issue.
There are two kinds of performance testing:
2.2.1 Load Testing
Load testing is an industry term for the effort of performance testing. Its main feature is to determine whether the given system can handle the anticipated number of users. This is done by making virtual users behave like real users, so that load testing is easy to perform. It is carried out to check whether the system performs well for the specified number of users. Load testing increases the uptime of critical web applications by helping us to spot the bottlenecks in a system that is under heavy user stress.
Load testing is also used to check an application against heavy loads or inputs, such as testing a website in order to find out at what point the website or application fails, or at what point its performance degrades. [2][5]
Two ways of implementing load testing are:
1. Manual testing: not a very practical option, as it is very iterative in nature and involves measuring response times and comparing the results. [5]
2. Automated testing: compared to manual load testing, automated load testing tools provide more efficient and cost-effective solutions, because with automated tools a test can easily be rerun any number of times, which decreases the chance of human error during testing. [5]
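As a minimal sketch of automated load testing (the request handler, user count and timings are all hypothetical, not from the paper), virtual users can be simulated with a thread pool while response times are recorded:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request():
    """Stand-in for one call to the system under test."""
    time.sleep(0.01)  # simulated processing delay

def load_test(virtual_users, requests_per_user):
    """Drive the system with concurrent virtual users; return all latencies."""
    latencies = []  # list.append is atomic in CPython, safe across threads

    def one_user():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            handle_request()
            latencies.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        for _ in range(virtual_users):
            pool.submit(one_user)
    return latencies

latencies = load_test(virtual_users=10, requests_per_user=5)
print(f"requests: {len(latencies)}, worst latency: {max(latencies):.3f}s")
```

Rerunning such a script with growing user counts and plotting the worst latency is one simple way to locate the bottleneck the text describes.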
2.2.2 Stress Testing
We can define stress testing as performing random operational sequences, at larger than normal volumes, at faster than normal speeds and for longer than normal periods of time, as a method of accelerating the rate of finding defects and verifying the robustness of our product. Alternatively, stress testing is testing conducted to evaluate a system or component at or beyond the limits of its specified requirements, to determine the load under which it fails and how. Stress testing also determines the behaviour of the system as the user base increases. In stress testing the application is tested against heavy loads, such as a large number of inputs, a large number of queries, etc. [2][5]
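The random-operation-sequence idea above can be sketched as follows; the bounded queue is a hypothetical stand-in for a real system under test, and the operation mix is an illustrative assumption:

```python
import random

class BoundedQueue:
    """Toy system under test: a FIFO queue with a hard capacity limit."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def put(self, x):
        if len(self.items) >= self.capacity:
            raise OverflowError("queue full")
        self.items.append(x)

    def get(self):
        if not self.items:
            raise IndexError("queue empty")
        return self.items.pop(0)

def stress_test(ops=10_000, seed=42):
    """Fire a long random operation sequence; count graceful rejections."""
    random.seed(seed)  # fixed seed so the sequence is reproducible
    q = BoundedQueue(capacity=8)
    rejections = 0
    for _ in range(ops):
        try:
            if random.random() < 0.5:
                q.put(object())
            else:
                q.get()
        except (OverflowError, IndexError):
            rejections += 1  # failing at the limit is expected and graceful
    return rejections

print("graceful rejections:", stress_test())
```

A crash or hang here, rather than a cleanly raised exception, would be exactly the kind of robustness defect stress testing is meant to surface.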
There are some weak and strong points of stress testing.
Weak Points
1. It cannot test the correctness of a system.
2. Defects are not easily reproducible.
3. It does not represent a real-world situation.
Strong Points
1. No other type of test can find defects the way stress testing can.
2. The robustness of the application is tested.
3. It is very helpful in finding deadlocks.
2.3 Reliability Testing
Fig. 7 Reliability testing and its variant, robustness testing
‘Reliability testing’ is very important, as it discovers the failures of a system and removes them before the system is deployed. Reliability testing relates to many aspects of software in which the testing process is included; this testing process is an effective sampling method for measuring software reliability. In reliability testing an estimation model is prepared and used to analyze the data, in order to estimate the present reliability of the software and predict its future reliability. [4][2]
Depending on that estimate, the developers can decide whether to release the software, and the end users can decide whether to adopt it. Based on the reliability information, the risk of using the software can also be assessed. Robustness testing and stress testing are variants of reliability testing. By robustness we mean how a software component works under stressful environmental conditions. Robustness testing watches only for robustness problems such as machine crashes and abnormal terminations. Robustness testing is very portable and scalable. [4]
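As a hedged illustration of the estimation model mentioned above (the exponential reliability model and the failure data are assumptions for the sketch, not taken from the paper), present reliability might be estimated like this:

```python
import math

def estimate_mtbf(failure_intervals_hours):
    """Estimate mean time between failures from observed test intervals."""
    return sum(failure_intervals_hours) / len(failure_intervals_hours)

def reliability_at(t_hours, mtbf):
    """Exponential model: probability of running t hours without failure."""
    return math.exp(-t_hours / mtbf)

# Hypothetical intervals (hours) between failures observed during testing.
intervals = [120.0, 95.0, 150.0, 110.0, 125.0]
mtbf = estimate_mtbf(intervals)
print(f"MTBF = {mtbf:.0f} h, R(24 h) = {reliability_at(24, mtbf):.3f}")
```

Numbers like these are what feed the release/adopt decision the text describes: a predicted 24-hour reliability can be compared against a target before shipping.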
2.4 Security Testing
‘Security testing’ makes sure that only authorized personnel can access the program, and only the functions available to their security level. Security testing of any developed system (or system under development) is all about finding the major loopholes and weaknesses of the system which could cause major harm through an unauthorized user. [1][2]
Security testing is very helpful to the tester in finding and fixing problems. It ensures that the system will run for a long time without any major problem, and that the systems used by an organization are secured against unauthorized attack. In this way, security testing is beneficial to the organization in all respects. [1][2]
Five major concepts covered by security testing are:
- Confidentiality: security testing ensures the confidentiality of the system, i.e. no disclosure of information to any party other than the intended recipient.
- Integrity: security testing maintains the integrity of the system by allowing the receiver to determine that the information received is correct.
- Authentication: security testing maintains the authentication mechanisms of the system; WPA, WPA2 and WEP are several forms of authentication.
- Availability: information is always kept available to authorized personnel whenever they need it, assuring that information services are ready for use whenever expected.
- Authorization: security testing ensures that only authorized users can access the information or a particular service. Access control is an example of authorization.
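The authorization concept above can be made concrete with a minimal access control sketch; the role names and permissions are hypothetical illustrations, not from the paper.

```python
# Hypothetical role-to-permission table for the system under test.
PERMISSIONS = {
    "admin":   {"read", "write", "configure"},
    "analyst": {"read", "write"},
    "guest":   {"read"},
}

def authorize(role, action):
    """Allow an action only if the role's security level includes it."""
    return action in PERMISSIONS.get(role, set())

# Security test cases: authorized access succeeds, everything else is denied.
assert authorize("admin", "configure")
assert authorize("analyst", "read")
assert not authorize("guest", "write")      # attempted privilege escalation
assert not authorize("intruder", "read")    # unknown role gets nothing
```

Security test suites routinely include the negative cases, since the harmful behaviour is access that should have been denied but was not.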
Fig. 8 Various types of security testing: security auditing, security scanning, vulnerability scanning, posture assessment and security testing, risk assessment, penetration testing and ethical hacking
The different types of security testing in any organization are as follows: [3]
1. Security Auditing and Scanning: security auditing includes direct inspection of the operating system and of the system on which it was developed. In security scanning the auditor scans the operating system and then tries to find weaknesses in the operating system and network.
2. Vulnerability Scanning: vulnerability scanning software scans the program for all known vulnerabilities.
3. Risk Assessment: a method in which the auditors analyze the risk involved with a system and the probability of the loss that may occur because of that risk. It is analyzed through interviews, discussions, etc.
4. Posture Assessment and Security Testing: helps the organization to know where it stands in the context of security by combining the features of security scanning, risk assessment and ethical hacking.
5. Penetration Testing: an effective way to find potential loopholes in a system; it is done by a tester who forcibly enters the application under test, using a combination of loopholes that the application has unknowingly left open.
6. Ethical Hacking: involves a large number of penetration tests on a system under test, in order to stop the forced entry of any external elements into a system which is under security testing.
3. Conclusion
Software testing is an important technique for improving and measuring the quality of a software system. But it is really not possible to find all the errors in a program, so the fundamental question arises: which strategy should we adopt for testing? In this paper I have described some of the most prevalent and commonly used software testing strategies, classified by purpose: [5]
1. Correctness testing, which tests the right behavior of the system and is further divided into black box, white box and grey box testing (the last combining the features of black box and white box testing).
2. Performance testing, an independent discipline which involves all the phases of the mainstream testing life cycle, i.e. strategy, plan, design, execution, analysis and reporting. Performance testing is further divided into load testing and stress testing.
3. Reliability testing, which discovers the failures of the system and removes them before the system is deployed.
4. Security testing, which makes sure that only authorized personnel can access the system, and is further divided into security auditing and scanning, vulnerability scanning, risk assessment, posture assessment and security testing, penetration testing and ethical hacking.
The successful use of these techniques in industrial software development will validate the results of the research and drive future research. [8]
References:
[1] N. Parekh, "Software Testing - Brief Introduction to Security Testing," 14 July 2006, available at http://www.buzzle.com/editorial/7-14-2006-102344.asp
[2] Software testing glossary, available at http://www.aptest.com/glossary.html#performancetesting
[3] P. Herzog, Open Source Security Testing Methodology Manual, Institute for Security and Open Methodology (ISECOM).
[4] J. Pan, "Software Testing," available at http://www.ece.cmu.edu/~roopman/des-899/sw_testing/
[5] Software Testing, Cognizant Technology Solutions.
[6] Introduction to software testing, available at http://www.onestoptetsing.com/introduction/
[7] Software testing techniques, available at http://pesona.mmu.edu.my/~wruslan/SE3/Readings/GB1/pdf/ch14-GB1
[8] L. Luo, paper available at http://www.cs.cmu.edu/~luluo/Courses/17939Report
[9] Security testing, Wikipedia, the free encyclopedia, available at http://en.wikipedia.org/wiki/security-tetsing
[10] White box testing, Wikipedia, the free encyclopedia.
[11] Grey box testing, Wikipedia, available at http://en.wikipedia.org/wiki/grey_box_testing#grey_box_tetsing
Mohd. Ehmer Khan
I completed my B.Sc. in 1997 and M.C.A. in 2001 at Aligarh Muslim University, Aligarh, India, and am pursuing a Ph.D. (Computer Science) at Singhania University, Jhunjhunu, India. I worked as a lecturer at Aligarh College Engineering & Management, Aligarh, India from 1999 to 2003, and at the Institute of Foreign Trade & Management, Moradabad, India from 2003 to 2005. Since 2006 I have been working as a lecturer in the Department of Information Technology, Al Musanna College of Technology, Ministry of Manpower, Sultanate of Oman. I am a recipient of the PG Merit Scholarship in MCA. My research area is software engineering, with special interest in driving and monitoring program executions to find bugs using various software testing techniques.
Computer and Information Science, Vol. 3, No. 2, May 2010 (www.ccsenet.org/cis)
4Ps of Business Requirements Analysis for Software Implementation
Mingtao Shi
FOM Fachhochschule für Oekonomie & Management
University of Applied Science
Bismarckstr. 107, 10625 Berlin, Germany
Tel: 49-171-2881-169 E-mail: Consulting_Shi@yahoo.de
Abstract
The introduction of new software applications to achieve significant improvements in business performance is a general phenomenon that can be observed in a variety of firms and industries. While carrying out such complex activities, firms frequently struggle with quality and time, both of which, as this paper argues, can be mastered by basing the implementation upon the 4Ps of business requirements analysis: Process, Product, Parameter and Project.
Keywords: Requirements analysis, Software implementation, Process analysis, Product analysis,
Parameterisation, Project management
1. Business Requirements and Software Implementation
Industrial firms today are achieving significant scale and scope advantages by introducing new software
applications tailored to firm-specific value chain activities. The pervasive deployment of software and
burgeoning growth of specialist vendors has fostered the emergence of industrial applications based upon core
systems that can be individualised by parameterisation and customisation. Flexible core systems have become
the general pattern of dominant software products in a variety of industries. This trend is especially favourable
for small and medium-sized firms that necessarily concentrate on software application rather than on software
production. These firms typically have their own low-budget IT department or outsource their software-related activities to external system integrators or software consultants.
The internal IT department is organically integrated and may be provided with business knowledge quickly, but
it is in most cases too small to conduct software development from scratch for comprehensive business activities.
The external IT specialists may have remarkable software knowledge, but it is organisationally more difficult for
them to assimilate business information of the potential software user firm. Core software systems that can be
parameterised and customised smoothly fit in such business and technical scenarios that are common in a wide
range of industries, such as wholesales, logistics and banking.
How to implement such core systems in accordance with the firm-specific business requirements therefore has
become an area of common interest for practitioners as well as for academicians because of its huge financial
implication. Most of such software systems are expensive. Software itself must be paid. Hardware must be
purchased to accommodate the software. Personnel are to be trained for configuring and using the applications,
most probably, on a short-term run. Furthermore, unsuccessful implementation would lead to inefficient or even
insufficient business performance, causing further unmanageable costs. Under such circumstances, software
requirements analysis for system implementation undoubtedly is crucial for a firm to further prosper or even
survive in the marketplace.
Classical approaches in the area of requirements engineering are rigorously defined and extensively discussed in
the literature. Probably the techniques advanced by Pressman (2004) and Pressman (2009) are most systematic,
which aver that requirements analysis centres on a few key elements, including scenario-based, flow-based,
behaviour-based, and class-based system views, and the data modelling. Other authors have argued more or less
in a similar manner (see Wiegers, 2003; Robertson & Robertson, 2006; Pohl, 2007). However, these rather
technical methods are beneficial for software development from scratch. On the one hand, most small and medium-sized business firms lack the capability, or are reluctant to spend the resources, to conduct such highly technical analysis. On the other hand, firms adopting software applications would not be able to touch the
codes and there is no need for them to manipulate the underlying technical design in the core. The essence of the
challenge for them is rather the match between business requirements and software application through
parameterisation and customisation.
2. Analysis of Business Processes
The depictions of business processes are vital notations for describing, examining and streamlining a firm’s value activities.
The business requirements analysis may begin with process mapping. The typical notation in this context is the
activity diagram. Although textual use-cases are also widely used for this purpose, the activity diagram is less
time-consuming and more powerful in terms of ease of use and intelligibility, especially for implementation
projects with stringent time demand.
Professionals dealing with Software Engineering frequently use Unified Modelling Language (UML) as a
unified standard platform to map the business processes for requirements analysis purposes. However, in order to
gain the necessary skill, potential analysts have to take part in formal trainings. Furthermore, commercial UML
tools equipped with regular upgrades and updates are likely to be expensive. Microsoft Excel, on the contrary, is
available to most firms and can also be used for process mapping. Whichever tool is used, a number of essential
aspects must be taken into account to sufficiently decipher important information in business processes.
A role is a group of system users performing the same functions in the business processes. The description in the
activity diagram needs to unambiguously define the role conducting a certain activity (process step). If necessary,
detailed explanations can be given to each activity, in order to map the essential content of the activity. It is
highly recommended that the system-related activities are highlighted. By doing so, the business specialists and
the analysts can subsequently define the data fields of system inputs and outputs at a particular system process step. It is beneficial if these inputs and outputs are streamlined at a later stage to make the data structure of the
future application more meaningful. System printouts need to be indicated, analysed and carefully defined at
relevant process steps. Although the activity diagram is not supposed to be a system specification, it is
recommended that the analysts define as much detailed information as possible, in order to gain time advantages
for further implementation. In a business environment where time is always in short supply, successful process mapping with detailed information might be a possible substitute for a time-consuming specification.
Importantly, the mapped processes then need to be tested in the pre-configured software to be introduced, so that
the business specialists can witness that the mapped processes can be realised by the system. This kind of test is
not the highly stringent software testing in the traditional sense, but a kind of functional and performance tests at
a high level. The result of the test is a brief but written Gap Analysis, documenting the activities, process steps or
step sequences that cannot be exactly realised by the system. In most cases, the solution to bridge the disparities
can be found by providing workarounds and customisation. Workarounds are techniques that utilise and combine
the existing system features, in order to achieve the missed linkages to the business processes. Customisation
means that programming at the application level is necessary to map the system to the processes identified,
though without touching the source code in the core system. While implementing a new software system, or a new generation of one, changes to the core system (development efforts by the coding firm) are generally not recommended unless unavoidable, because they would demand intensive time and resource capacity for expertise exchange between the implementing firm and the coding firm. Painful costs are foreseeable.
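One way to keep the written Gap Analysis structured and checkable is sketched below; the fields and the two sample entries are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class GapItem:
    """One documented disparity between a mapped process step and the system."""
    process_step: str
    expected: str
    system_behaviour: str
    resolution: str  # "workaround", "customisation" or "core change"

gap_analysis = [
    GapItem("approve order", "two-person approval",
            "single approval only", "workaround"),
    GapItem("print delivery note", "customer logo on printout",
            "plain template only", "customisation"),
]

# The text argues core changes should remain the exception; flag them early.
core_changes = [g for g in gap_analysis if g.resolution == "core change"]
assert core_changes == []
```

Recording each gap with its intended resolution makes the follow-up workaround and customisation effort directly countable for planning.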
3. Analysis of Business Products
Not all important information is contained in the business processes. In a retail system for example, firms must
certainly comply with governmental pricing regulations, which are normally readily available in the system.
However, these firms may also desire to create their own product-related pricing and charging schemes based
upon proprietary calculations. This kind of product-specific information typically resides in documented product
descriptions.
The realisation of a product may involve different activities in different processes. Process and product views
shed light on the business portfolio of a firm from different angles of perspective. While analysing the business
products, analysts need to pay intensive attention to features, interfaces, ancillary products. Features are
functionalities (e.g. proprietary calculations), delineating what the product should perform and how the added
value is created. Interfaces include system-internal communication with other products within the portfolio and
system-external communication with other systems if necessary. Today, standardised interfaces have made the
industrial value chain operate more smoothly. Ancillary products are resultant aspects of an existing business
portfolio. Businesses rely not only upon activities and values, but also upon the reporting, controlling and auditing of these activities and values.
Similar to the process mapping, product analysis within the context of system implementation should include a Gap Analysis that documents the vacuum between the desired products and the features provided by the software. Workarounds or customisation should be conceived, designed and carried out subsequently if necessary. Again, core changes should be kept to a minimum.
4. Determination of Application Parameters
After having analysed the processes and products, the business specialists and system analysts need to focus on
the definition of application parameters. User rights and categorised parameter tables must be discussed in great
detail before figures, numbers and ranges can be set up in the system to achieve the desired products and
processes. Careful definition of user rights is essential for the security and smoothness of business operations.
The analysis of user rights for each user group must at least address access to system masks, level of data inputs
and processing, authorisation of data processing, access to data outputs (e.g. reports), and access to system
administration.
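As a minimal illustration of such a rights analysis, the sketch below models a user-rights matrix per user group. The group names and right categories are hypothetical assumptions for demonstration only, not taken from any particular system.

```python
# Hypothetical user-rights matrix: user group -> set of granted rights.
# Group names and right categories are illustrative assumptions.
RIGHTS = {
    "teller":     {"masks", "data_input"},
    "supervisor": {"masks", "data_input", "authorisation", "reports"},
    "admin":      {"masks", "data_input", "authorisation", "reports", "system_admin"},
}

def has_right(group: str, right: str) -> bool:
    """Return True if the given user group holds the given right."""
    return right in RIGHTS.get(group, set())
```

A table of this kind makes the rights discussion concrete: each cell of the matrix can be reviewed and signed off by the business specialists before being set up in the system.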
Other parameters can be categorised in individual tables as the basis of discussion. Business specialists from
different product or process backgrounds must first understand what the parameters defined by the core software
system mean. System specialists or external consultants familiar with the system are required. The central task
here is to map the system-defined parameters to the product parameters used in the business, in terms of both
definition and terminology. These discussion sessions are highly important. One major result of the
parameterisation is to enable the purchased system to reproduce the business content of the purchasing firm,
partially through workarounds. Another major result should be that customisation and external development
tasks are worked out in detail, upon which follow-up resource and budget needs can be based.
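A minimal sketch of such a mapping between business terminology and system-defined parameters is shown below. All business terms, system parameter identifiers and settings are invented for illustration; a real mapping would come out of the discussion sessions described above.

```python
# Hypothetical mapping: business parameter term -> (system parameter id, settings).
# Every name below is an illustrative assumption, not a real system's vocabulary.
BUSINESS_TO_SYSTEM = {
    "monthly account fee": ("CHRG_PERIODIC", {"period": "M"}),
    "overdraft interest":  ("INT_DEBIT",     {"basis": "365"}),
    "transfer commission": ("CHRG_EVENT",    {"event": "TRANSFER"}),
}

def to_system_parameter(business_name):
    """Translate a business term into the system's parameter id and settings."""
    if business_name not in BUSINESS_TO_SYSTEM:
        raise KeyError(f"no system parameter mapped for '{business_name}'")
    return BUSINESS_TO_SYSTEM[business_name]
```

Unmapped business terms surface immediately as errors, which is exactly the gap information needed to plan workarounds, customisation or external development.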
Multinational businesses are additionally faced with the difficulty of parameter differences that are necessary for
different national marketplaces. Therefore, system implementation may require the definition of a multinational
parameter mix. Because experts of local parameters are located in the respective local markets, parameterisation
under such circumstances may require intensive communication with subsidiaries and representatives located in
other countries. Analysts, business and software specialists at the central headquarters can use this opportunity to
stay at the multinational sites for information elicitation and analysis, and by doing so deepen their knowledge
of the local business environment. It is worth mentioning that short stays in countries with lower living standards
may not be highly comfortable, but the learning effect potentially achievable may be comfortably high.
Although parameter differences may exist across national borders, unified process and product definitions may
be beneficial for achieving economies of scale in the global strategy of the business firm.
5. Project Management
Tailoring the software system to the firm-specific business requirements is a daunting task of high complexity,
consisting of hundreds or even thousands of work units and packages. Without proper management, the whole
project would become a monstrous amount of work without results; project management is therefore in most
cases indispensable. In particular, time, resources and budgets must be planned and managed carefully.
Time management should include a highly detailed listing of work packages and their interdependencies. The
duration of each work package is estimated carefully; statistical methods such as the beta distribution can be
deployed here. Project professionals typically apply network diagrams to illustrate the structure of the project
and identify the critical path, from which the complete project duration may be estimated. Resource loading
diagrams and resource levelling help the project management maintain an overview of deployed resources and
avoid extreme over- or under-occupation. In software implementation projects, human resources must be
handled especially carefully: holidays and travel plans must be considered. Furthermore, the costs of
workarounds, parameterisation and customisation mentioned above are certainly also part of the overall costs.
Although the top-down approach is the predominant budgeting policy in most implementation projects,
managers should always have an open ear for bottom-up information coming from project members to ensure a
more realistic budgeting plan. It is worth mentioning that project communication with top management and the
external system supplier must be honest and transparent. Business requirements analysis and software
implementation are not just about software and hardware, but also about trust and relationships. These soft
factors can sometimes even be decisive for the overall success of the project.
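The scheduling techniques mentioned above can be sketched in a few lines of Python. The work packages, their three-point estimates and their dependencies below are invented assumptions for illustration, not drawn from any real project; the sketch computes PERT (beta-distribution) expected durations, (a + 4m + b) / 6, and finds the critical path as the longest path through the dependency network.

```python
def pert_estimate(optimistic, likely, pessimistic):
    """PERT (beta-distribution) expected duration: (a + 4m + b) / 6."""
    return (optimistic + 4 * likely + pessimistic) / 6

# Hypothetical work packages: name -> (expected duration, predecessors).
tasks = {
    "map_processes": (pert_estimate(4, 6, 10),  []),
    "define_params": (pert_estimate(3, 5, 9),   ["map_processes"]),
    "customise":     (pert_estimate(8, 12, 20), ["define_params"]),
    "train_users":   (pert_estimate(2, 3, 6),   ["map_processes"]),
    "go_live":       (pert_estimate(1, 2, 4),   ["customise", "train_users"]),
}

def critical_path(tasks):
    """Return the longest path through the dependency network and its duration."""
    finish, path = {}, {}
    def ef(name):
        # Earliest finish = own duration + latest-finishing predecessor.
        if name not in finish:
            dur, preds = tasks[name]
            best = max(preds, key=ef, default=None)
            finish[name] = dur + (finish[best] if best else 0.0)
            path[name] = (path[best] if best else []) + [name]
        return finish[name]
    end = max(tasks, key=ef)
    return path[end], finish[end]
```

Here the chain through customisation dominates the schedule, which matches the paper's point that customisation work is a major cost and time driver.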
6. Conclusion: 4Ps of business requirements for successful software implementation
Parameterisation-capable and customisable software applications tailored to firm-specific business requirements
have become highly coveted in myriads of industries and firms. This paper argues that 4Ps are most essential for
integrating such systems seamlessly in the firm-individual operational environment:
(P)rocess: An effective process mapping should delineate functional roles, process steps and detailed content
of process steps. It should highlight the system-related activities and the data inputs and outputs at these
activities. It should also define system printouts. Analysts and business specialists must analyse the process a
number of times, thereby moving from less to more detailed levels. By conducting a careful Gap Analysis,
analysts can identify the needs for future workarounds and customisation. External development is to be
avoided as much as possible.
(P)roduct: Products must be defined in a written document, which clarifies the product features, internal and
external interfaces and ancillary products. Similarly, a resultant Gap Analysis is strongly recommended. The
software to be purchased must perform what the defined products need, at least through workarounds and
customisation. Important product features should be realised straightaway by the system. External
development is to be kept to a minimum.
(P)arameter: User rights, process-related and product-related parameters, parameter ranges to be applied in
the system should be defined and reviewed carefully. Sometimes, it is necessary that business parameters
familiar to the business specialists must be mapped to the system parameters familiar to the system specialists.
The analysts should accompany and mediate this process. Multinational corporations may have to
adapt the parameters to the local operational sites in a parameter mix.
(P)roject: System implementation tailored to firm-specific business requirements consists of complex
activities that can only be effectively and efficiently managed if a proper project management environment is
in place. Completion time, resource allocation and budgeting are most important aspects. Project
management should also take into account the activities necessary after the business requirements analysis
has been carried out. Typical activities are the setup of defined parameters in the system, customisation
activities, acceptance testing, go-live of the system, and screening of results and communication for further
system improvement.
Journal of Computer Information Systems
ISSN: 0887-4417 (Print) 2380-2057 (Online) Journal homepage: http://www.tandfonline.com/loi/ucis20
Improving Open Source Software Maintenance
Vishal Midha, Rahul Singh, Prashant Palvia & Nir Kshetri
To cite this article: Vishal Midha, Rahul Singh, Prashant Palvia & Nir Kshetri (2010) Improving
Open Source Software Maintenance, Journal of Computer Information Systems, 50:3, 81-90
To link to this article: https://doi.org/10.1080/08874417.2010.11645410
Published online: 11 Dec 2015.
Spring 2010 Journal of Computer Information Systems 81
Improving Open Source Software Maintenance

Vishal Midha, The University of Texas – Pan American, Edinburg, TX 78539
Prashant Palvia, The University of North Carolina at Greensboro, Greensboro, NC 27402
Rahul Singh, The University of North Carolina at Greensboro, Greensboro, NC 27402
Nir Kshetri, The University of North Carolina at Greensboro, Greensboro, NC 27402
Received: June 22, 2009 Revised: August 17, 2009 Accepted: September 9, 2009
Abstract
Maintenance is inevitable for almost any software. Software
maintenance is required to fix bugs, to add new features, to
improve performance, and/or to adapt to a changed environment.
In this article, we examine change in cognitive complexity and its
impacts on maintenance in the context of open source software
(OSS). Relationships of the change in cognitive complexity with
the change in the number of reported bugs, time taken to fix the
bugs, and contributions from new developers are examined and
are all found to be statistically significant. In addition, several
control variables, such as software size, age, development status,
and programmer skills are included in the analyses. The results
have strong implications for OSS project administrators; they
must continually measure software complexity and be actively
involved in managing it in order to have successful and sustainable
OSS products.
Keywords: OSS, Complexity, Software Maintenance
Introduction
The importance of software maintenance in today’s software
industry cannot be overstated. Maintenance is inevitable
for almost any software. Software maintenance is required to
fix bugs, to add new features, to improve performance, and/or to
adapt to a changed environment. Pigoski [39] illustrated that the
portion of industry’s expenditures used for maintenance purposes
was 40% in the early 1970s, 55% in the early 1980s, 75% in the
late 1980s, and 90% in the early 1990s. Over 75% of software
professionals perform program maintenance of some sort [24].
Given these numbers, an understanding of software maintenance
is prudent.
It is not unusual that a developer modifying the source code
has not participated in the development of the original program
[31]. As a consequence, a large amount of the developer’s efforts
goes into understanding and comprehending the existing source
code [46]. Comprehending existing source code, which involves
identifying the logic in and between various segments of the
source code and understanding their relationships, is essentially
mental pattern recognition by the software developer, involving
the filtering and recognition of enormous amounts of data
[43]. As software becomes increasingly complex, the task of
comprehending existing software is becoming increasingly difficult
[43]. Fjelstad and Hamlen [17] reported that more than 50% of
all software maintenance effort is devoted to comprehension. The
comprehension of source code, thus, plays a prominent role in
software development.
In this article, we examine software complexity and its impacts
in the context of open source software (OSS). Past efforts have
been piecemeal or based on limited information. For example,
comprehension of the source code has been linked with source
code complexity. The empirical evidence on the magnitude of the
link is relatively weak [29]. However, many such attempts are
based on experiments involving small pieces of code or analysis of
software written by students [2]. In order to remedy this situation,
we analyze real world software written by the OSS developer
community. A number of studies have examined the impact of
complexity on maintainability and made recommendations to
reduce complexity [30][31]. However, no study, to the best of our
knowledge, has tested whether the reduced complexity was actually
beneficial to the developers performing software maintenance.
This study specifically examines the impact of change in software
complexity on maintenance efforts.
Open Source Software Development
A typical open source project starts when an individual (or
group) feels a need for a new feature or entirely new software, and
someone in that group eventually writes it. In order to share it
with others who have similar needs, the individual/group releases
the software under a license that allows the community to use it,
to see and modify the source code to meet local needs, and to
improve the product by fixing bugs. Making software widely available on
an open network, e.g., the Internet, allows developers around the
world to contribute code, add new features, improve the present
code, report bugs, and submit fixes to the current version. The
developers of the project incorporate the features and fixes into
the main source code and a new version of the software is made
available to the public. This process of code contribution and bug
fixing is continued in an iterative manner as shown in Fig 1.
OSS supporters often claim that OSS has faster software
evolution. The idea is that multiple contributors can be writing,
testing, or debugging the product in parallel. Raymond [42]
mentioned that more people looking at the code will result in more
bugs found, which is likely to accelerate software improvement.
The OSS model claims that the rapid evolution produces better
software than the traditional closed model because in the latter
“only a very few programmers can see the source and everybody
else must blindly use an opaque block of bits” [38].
One interpretation of the OSS development process is that of
a perpetual maintenance task. Developing an OSS system implies
a series of frequent maintenance efforts for bugs reported by
various users. As most OSS projects are the result of voluntary
work [14][48], it is crucial to ensure that such volunteers are able
to work with minimal effort. The motivation for why developers
contribute to a source code has received a great deal of attention
from researchers [34]. However, the factors that can lead the
OSS community not to contribute to a source code have received
limited attention.
In this light, von Hippel and von Krogh [51] noted that
the major concern among developers was the complexity of
the source code and the level of difficulty of the embedded
algorithms. Fitzgerald [15] pointed out that increasing complexity
poses a barrier in OSS development and may trigger the need
for either substantial software reengineering or entire system
replacement. Therefore, it is vital to understand the complexity of
the source code and its impact on software development, and even
more importantly, on OSS development.
OSS and Complexity
A complex project, in general, demands a large share of
resources to modify and correct. When the source code is simple,
it is easier to maintain. On the contrary, when the source code
is complex, developers have to expend a large portion of their
limited time and resources to become familiar with it. In OSS,
where the developers seek to gain personal satisfaction and value
from peer review and are not bound to projects by employment
relationships, they have the option to leave the project at any time
and join other projects where their resources could be used more
efficiently. Therefore, controlling complexity in OSS projects may
have several benefits, including facilitation of new developers’
learning. Feller and Fitzgerald [14] pointed out that if new
contributors are to have any chance of contributing to OSS
projects, they should be able to do so with minimal effort.
Controlled complexity helps achieve that, making it indispensable
for OSS [14].
Much of what we know about software complexity comes
from analyses of closed source development (e.g., [5]). As noted
by Stewart et al. [49], even though the results from those findings
have been applied to OSS (e.g., study of Debian 2.2 development
[21]), there remains a relative scarcity of academic research on
the subject. More importantly, these studies were limited to a
small number of projects.
The remainder of the paper is organized as follows: The next
section draws on relevant literature to develop a theoretical model.
It is followed by a description of the methods and measures used
in the study. The following sections present the evaluation of the
model and discussion of the results. The paper is concluded by
acknowledging its limitations and highlighting its contributions
to both research and practice.
Model Development
Basili and Hutchens [4] define complexity as a measure of the
resources expended by a system while interacting with a piece
of software to perform a given task. It is important to clearly
understand the term “system” in this definition. If the interacting
system is a computer, then complexity is defined by the execution
time and storage required to perform the computation. For
example, as the number of distinct control paths through a
program increases, the complexity may increase. This kind of
complexity is called “Computational Complexity” [11]. If the
interacting system is a programmer, then complexity is defined by
the difficulty of performing tasks. This complexity comes from
“the organization of program elements within a program” [22], for
example, tasks such as coding, debugging, testing, or modifying
the software. This kind of complexity is known as “Cognitive
Complexity”. Cognitive complexity refers to the characteristics
of the software which make it difficult to understand and work
with [11]. It is our primary concern.
The notion of cognitive complexity is linked with the
limitations of short term memory. According to the cognitive
load theory, all information processed for comprehension must at
some time occupy short-term memory [43]. Short-term memory
is described as the capacity of information that the brain can hold
in an active, highly available state. It can be thought of as a
container in which a small, finite number of concepts can be
stored. If data are presented in such a way that too many
concepts must be associated in order to make a correct decision,
Figure 1 — OSS Development
then the risk of error increases. In OSS, a voluntary developer
must retain the existing source code in short term memory in
order to successfully modify the existing code. The capacity
of holding information may vary depending on the individual
and may limit the capability of developers to comprehend and
modify the existing source code. Kearney et al [29] suggested
that the difficulty of understanding depends, in part, on structural
properties of the source code. As we are concerned with the
impact of complexity on source code comprehension, we focus
on properties related to the source code. This argument forms the
basis for theorizing the impact of complexity on various aspects
of OSS development, as described below.
Number of Bugs
The main idea behind the relationship between complexity
and number of bugs is that when comparing two different
solutions to the same problem, all other things being equal, the
more complex solution will generate more bugs. This relationship
is one of the most analyzed by software metrics researchers, and
previous studies and experiments have found this relationship to
be statistically significant [11][27].
In order for a programmer to understand the existing source
code, he needs to understand the flow of logic. And, when a
programmer has to deal with a source code with high cognitive
complexity, he has to frequently search among dispersed pieces
of code to determine the flow of logic [40]. Understanding and
recollecting such dispersed pieces increase the cognitive load on
the programmer, making complex code maintenance more liable
to human error. Complex software hence needs more maintenance
effort. Gill and Kemerer [20] reported that the number of bugs
in a program is positively associated with maintenance effort and
recommended further empirical testing with a larger data set.
Therefore, OSS projects that experience an increase in complexity
over their previous versions would also experience an increase in
the number of bugs. Based on the above, we propose:
H1: An increase in the source code’s cognitive complexity
is positively associated with an increase in the number
of bugs in the OSS source code.
Contributions from New Developers
Because of the important role of volunteer developers in the
OSS development, attracting new developers and keeping them
motivated is crucial to OSS development. Keeping the developers
motivated is especially important during the early development
stage so that the number of developers can reach a critical mass.
Some of the cited developers’ motivations include intellectual
gratification, career future incentives, learning and enjoyment,
ego-boosting, and peer recognition [6][8][35][37].
Once a new developer is motivated to voluntarily contribute,
he needs to first spend a large amount of time and resources to
understand the existing source code. When the source code is easy
to comprehend, it is easier to modify. However, when the source
code is complex, a developer is required to invest additional
effort and resources to understand it. Devoting such effort and
resources may pose a barrier to the developer’s motivation to
contribute. Such a barrier may lead the potential developer not
to contribute to the project at all or, in the worst case, to leave
the project. Hence,
H2: An increase in the source code’s cognitive complexity
is negatively associated with an increase in the
number of contributions to the OSS source code from
new developers.
Time to Fix Bugs
More complex source code adds to a programmer’s cognitive
load [12]. High cognitive load requires more time-consuming and
resource-demanding effort to familiarize oneself with the code.
It is even possible that a source code is so complex that it cannot
be comprehended at all. In such a scenario, the programmer may
spend time and resources on other activities, thereby further
lowering the productivity of the project.
In other words, source code with less cognitive complexity
does not need as much effort or resources, thus reducing the
turnaround time required to fix bugs. This leads to the next
hypothesis: OSS projects that experience an increase in cognitive
complexity over the previous version require a longer time to fix
bugs. Hence, we hypothesize,
H3: An increase in the source code’s cognitive complexity
is positively associated with an increase in the average
time taken to fix the bugs in OSS source code.
Combining all the preceding conceptual arguments gives
the research model shown in Fig. 2. Note that several control
variables have been included in the model in order to increase
the robustness of our findings. The specific variables will be
described in the next section.
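As a hedged illustration of how a hypothesis such as H1 might be evaluated, the sketch below fits a simple ordinary least squares regression of the change in bug count on the change in complexity. The data points are invented purely for demonstration; the study itself analyzes 450 SourceForge projects and includes several control variables in the full model.

```python
# Minimal simple-OLS sketch (one predictor, no controls).
# All data points below are invented for illustration only.
def ols_slope_intercept(x, y):
    """Closed-form simple OLS: slope and intercept of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

chg_cc   = [-0.2, -0.1, 0.0, 0.1, 0.3, 0.4]  # hypothetical ChgCC values
chg_bugs = [-3, -1, 0, 2, 5, 7]              # hypothetical ChgBugsReported values
slope, intercept = ols_slope_intercept(chg_cc, chg_bugs)
```

A positive slope in a fit like this would be consistent with H1; the paper's actual test additionally controls for size, age, downloads and developer skills.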
Methods
The following explanation is helpful in understanding the
research design and methods. This study investigates the impact
of change in complexity. To compute the change in complexity,
the complexity of two consecutive versions of the software
must be examined. It is important to note that the complexity
of the source code of a software version can only be measured
after it has been released to the OSS community. Only after it
has been used are the discovered bugs reported and the code
modified to fix these bugs. Once a significant amount of
modifications has been made, a new version is released to the
public. Due to the modifications in the source code, the complexity
of the source code changes. In order to compute the change in
complexity of the current version (say the Nth) from its previous
version (the N-1th), one needs to measure the complexity of both the
current (Nth) and the previous version (N-1th). As the modifications
and contributions made to the current version (Nth) are available
in the next version (N+1th), one needs to also look at the next
version (N+1th) to find these modifications and contributions.
As a consequence, for each project, we need to study three
releases, referred to as the first (N-1th), the second (Nth), and the
third (N+1th).
OSS projects hosted at SourceForge were examined in this
study. SourceForge is the primary hosting place for OSS projects
which houses about 90% of all OSS projects. It has been argued that
SourceForge is the most representative of the OSS movement, in
part because of its popularity and the large number of developers
and projects registered [23][54]. Researchers interested in
investigating issues related to the OSS phenomenon have
predominantly used SourceForge data [23][51][54].
Studying all projects hosted on SourceForge was unfeasible
and impractical due to resource limitations. Data selection
was limited to projects that were targeted at either end users
or developers. In order to avoid ambiguity, projects that were
targeted at both end users and developers were excluded.
Further selection was made by controlling for the programming
language and the operating system. Past literature suggests that
programming language has an explicit impact on complexity
[52] and program size [28]. It is also difficult to compare lines
of code between “high” and “low” level programming languages.
Lower-level programming languages have more lines of code and
take longer to develop than higher-level programming languages.
As the C family of languages is the one most preferred by OSS
developers [45], only projects written in C/C++ or in multiple
languages including C/C++ were selected. Secondly, the operating
system of the project impacts the complexity of the software and
the development effort required. To encompass the majority of
the projects targeted at developers and end users, all projects in
the data set were designed for either the Windows or the
Linux/Unix operating system.
As the data was collected from three different versions of the
software, the sample was further restricted to projects that
had at least 3 versions. A version released within the first 3 months
of the registration date is considered the First release, another major
version released between 3 and 6 months of the registration date is
considered the Second release, and yet another major version released
within 6 to 12 months of the registration date is considered the
Third release for this study. Therefore, to be able to get the data
for three different versions, we considered all projects that were
registered on SourceForge between January 2003 and
August 2006, so that the third release for the projects registered
in August 2006 was released by August 2007. The final
data collection was completed in August 2007. Lastly, projects
were chosen for which the required data were publicly available
(not all projects allow public access to the bug tracking system).
Following the above criteria, the final sample size was limited to
450 projects.
Measures
Cognitive Complexity
McCabe’s cyclomatic complexity (CC) assesses the difficulty
faced by the maintainer in following the control flow of the
program. It is considered an indicator of the effort needed to
understand and test the source code [47]. Kemerer and Slaughter
[30] used McCabe’s cyclomatic metric to evaluate decision
density, which represents the cognitive burden on a programmer
in understanding the source code. In order to compute cyclomatic
complexity, each source code file was subjected to a commercial
software code analysis tool. To account for the effects of size, the
complexity metric was normalized by dividing it by the number
of lines of code for each software project. This procedure also
reduces collinearity problems when size is included in the
regression models [20]. The Change in Cognitive Complexity
(ChgCC) was calculated by subtracting the cyclomatic complexity
measure of the first version from the cyclomatic complexity
measure of the second version, i.e., CC2nd – CC1st.
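The measure described above can be sketched as follows. Per-file cyclomatic complexity values are assumed to come from an external code-analysis tool; the figures used here are hypothetical.

```python
# Sketch of the complexity measure: cyclomatic complexity normalized
# by lines of code, and the change between two consecutive releases.
def normalized_cc(file_cc, total_loc):
    """Sum of per-file cyclomatic complexities divided by total LOC."""
    return sum(file_cc) / total_loc

def change_in_cc(cc_first, cc_second):
    """ChgCC = CC(2nd release) - CC(1st release)."""
    return cc_second - cc_first

# Hypothetical per-file CC values and LOC for two releases of one project.
cc1 = normalized_cc([12, 7, 30], total_loc=4_000)
cc2 = normalized_cc([15, 9, 33, 11], total_loc=5_200)
chg_cc = change_in_cc(cc1, cc2)
```

Normalizing by LOC, as the paper notes, separates the complexity effect from the pure size effect, which enters the model as its own control variable.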
Change in Number of Bugs and Time Taken to Fix Bugs
Various elements of data were extracted from the bug tracking
system and the Concurrent Versioning System (CVS) reports,
including the bugs reported, the date on which the bugs were
reported, the date on which the bugs were fixed, and the version
number. One problem was that not all the bugs in the current
version were closed at the time of the study. To overcome
Figure 2 — The Research Model
the problem, only earlier versions that had more than 90% of their
bugs closed at the time of the study were included. From these
extracted elements, the number of bugs reported and the time taken
to fix them for different software versions were computed. From the
number of bugs and the time to fix these bugs for each version,
the change in the number of bugs (ChgBugsReported) over the
previous version and the change in the average time to fix the
bugs (ChgFixTime) were computed (i.e., BugsReported3rd –
BugsReported2nd).
Contributions from New Developers
Software developers use CVS to manage the software
development process. CVS stores the current version(s) of the
project and its history. A developer can check out a complete
copy of the code, work on this copy and then check the changes
back in. The modifications are peer reviewed, ensuring quality.
CVS updates the modified file automatically and registers it as a
commit. CVS keeps track of what change was made, who made
the change, and when the change was made. This information
can be gathered from the log files of the CVS repository of a
project. As CVS commits provide a measure of novel invention
that is internally validated by peers [10][23], the number of
CVS commits is used as a measure of developers’ contributions.
A commit is considered a ‘contribution from a new developer’
when the developer has not contributed to the previous version.
The number of contributions made by new developers is
represented as ChgNewDevs (i.e., ChgNewDevs3rd –
ChgNewDevs2nd).
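The new-developer measure could be sketched as below; committer names and commit messages are hypothetical stand-ins for CVS log entries.

```python
# Sketch: from per-version commit logs, count commits made by
# developers who did not contribute to the previous version.
# All committer names and messages are hypothetical.
def new_developer_commits(prev_commits, curr_commits):
    """Count commits in curr_commits whose author never appears in prev_commits."""
    veterans = {author for author, _ in prev_commits}
    return sum(1 for author, _ in curr_commits if author not in veterans)

v2_log = [("alice", "fix parser"), ("bob", "add tests")]
v3_log = [("alice", "refactor io"), ("carol", "fix crash"), ("carol", "docs")]

new_dev_commits = new_developer_commits(v2_log, v3_log)  # counts carol's commits
```

In practice the author and timestamp fields would be parsed out of CVS log files, as the paper describes.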
Control Variables
Age
Brooks’s Law [7] states that “adding more programmers to a
late project makes it later”. Based on this, adding new developers
at later stages will increase the average time taken to fix bugs. On
the other hand, age may indicate the legitimacy and popularity
of the software. Popular software attracts more developers, and
thus older software will have a higher number of contributions
from developers. To control for age, the Age variable is defined
as the number of months from a project’s inception at SourceForge
until the second release.
Size
Size is the oldest measure of software complexity and is
believed to be a major driver of software maintenance effort [53].
Larger software is likely to receive more enhancements and more
repairs than smaller software, ceteris paribus, as larger software
embodies greater amount of functionality subject to change. The
larger the software, the more difficult it is to test and validate its
functionality. This implies that larger software tend to incorporate
more errors. Keeping the above in mind, Size is used as a control
variable and is captured by the number of lines of code of the
second release.
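As a rough illustration, a line-of-code count can be sketched as below. The paper does not specify its exact counting rules, so excluding blank and comment-only lines is an assumption here.

```python
def count_loc(source_text, comment_prefix="#"):
    """Crude size measure: count lines that are neither blank nor
    comment-only. Real LOC tools apply language-specific rules."""
    return sum(
        1
        for line in source_text.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )
```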
Number of downloads
OSS developers can leverage the law of large numbers to
identify and fix bugs [41]: “given enough eyeballs, all bugs
are shallow.” A huge user base for the software implies that the
software will be tested in numerous different environments, more
bugs will surface, these will be communicated efficiently to more
bug fixers, the fix being obvious to someone, and the fix will
be communicated effectively back and integrated into the core
of the product. To isolate this effect, the cumulative number
of downloads (Downloads) till the second release of the project
is used as a control variable.
New Developer Knowledge and Skills
The literature on performance has identified individual
characteristics such as knowledge and skills as antecedents.
Such characteristics are, however, difficult to measure, and are
frequently measured through the use of surrogate measures like the
level of education and experience. Curtis et al. [11] reported that
in a series of experiments involving professional programmers,
the number of years of experience was not a significant predictor
of comprehension, debugging, or modification time, but that
number of languages known was. They suggest that the breadth
of experience may be a more reliable guide to ability than length
of programming experience. In this work, we also use the breadth
of the experience as a surrogate for developer’s knowledge and
skills. So, to control for the effect of new developers’ skills,
the variable SkillsChg (i.e., Skills_2nd – Skills_1st) was used and was
measured by the change in team skills with the addition of new
developers to the team.
Sponsorship
An increasing number of open source projects have opted
to receive monetary donations from organizations and users.
Although some developers and projects choose to allocate part or
all of the incoming donations to SourceForge, most recipients of
the donations rely on monetary support to fund development time
and other key resources that are necessary for the continuation of
the projects. It is expected that developers receiving additional
monetary benefits will devote extra effort and time into
comprehending and fixing the source code. The control variable
AcceptSponsors is used to capture whether a project is accepting
external funds and using monetary compensation as part of its
incentive mechanism. It takes the value of 1 if the project is
accepting donations and 0 otherwise.
Development Status and Maturity
To capture the development stage of a project, which is
typically determined by the developer in charge of the project on
SourceForge, the control variable DevStatus takes values ranging
from 1 to 6 representing the development stages of Planning,
Pre-Alpha, Alpha, Beta, Production/Stable, and Mature, respectively.
DevStatus was also measured at second release. The larger the
value of DevStatus, the more mature the project is.
Transformations
Initial investigations indicated that the dependent variable and
many of the independent variables were not normally distributed.
In such cases, linear regression analysis might yield biased and
non-interpretable parameter estimates [19]. Therefore, as suggested
by Gelman and Hill [19], a logarithmic transformation on the
dependent and the not-normally distributed independent variables
was performed.
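For instance, such a transformation can be applied as below. The +1 offset for zero-valued counts is an assumption; the paper does not state how zeros were handled.

```python
import math

def log_transform(values, offset=1.0):
    """Return ln(x + offset) for each value; the offset keeps
    zero-valued counts (e.g., zero bugs reported) defined."""
    return [math.log(v + offset) for v in values]
```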
86 Journal of Computer Information Systems Spring 2010
Results
The Variance Inflation Factor (VIF) was computed for all
variables in order to test for multicollinearity. VIF is one measure
of the effect other independent variables might have on the
variance of a regression coefficient. Large VIF values indicate
high multicollinearity. Studenmund [50] recommends a cut-off
of 10 for VIF. The VIF values for the different variables in the
regression analyses are reported in Table 1, and in no case exceed
1.2. The low VIF values indicate that multicollinearity is not a
serious problem.
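As a sketch of the computation, the VIF for each predictor can be obtained by regressing it on the remaining predictors and applying VIF_j = 1 / (1 − R_j²); this is the standard definition, not the authors' code.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: for each column j, regress it on the
    remaining columns (with an intercept) and return 1 / (1 - R^2_j).
    Assumes no column is constant."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out
```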
As we are interested in studying the impact of change of
complexity on three dependent variables which are largely
distinct, we formulate three separate regression equations
analyzing each of the dependent variables. For the dependent
measure, ChgBugsReported, the impact of change in complexity
on the number of bugs (Hypothesis H1) was found by estimating
the parameters in the following regression model:
ChgBugsReported = α + β1 ChgCC + β2 lnSize + β3 lnDownloads + β4 AcceptSponsors + β5 DevStatus + β6 lnAge + β7 lnSkillsChg
A positive and significant estimate of parameter β1 would
indicate that the probability of having bugs in a source code
increases as the cognitive complexity of software increases. The
results of the regression (Hypothesis 1) are presented in Table 1.
The model shows a good fit with the data (F=33.552, p<0.00).
The parameter estimate for ChgCC is positive and significant
(β1 = 0.303, p<0.00). The results suggest that projects with a unit
increase in cognitive complexity experience a 0.303-unit increase
in the number of bugs, and H1 is supported. The studied variables
explained 37.5% of the total variance in the change in bugs
reported (R² = 0.375).
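As an illustration of how models of this form can be estimated, here is a generic ordinary least squares sketch; it is not the authors' actual estimation code, and any statistics package would yield the same estimates.

```python
import numpy as np

def fit_ols(y, X):
    """Ordinary least squares with an intercept: y = alpha + X @ beta.
    Returns (alpha, beta_vector)."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(X.shape[0]), X])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef[0], coef[1:]
```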
Tested next is the impact of complexity on the number of
contributions from new developers (hypothesis H2) by estimating
the parameters for the following regression model:
ChgNewDevCommits = α + β1 ChgCC + β2 lnSize + β3 lnDownloads + β4 AcceptSponsors + β5 DevStatus + β6 lnAge + β7 lnSkillsChg
The results of the regression (Hypothesis 2) are presented
in Table 1. The model shows good fit with the data (F=34.702,
p<0.000). The parameter estimate for ChgCC is significantly
negative (β1 = −0.359, p<0.000). The results suggest that a unit
increase in cognitive complexity decreases the contributions from
new developers by 0.359 units. Hypothesis H2 is supported. The
studied variables explained 38.5% of the total variance in the
change in new developers’ commits (R² = 0.385).
Finally, examined is the impact of complexity on the time
taken to fix bugs (hypothesis H3) by estimating the parameters
for the following regression model:
Time to fix bugs = α + β1 ChgCC + β2 lnSize + β3 lnDownloads + β4 AcceptSponsors + β5 DevStatus + β6 lnAge + β7 lnSkillsChg
Table 1 shows the results of the regression analysis (Hypothesis
3). The model shows a good fit with the data (F=70.660, p<0.000).
The parameter estimate for ChgCC is significant and positive
(β1 = 0.720, p<0.000), indicating that projects that experience a
unit increase in cognitive complexity take 0.720 units of additional
time to fix bugs. Thus hypothesis H3 is supported. The studied
variables explained 56.1% of the total variance in the change in
time taken to fix the reported bugs (R² = 0.561).
Discussion and Implications
Main Effects
The increase in the cognitive complexity of open software as
it evolves over time is of significant concern, as it will make
software maintenance increasingly difficult. In the extreme,
developers may stop making fixes and refinements rendering
the software error-prone and obsolete. Ultimately the open
software may die its own death, be replaced by another software
project, or may undergo a major and laborious overhaul; all
options are expensive. In this section, we discuss our findings on
how complexity and control variables influence different aspects
of software maintenance.
The literature shows mixed support for the negative impact
of complexity on software quality. For example, Harter and
Slaughter [25] found a negative association between complexity
and quality. However, Gaffney [18] did not find software
complexity to be associated with error rates. Fitzsimmons
and Love [16] reported that the correlation between cognitive
complexity and the reported number of bugs ranges from 0.75
to 0.81. In our data, the correlation between the number of
Table 1 — Regression Results

Variable           Hyp. 1 β  Sig.   Hyp. 2 β  Sig.   Hyp. 3 β  Sig.   VIF
ChgCC               .303     .000   -.359     .000    .720     .000   1.150
Size                .228     .000   -.157     .000    .010     .775   1.160
Downloads           .173     .000    .099     .016   -.100     .004   1.205
AcceptSponsors     -.171     .000    .330     .000   -.082     .012   1.053
DevStatus          -.067     .083    .097     .011   -.010     .764   1.042
Age                 .156     .000    .038     .350   -.040     .245   1.177
SkillsChg          -.016     .664   -.105     .006    .069     .030   1.014
Adjusted R-Square   0.375           0.385            0.561
bugs reported and complexity was 0.43. It is interesting to
note that the correlation found in this study was much smaller
than the correlations reported in earlier studies for non-open
source software; however, it is consistent with the literature
on OSS. In the context of OSS, Schröter et al. [44] reported
the correlation value in the range of 0.40. Furthermore, Kemerer
and Slaughter [30] found that complex software is more
frequently repaired, which has the effect of increasing the
number of bugs. Therefore, it can be said with confidence that
as the complexity of the software increases, the number of re-
ported bugs, and by implication the actual number of bugs
increases.
Another measure of software quality is the time taken to fix
bugs. In fact, by mining software histories of two projects, Kim
and Whitehead [32] recommended to use time taken to fix bugs
as a measure of software quality. In our analysis, we found that
the complexity of software has a strong positive influence on the
time taken to fix bugs. It is common that when a bug is fixed in
one segment of the source code, it usually causes ripple effects
and adjustments in other segments [36]. The more complex the
software, the more adjustments are needed in other segments. As
a consequence, the developer has to simultaneously understand,
and repair related pieces in dispersed segments. Handling all
segments together has a detrimental effect on the time devoted by
the developer because more time is needed to follow the flow of
logic within the code [3]. This is supported by several empirical
studies that have found that time required to fix bugs increases
as complexity increases [5][20]. This result has another pernicious
effect on software maintenance. When a developer becomes
conscious of the long time needed to fix a bug, there is a tendency
for the developer to find “quick and dirty” solutions, thereby making
the code even less maintainable. Such half-baked efforts lead to
a vicious cycle in which the complexity, the number of bugs, and
the time taken to fix those bugs feed on each other until a dead end
is reached with the only option of either reengineering the project
or shutting it down completely.
Another reason for the longer time to fix bugs in complex
code can be found in Dymo’s [13] observations. Dymo noted that
most people prefer to work on software enhancements by adding
features rather than working on fixing bugs. This is especially
true, when the source code is more complex. Debugging and
understanding the existing code, written by someone else, takes
more time and resources. As the majority of the work is done on a
voluntary basis in open software and developers are not bound by
contracts, developers tend to work on new versions of the software
rather than continue to work on improving the old ones. Although
this has the potential of bringing them more visibility in the OSS
community, the net effect is further delay in fixing bugs.
Another impact of source code complexity analyzed in the
study is on attracting contributions from new developers. Analysis
shows that cognitive complexity has a strong negative influence
on the number of contributions from new developers. As OSS
thrives upon voluntary contributions, the project managers must
actively control the source code complexity in order to attract
contributions from new developers. In a complex piece of code,
it takes longer for a developer to determine the flow of logic
resulting in slower progress of the project [40]. Cavalier [9]
pointed out that the willingness of people to continue to contribute
to a project is related to the progress that is made in the project.
If a large number of activities do not seem to be moving forward,
participants lose interest, leading them to leave the project. This
leads to a higher likelihood of activities not being completed, and
ultimately, the death of the project. Such projects become inactive
over time and fail to attract any contributions.
effects of Control variables
Interesting observations can be made based on the effects of
the control variables. Our analysis found strong effects of size on
the number of bugs and the number of contributions from new
developers. It is often argued that complexity and size are strongly
correlated and that could lead to the problem of multicollinearity,
which tends to inflate regression coefficients. As mentioned
earlier, multicollinearity was tested by computing variance
inflation factors and was found to be within permissible limits.
Accordingly, the effects of size are independent of the effects of
complexity.
The number of downloads has strong effects on the number of
bugs, time to fix bugs, and the number of contributions from new
developers. The number of downloads indicates the popularity
of a project; popular projects attract more users and developers
[33]. As the number of users and the developer community grow,
the number of eyes watching the source code increases. As Eric
Raymond [41] repeatedly mentions, “given enough eyeballs, all
bugs are shallow”. When source code is open and freely visible,
users can readily identify flaws. The probability of finding a bug increases
with the increase in the number of eyes. As a result, the number
of hands working on code also increases leading to increased
contributions from new developers.
The continued development of a project, represented by
its age, gives software legitimacy, reputation and attention of
the community. However, in our study, age did not show any
significant effect. The reason could be that a large number of
OSS projects on SourceForge are in early stages of development
and there was not much variance in the data. This could be
attributed to the ease with which new projects can be started.
Such projects become inactive over time and have almost zero
contributions from the developer community. It could be argued
that age can bring legitimacy, reputation, and attention only if the
project is active. Therefore, a more reliable indicator of continued
development is the development status of a project, which was also
studied and was found to have a significant positive impact on the
number of commits from new developers. In the OSS literature,
development status has been shown to have a positive impact on
project’s popularity. Al Marzouq et al. [1] argue that a project
attracts more developers as the software becomes more stable.
In turn, these new developers bring effort and contribution that
improves the software. A growth cycle begins: a network effect
feeds both the community and the development of the software.
Lakhani and Wolf [34] showed that developers receiving
money in any form spend more time working on OSS than their
peers. Similar results are shown by this study. We found that
projects that have any form of sponsorship have a higher number
of contributions from new developers. Such projects also had
fewer bugs and took less time to fix the bugs. This clearly
indicates that developers are receptive to external stimuli such
as a monetary reward. Henkel [26] illustrated a similar impact
of external sponsorship on the development of applications for
Linux, one of the most successful OSS projects. Henkel noticed
that most contributors in the field of embedded Linux are salaried
or contract developers working for commercial firms.
The change in team skills with the addition of new developers
was found to have significant influence on the number of
contributions from new developers and the time taken to fix
bugs. However, both relationships were in a direction opposite to
what was expected. The expectation was that as new developers
increase, the number of contributions will increase and the time
taken to fix bugs will reduce. The opposite directions of the
relationships indicate that with the increase in number of skills,
the overall time to fix bugs increases and the new contributions
decrease. A logical explanation is that either the developers are
just joining the development team without actually contributing
towards project development or the amount of contributions is
not proportionate to the number of skills they possess. Possibly
the same core group of developers are largely responsible for
the majority of contributions, and new developers do not add
anything substantive. This logic is consistent with the commonly
held belief in OSS that development follows Pareto’s law, where
a small number of developers (~20%) are responsible for the
majority of the work accomplished (~80%).
Limitations and Contributions
Some limitations of the study need to be pointed out. The first
limitation is the sample frame. While SourceForge has data about
a vast collection of OSS projects, it does not capture all OSS
projects, which is the ultimate population of interest. While the
sample size is by far large enough to ensure statistical validity,
the choice of the sample frame may have some bearing on the
outcomes of the study. Additionally, it can be argued that the
change log only records the committer; whether the developer of
the code is ever acknowledged is uncertain. And, do all bugs get
reported? There could be bugs that are fixed but never reported.
In spite of the limitations, this study makes important
contributions to both the literature and practice. The results are
robust as the hypotheses regarding cognitive complexity were
supported after having controlled for various factors. In other
words, our conclusions cannot be seen as artificial due to possible
correlation with other factors. The most important contribution
is the strong support for the relationships between cognitive
complexity and software quality, and cognitive complexity and
contributions from new developers. Our models indicate that,
on the average, OSS development projects with high cognitive
complexity are significantly associated with increased bugs and
repair time and decreased contributions from new developers.
These findings have at least two immediate implications for
software managers and project administrators. First, they must
measure software complexity on a continual basis, at least once for
each release or at regular intervals. Second, they need to implement
guidelines for upper bounds of complexity and recommend that
software versions at no stage exceed these guidelines. However,
probably no standard guidelines are universally applicable for all
software development projects. Developers and administrators
may want to set their own standards for their specific projects, like
the NSA (National Security Agency) standard, which is derived
from an analysis of 25 million lines of software code written for
NSA.
Furthermore, project administrators for OSS projects need to
learn the importance of controlling complexity. As recommended
by Lehman [35], strategies need to be developed not only to
control complexity, but also to actively reduce it. As a software
project progresses, it becomes increasingly complex making it
difficult to understand and manage [14]. Project administrators
need to be careful about subsequent changes between different
versions. Such changes can have strong debilitating impacts on
projects. If changes are not well monitored, they can lead to a
ripple effect. Ripple effect refers to the phenomenon of changes
made to one part of the software affecting and propagating to
other parts of the software. Lehman’s operating system example
clearly shows the ripple effect since the percentage of modules
changed in Release 15 is 33% while the percentage of modules
changed in Release 19 is 56%. The OSS development, thriving on
voluntary contributions, must keep a close watch on the cognitive
complexity of the software in order to attract contributions from
new developers.
Another important contribution of this research is for
organizations involved in or interested in getting involved in
OSS development. Our results indicate that, contrary to OSS
ideological beliefs, offering a monetary reward for participation
may successfully attract increased contributions from the OSS
community.
References
[1]. AlMarzouq, M., Zheng, L., Rong, G., and Grover, V.
“Open Source: Concepts, Benefits, and Challenges,” Communications
of the AIS, 16, Article 37, 2005, pp. 756-784.
[2]. Banker, R. D., Datar, S., Kemerer, C., and Zweig, D.
“Software Complexity and Maintenance Costs,” Communications
of the ACM, 36(11), 1993, pp. 81-94.
[3]. Banker, R., Davis, G., and Slaughter, S. “Software
development practices, software complexity, and software
maintenance effort: a field study,” Management Science,
44(4), 1998, pp.433-450.
[4]. Basili, V. and Hutchens, D. “An Empirical Study of a
Syntactic Complexity Family,” IEEE Trans. Software
Engineering, 9, 1983, pp.664-672.
[5]. Boehm, B. Software Engineering Economics, Prentice-
Hall, New York, 1981.
[6]. Bonaccorsi, A., and Rossi, C. “Why open source software
can succeed,” Research Policy, 32(7), 2003, pp.1243-1258
[7]. Brooks, F. The Mythical Man-Month, Addison-Wesley,
Reading, Mass., 1975.
[8]. Carillo, K. and Okuli, C. “The Open Source Movement:
A Revolution in Software Development,” Journal of
Computer Information Systems, 49(2), Winter 2008/2009,
pp.1-9.
[9]. Cavalier, F. “Some Implications of Bazaar Size,” 1998,
available at http://www.mibsoftware.com/bazdev/ accessed
8 May 2006.
[10]. Crowston, K., Annabi, H, Howison, J. “Defining Open
Source Software Project Success,” Proceedings of ICIS,
Seattle, WA, 2003.
[11]. Curtis, B., Sheppard, S., Milliman, P., Borst, M., and Love,
T. “Measuring the Psychological Complexity of Software
Maintenance Tasks with the Halstead and McCabe
Metrics,” IEEE Transactions on Software Engineering,
5(2), 1979, pp.96-104.
[12]. Darcy, D., Kemerer, C., Slaughter, S., and Tomayko, J.
“The Structural Complexity of Software: An Experimental
Test,” IEEE Transactions on Software Engineering, 31(11),
2005, pp.982-995.
[13]. Dymo, A. “Open Source Software Engineering,” II Open
Source World Conference, Málaga, 2006.
[14]. Feller, J. and Fitzgerald, B. Understanding open source
software development, London: Addison-Wesley, 2002.
[15]. Fitzgerald, B. “Has Open Source Software a Future?,”
Perspectives on Free and Open Source Software, MIT
Press, 2005, pp.93-106.
[16]. Fitzsimmons, A. and Love, T. “A review and evaluation
of software science,” Computer Survey, 10(1), 1978, pp.3-
18.
[17]. Fjeldstad, R. and Hamlen, W. “Application program
maintenance-report to our respondents,” Tutorial on
Software Maintenance, 1983, pp. 13-27.
[18]. Gaffney, J. “Estimating the Number of Faults in Code,”
IEEE Transactions on Software Engineering, 10(4), 1984,
pp. 13-27.
[19]. Gelman, A., and Hill, J. Data Analysis Using Regression
and Multilevel/Hierarchical Models, Cambridge University
Press, 2007.
[20]. Gill, G. and Kemerer, C. “Cyclomatic complexity density
and software maintenance productivity,” Transactions on
Software Engineering, 17(12), 1991, pp. 1284-1288.
[21]. González-Barahona, J., Miguel A, Pérez, O, Quirós, P.,
González, J., and Olivera, V. “Counting potatoes. The size
of Debian 2.2,” Upgrade, 2(6), 2001, pp. 60-66.
[22]. Gorla, N., and Ramakrishnan, R. “Effect of Software
Structure Attributes Software Development Productivity,”
Journal of Systems and Software, 36(2), 1997, pp. 191-
199.
[23]. Grewal, R., Lilien, G., Mallapragada, G. “Location,
Location, Location: How Network Embeddedness Affects
Project Success in Open Source Systems,” Management
Science 52(7), 2006, pp. 1043-1056.
[24]. Harrison, W. and Cook, C. “Insights on improving the
maintenance process through software measurement,”
Proceedings of Conference on Software Maintenance, San
Diego,CA, 1990, pp. 37-44.
[25]. Harter, D. and Slaughter, S. “Process maturity and
software quality: a field study,” International Conference
on Information Systems, Brisbane, Australia, 2000, pp.
407-411.
[26]. Henkel, J. “Selective Revealing in Open Innovation
Processes: The Case of Embedded Linux,” Research
Policy, 35(7), 2006, pp. 953-969.
[27]. Henry, S., Kafura, D., and Harris, K. “On the Relationship
among Three Software Metrics,” ACM SIGMETRICS:
Performance Evaluation Review, 10(1), 1981, pp. 81-88.
[28]. Jones, T. Programming Productivity, McGraw-Hill, Inc.,
New York, 1986.
[29]. Kearney, J., Sedlmeyer, R., Thompson, W., Gray, M., and
Adler, M. “Software Complexity Measurement,” Communications
of the ACM, 29(11), 1986, pp. 1044-1050.
[30]. Kemerer, C. and Slaughter, S. “Determinants of Software
Maintenance Profiles: An Empirical Investigation,” Software
Maintenance: Research and Practice, 9(4), 1997, pp. 235-251.
[31]. Kemerer, C. F. “Software complexity and software maintenance:
A survey of empirical research,” Annals of Software
Engineering, 1(1), 1995, pp. 1-22.
[32]. Kim, S., Whitehead, E, and Bevan, J. “Analysis of signature
change patterns,” Proceedings of the 2005 international
workshop on Mining software repositories, St.Louis,MO,
2005, pp. 1-5.
[33]. Krishnamurthy, S. “Cave or Community? An Empirical
Examination of 100 Mature Open Source Projects,” First
Monday, 7(6), 2002.
[34]. Lakhani, K., and Wolf, B. “Why Hackers Do What They
Do: Understanding Motivation and Effort in Free/Open
Source Software Projects,” Perspectives on Free and Open
Source Software, MIT Press, Cambridge, 2005.
[35]. Lerner, J., and Tirole, J. “Some Simple Economics of Open
Source,” The Journal of Industrial Economics, 1(2), 2002,
pp. 197-234.
[36]. Loch, C., Mihm, J., and Huchzermeier, A. “Concurrent
Engineering and Design Oscillations in Complex
Engineering Projects,” Concurrent Engineering, 11(3),
2003, pp. 187-199.
[37]. Markus, M., Manville, B., and Agres, C. “What makes a
virtual organization work?,” Sloan Management Review,
42(1), 2000, pp. 13-26.
[38]. Opensource.org, “The Open Source Definition (Version
1.9)”, 2002, at http://www.opensource.org/ docs/definition.
html, accessed 5 May 2006.
[39]. Pigoski, T. Practical Software Maintenance, Wiley Computer
Publishing, 1997.
[40]. Ramanujan, S. and Cooper, R. “A human information
processing approach to software maintenance,” Omega,
22(2), 1994, pp. 85-203.
[41]. Raymond, E. “The Cathedral and the Bazaar,” 1999, at
http://tuxedo.org/~esr/writings/cathedral-bazaar/
[42]. Raymond, E. The cathedral and the bazaar: musings on
Linux and open source by an accidental revolutionary,
Sebastopol, CA, O’Reilly, 2001.
[43]. Rilling, J. and Klemola, T. “Identifying Comprehension
Bottlenecks Using Program Slicing and Cognitive
Complexity Metrics,” Proceedings of the 11th IEEE
International Workshop on Program Comprehension,
2003, pp. 115.
[44]. Schröter, A., Zimmermann, T., Premraj, R., and Zeller, A.
“If Your Bug Database Could Talk . . . ,” Proceedings of
ACM-IEEE 5th International Symposium on Empirical
Software Engineering, Volume II: Short Papers and
Posters, Brazil, 2006.
[45]. Sen, R, Subramaniam, C, and Nelson, M. “Determinants of
the Choice of Open Source Software License,” Journal of
Management Information Systems, 25(3), 2008-9, pp. 207-
240.
[46]. Smith, N., Capiluppi, A., and Ramil, J. “Agent-based
Simulation of Open Source Evolution,” Software Process
Improvement and Practice, 11(4), 2006, pp. 423-434.
[47]. Stamelos, I.; Angelis, L.; Oikonomou, A.; and Bleris,
G. “Code Quality Analysis in Open Source Software
Development,” Information Systems Journal, 12(1), 2002,
pp. 43-60.
[48]. Stewart, K., Ammeter, A., Maruping, L. “A Preliminary
Analysis of the Influences of Licensing and Organizational
Sponsorship on Success in Open Source Projects,”
Proceedings of the 38th Hawaii International Conference
on System Sciences, 2005, pp. 197-203.
[49]. Stewart, K., Darcy, D., Daniel, S. “Observations on Patterns
of Development in Open Source Software Projects, Open
Source Application Spaces,” Fifth Workshop on Open
Source Software Engineering, 2005, St Louis, MO, pp. 1-
5.
[50]. Studenmund, A. Using Econometrics: A Practical Guide,
Harper Collins, New York, NY, 1992.
[51]. von Hippel, E. and von Krogh, G. “Open Source Software and
the “Private-Collective” Innovation Model: Issues for
Organization Science,” Organization Science, 14(2), 2003,
pp. 209-225.
[52]. Weyuker, E. “Evaluating software complexity measures,”
IEEE Transactions on Software Engineering, 14(9), 1988,
pp. 1357-1365.
[53]. Withrow, C. “Error Density and Size in Ada Software,”
IEEE Software, 7(1), 1990, pp. 26-30.
[54]. Xu, J., Gao, Y., Christley, S., and Madey, G. “A Topological
Analysis of the Open Source Software Development
Community,” Proceedings of the 38th HICSS, 2005, pp.
198.
Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.
www.jatit.org
MODEL BASED OBJECT-ORIENTED SOFTWARE TESTING
SANTOSH KUMAR SWAIN¹, SUBHENDU KUMAR PANI², DURGA PRASAD MOHAPATRA³
¹School of Computer Engineering, KIIT University, Bhubaneswar, Orissa, India-751024
²Department of Computer Application, RCM Autonomous, Bhubaneswar, Orissa, India-751021
³Department of Computer Science & Engineering, NIT, Rourkela, Orissa, India
ABSTRACT
Testing is an important phase of quality control in software development. Software testing is necessary to
produce highly reliable systems. The use of a model to describe the behavior of a system is a proven and
major advantage in testing. In this paper, we focus on model-based testing. The term model-based testing
refers to test case derivation from a model representing software behavior. We discuss a model-based
approach to automatic testing of object oriented software which is carried out at the time of software
development. We review the reported research results in this area and also discuss recent trends. Finally, we
close with a discussion of where model-based testing fits in the present and future of software engineering.
Keywords: Testing, Object-oriented Software, UML, Model-based testing.
1. INTRODUCTION
The IEEE definition of testing is “the process of
exercising or evaluating a system or system
component by manual or automated means to verify
that it satisfies specified requirements or to identify
differences between expected and actual results.”
[16]. Software testing is the process of executing a
software system to determine whether it matches its
specification and executes in its intended
environment. A software failure occurs when a
piece of software does not perform as required and
expected. In testing, the software is executed with
input data, or test cases, and the output data is
observed. As the complexity and size of software
grows, the time and effort required to do sufficient
testing grows. Manual testing is time consuming,
labor-intensive and error prone. Therefore it is
pressing to automate the testing effort. The testing
effort can be divided into three parts: test case
generation, test execution, and test evaluation.
However, the problem that has received the highest
attention is test-case selection. A test case is the
triplet [S, I, O] where I is the data input to the
system, S is the state of the system at which the
data is input, and O is the expected output of the
system [17]. The output data produced by the
execution of the software with a particular test case
provides a specification of the actual program
behavior. Test case generation in practice is still
performed manually most of the time, since
automatic test case generation approaches require
formal or semi-formal specifications to select test
cases to detect faults in the code implementation.
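The [S, I, O] triplet described above can be represented directly; the class and helper names below are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TestCase:
    """The [S, I, O] triplet: state S at which the input is applied,
    input data I, and expected output O."""
    state: Any
    inputs: Any
    expected: Any

def run_test(system: Callable[[Any, Any], Any], case: TestCase) -> bool:
    """Execute the system under test in state S with input I and
    compare the actual output against the expected output O."""
    return system(case.state, case.inputs) == case.expected
```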
Code-based testing is not an entirely satisfactory approach, since it cannot guarantee acceptably thorough testing of modern software products. Source code is
no longer the single source for selecting test cases,
and nowadays, we can apply testing techniques all
along the development process, by basing test
selection on different pre-code artifacts, such as
requirements, specifications and design models
[2],[3]. Such a model may be generated from a
formal specification [7, 14] or may be designed by
software engineers through diagrammatic tools
[15]. Code based testing has two important
disadvantages. First, certain aspects of behavior of
a system are difficult to extract from code but are
easily obtained from design models. The state
based behavior captured in a state diagram and
message paths are simple examples of this. It is
very difficult to extract the state model of a class
from its code. On the other hand, it is usually
explicitly available in the design model. Similarly, the different sequences in which messages may be interchanged among classes during the use of the software are very difficult to extract from the code, but are explicitly available in the UML sequence
diagrams. Another prominent disadvantage is that code-based testing is very difficult to automate and therefore overwhelmingly depends on manual test case design.
Journal of Theoretical and Applied Information Technology
© 2005 – 2010 JATIT. All rights reserved.
www.jatit.org
31
An alternative approach is to generate test cases
from requirements and specifications. These test cases are derived from the analysis and design stages themselves. Test case generation from design
specifications has the added advantage of allowing
test cases to be available early in the software
development cycle, thereby making test planning
more effective. Model based testing (MBT), as
implied by the name itself, is the generation of test
cases and evaluation of test results based on design
and analysis models. This type of testing is in
contrast to the traditional approach that is based
solely on analysis of code and requirements
specification. In traditional approaches to software
testing, there are specific methodologies to select
test cases based on the source code of the program
to be tested. Test case design from the requirements
specification is a black box approach [14], whereas code-based testing is typically referred to as white box testing. Model-based testing, on the other hand, is referred to as a gray box testing approach.
Modern software products are often large and
exhibit very complex behavior. The Object-oriented
(OO) paradigm offers several benefits, such as
encapsulation, abstraction, and reusability to
improve the quality of software. However, at the
same time, OO features also introduce new
challenges for testers: interactions between objects
may give rise to subtle errors that could be hard to
detect. Object-oriented environment for design and
implementation of software brings about new issues
in software testing. This is because the above features of an object-oriented program create several testing problems and bug hazards [3]. The last decade has witnessed a very slow but steady
advancement made to the testing of object-oriented
systems. One of the main problems in testing
object-oriented programs is test case selection.
Models being simplified representations of systems
are more easily amenable for use in automated test
case generation. Automation of software development and testing activities on the basis of models can result in significant reductions in fault-removal effort, development time, and overall cost.
The concept of model-based testing was originally
derived from hardware testing, mainly in the
telecommunications and avionics industries. Of
late, the use of MBT has spread to a wide variety of
software product domains. For practical applications of MBT, the reader is referred to [18]. A model
is a simplified depiction of a real system. It
describes a system from a certain viewpoint. Two
different models of the same system may appear
entirely different since they describe the system
from different perspectives. For example, control flow, data flow, module dependencies, and program dependency graphs express very different aspects of the behavior of an implementation. A wide range of model types using a variety of specification formats, notations, and languages, such as UML, state diagrams, data flow diagrams, control flow diagrams, decision tables, decision trees, etc., have been established. We can roughly classify these
models into formal, semiformal and informal
models. Formal models are constructed using mathematical techniques such as logic, calculus, state machines, Markov chains, Petri nets, etc. Formal models have been successfully
used to automatically generate test cases. However,
at present formal models are very rarely constructed
in industry. Most of the models of software systems
constructed in industry are semiformal in nature. A
possible reason for this may be that the formal
models are very hard to construct. Our focus
therefore in this paper is the use of semiformal
models in testing object-oriented systems.
Pretschner et al. [3] present a detailed discussion reviewing model-based test generators. Baresel et al. [20] study the relationship between model and implementation coverage. The studies by Heimdahl and George [19] indicate that different test suites with the same coverage may detect fundamentally different numbers of errors.
This paper has been organized as follows. The next
section presents an overview of various models
used in object-oriented software testing. The key
activities in an MBT process are discussed in
section 3. Section 4 discusses the key benefits and
pitfall of MBT. Section 5 focuses use of model-
based testing in the present and future of software
engineering. Section 6 concludes the paper.
2. MODELS USED IN SOFTWARE
TESTING
In this section, we briefly review the important
software models that have been used in object-
oriented software testing.
2.1 UML Based Testing
Unified modeling language (UML) has over the last
decade turned out to be immensely popular in both
industry and academics and has been very widely
used for model-based testing. Since first being released in 1997, UML has undergone successive refinements. UML 2.0, the latest release of UML, allows a designer to model a system using a set of diagrams to capture five views of the system.
The use case model is the user’s view of the
system. A static /structural view (i.e. class diagram)
is used to model the structural aspects of the
system. The behavioral views depict various types
of behavior of a system. For example, the state
charts are used to describe the state based behavior
of a system. The sequence and collaboration
diagrams are used to describe the interactions that
occur among various objects of a system during the
operation of the system. The activity diagram
represents the sequence, concurrency, and
synchronization of various activities performed by
the system. Behavioral models are very important
in test case design, since most of the testing detect
bugs that manifest during specific run of the
software i.e. during a specific behavior of the
software. Besides the behavioral models, it is
possible to construct the implementation and
environmental views of the system. The object
constraint language (OCL) makes it possible to
have precise models.
The work reported in [1-3, 5, 8] discusses various aspects of UML-based model testing. A vast majority of the work examining MBT of object-oriented systems focuses on the use of either class or state diagrams. Both these categories of work overwhelmingly address unit testing. Class
diagrams provide information about public
interfaces of classes, method signatures, and the
various types of relationships among classes. The
state diagram-based testing focuses on making the objects assume all possible states and undertake all possible transitions. Several works reported recently address the use of sequence diagrams, activity diagrams, and collaboration diagrams in testing [9].
2.2 Finite State Machines
Finite state machines (FSMs) have long been used to capture the state-based behavior of systems. Finite state machines (also known as finite
automata) have been around even before the
inception of software engineering. There is a stable
and mature theory of computing at the center of
which are finite state machines and other variations.
Using finite state models in the design and testing of computer hardware components has long been established and is considered a standard practice today. [13] was one of the earliest, generally
available articles addressing the use of finite state
models to design and test software components.
Finite state models are an obvious fit with software
testing where testers deal with the chore of
constructing input sequences to supply as test data;
state machines (directed graphs) are ideal models
for describing sequences of inputs. This, combined
with a wealth of graph traversal algorithms, makes
generating tests less of a burden than manual
testing. On the downside, complex software implies
large state machines, which are nontrivial to
construct and maintain. FSMs, being flat representations, are handicapped by the state explosion problem. Statecharts are an extension of FSMs that has been proposed specifically to address this shortcoming [21]. Statecharts are hierarchical models: each state of a statechart may consist of lower-level state machines. Moreover, they support specification of state-level concurrency. Testing using statecharts has been discussed in [21].
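The graph-traversal idea above can be sketched concretely: given an FSM as a transition table, a breadth-first search finds, for every transition, a shortest event sequence from the start state that exercises it, yielding an all-transitions test suite. The encoding and function names are illustrative assumptions:

```python
from collections import deque

def transition_paths(fsm, start):
    """All-transitions coverage: for each transition in the model,
    return an event sequence from the start state that exercises it.
    fsm: dict mapping state -> {event: next_state}."""
    def shortest_events(target):
        # Breadth-first search over states, recording events taken.
        seen, queue = {start}, deque([(start, [])])
        while queue:
            state, events = queue.popleft()
            if state == target:
                return events
            for ev, nxt in fsm.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, events + [ev]))
        return None  # target unreachable from start
    paths = []
    for src, moves in fsm.items():
        prefix = shortest_events(src)
        if prefix is None:
            continue  # transitions out of an unreachable state
        for ev in moves:
            paths.append(prefix + [ev])
    return paths
```

For a two-state door model with transitions closed-open-opened and opened-close-closed, this yields the suite [["open"], ["open", "close"]], covering both transitions.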
2.3 Markov Chains
Markov chains are stochastic models [24]. A
specific class of Markov chains, the discrete-
parameter, finite-state, time-homogenous,
irreducible Markov chain, has been used to model
the usage of software. They are structurally similar
to finite state machines and can be thought of as
probabilistic automata. Their primary worth has been not only in generating tests, but also in gathering and analyzing failure data to estimate
such measures as reliability and mean time to
failure. The body of literature on Markov chains in
testing is substantial and not always easy reading.
Work on testing particular systems can be found in
[22] and [23].
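A probabilistic walk over such a usage model can be sketched as follows (the chain encoding and function name are assumptions for illustration, not from the cited literature):

```python
import random

def usage_walk(chain, start, end, rng=None):
    """Generate one test sequence by walking a Markov usage model.
    chain: dict mapping state -> list of (next_state, probability);
    probabilities out of each state must sum to 1."""
    rng = rng or random.Random(0)  # fixed seed for reproducible suites
    path, state = [start], start
    while state != end:
        r, acc = rng.random(), 0.0
        for nxt, p in chain[state]:
            acc += p
            if r <= acc:
                state = nxt
                break
        else:
            state = chain[state][-1][0]  # guard against rounding error
        path.append(state)
    return path
```

Repeating the walk generates test sequences distributed according to expected usage, which is what makes the resulting failure data suitable for reliability estimation.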
2.4 Grammars
Grammars have mostly been used to describe the
syntax of programming and other input languages.
Functionally speaking, different classes of
grammars are equivalent to different forms of state
machines. Sometimes, they are much easier and
more compact representation for modeling certain
systems such as parsers. Although they require
some training, they are, thereafter, generally easy to
write, review, and maintain. However, they may
present some concerns when it comes to generating
tests and defining coverage criteria, areas where not
many articles have been published.
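As an illustration of grammar-based test generation, the sketch below randomly expands a context-free grammar into a syntactically valid input string; the grammar encoding and the depth guard are illustrative assumptions:

```python
import random

def generate(grammar, symbol, rng, depth=0, max_depth=8):
    """Randomly expand a nonterminal of a context-free grammar into a
    test input string. grammar: dict mapping nonterminal -> list of
    alternatives, each alternative a list of symbols."""
    if symbol not in grammar:
        return symbol  # terminal symbol: emit as-is
    alts = grammar[symbol]
    if depth >= max_depth:
        # Force termination: take the shortest alternative, which is
        # assumed to be non-recursive in a well-formed test grammar.
        alts = [min(alts, key=len)]
    alt = rng.choice(alts)
    return "".join(generate(grammar, s, rng, depth + 1, max_depth)
                   for s in alt)
```

With an expression grammar such as E ::= E "+" N | N and N ::= "1" | "2" | "3", every generated string is a valid input for a parser under test.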
3. A TYPICAL MODEL-BASED TESTING
PROCESS
In this section, we discuss the different activities constituting a typical MBT process. Fig. 1 displays the main activities in the life cycle of an MBT process. The rectangles in Fig. 1 represent specific artifacts developed and used during MBT; the ovals represent activities performed during MBT.
Figure 1. A Typical Model Based Testing Process
3.1 Construction of intermediate model
Several strategies have been reported to generate test cases using a variety of models. However, in many cases the test cases are based on more than one model type. In such cases, it becomes necessary to first construct an integrated model based on the information present in the different models.
3.2 Generation of test scenarios
The test cases generated from models are in the form of sequences of test scenarios. Test scenarios specify a high-level test case rather than the exact data to be input to the system. For example, in the case of FSMs, a test scenario can be the sequence in which specific states and transitions must be undertaken to test the system, called a transition path. The sequences of the different transition labels along the generated paths form the required test scenarios. Similarly, message paths can be generated from the sequence diagrams, showing the exact sequence of messages through which the classes must interact for testing the system.
3.3 Test Generation
The difficulty of generating tests from a model
depends on the nature of the model. Models that are
useful for testing usually possess properties that
make test generation effortless and, frequently,
automatable. For some models, all that is required is to go through combinations of conditions described in the model, requiring only a simple knowledge of combinatorics. There are a variety of constraints
on what constitutes a path to meet the criteria for
tests. It includes having the path start and end in the
starting state, restricting the number of loops or
cycles in a path, and restricting the states that a path
can visit.
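The path constraints listed above can be enforced mechanically during traversal. Below is a sketch, under an assumed transition-table encoding, that enumerates paths starting and ending in the start state while bounding how often each state may be visited:

```python
def bounded_paths(fsm, start, max_visits=2):
    """Enumerate transition paths that start and end in the starting
    state, visiting no state more than max_visits times (loop bound).
    fsm: dict mapping state -> {event: next_state}."""
    paths = []
    def dfs(state, events, visits):
        if events and state == start:
            paths.append(events)  # path closed at the starting state
            return
        for ev, nxt in fsm.get(state, {}).items():
            if visits.get(nxt, 0) < max_visits:
                v = dict(visits)
                v[nxt] = v.get(nxt, 0) + 1
                dfs(nxt, events + [ev], v)
    dfs(start, [], {start: 1})
    return paths
```

Raising `max_visits` admits more loop iterations per path, trading a larger test suite for deeper exercise of cyclic behavior.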
3.4 Automatic test case execution
In certain cases the tests can even be executed manually, although manual testing is labor-intensive and time-consuming. Moreover, the generated test suite is usually too large for manual execution.
Moreover, a key point in MBT is the frequent
regeneration and re-running of the test suite
whenever the underlying model is changed.
Accordingly, achieving the full potential of MBT requires automated test execution. Usually, using the available testing interface for the software, the abstract test suite is translated into an executable test script. Automatic test case execution also involves test coverage analysis. Based on the test coverage analysis, the test generation step may be fine-tuned or different strategies may be tried out.
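Translating an abstract test suite into executable form typically routes each model-level event through an adaptation layer onto the system's testing interface. A minimal sketch under assumed names:

```python
def execute_abstract_test(events, adapter, system):
    """Replay an abstract event sequence on the system under test.
    adapter: dict mapping each model-level event name to a callable
    that performs the corresponding concrete call on the system."""
    log = []
    for ev in events:
        outcome = adapter[ev](system)  # concrete call for the event
        log.append((ev, outcome))      # record for later evaluation
    return log
```

The returned log pairs each abstract event with its observed outcome, which is the input to the test-evaluation step.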
[Figure 1 depicts the MBT workflow: software model(s) are transformed into an intermediate testing representation; a test case generator, guided by coverage criteria, produces test scenarios, test cases, and test data; test execution yields test results, which feed into analysis.]
3.5 Test Coverage Analysis
Each test generation method targets certain specific features of the system to be tested. The extent to which the targeted features are tested can be determined using test coverage analysis [10,12]. An important coverage criterion based on a model is the following: all-model-parts (or test scenarios) coverage is achieved when the test reaches every part of the model at least once. Important test coverage criteria based on UML models include: path coverage, message path coverage, transition path coverage, scenario coverage, dataflow coverage, polymorphic coverage, and inheritance coverage. Scenario coverage is achieved when the test executes every scenario identifiable in the model at least once.
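Transition coverage of the kind listed above can be measured by replaying a test suite against the model. A sketch under an assumed transition-table encoding:

```python
def transition_coverage(fsm, test_suite, start):
    """Fraction of model transitions exercised by a test suite.
    fsm: dict mapping state -> {event: next_state}; test_suite is a
    list of event sequences, each replayed from the start state."""
    all_transitions = {(s, ev) for s, moves in fsm.items() for ev in moves}
    covered = set()
    for events in test_suite:
        state = start
        for ev in events:
            if ev not in fsm.get(state, {}):
                break  # event invalid in this state: stop the replay
            covered.add((state, ev))
            state = fsm[state][ev]
    return len(covered) / len(all_transitions) if all_transitions else 1.0
```

A coverage value below 1.0 signals that the generation step should be fine-tuned or a different strategy tried, as described above.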
4. A CRITIQUE OF MBT
Some important MBT advantages can be
summarized in the following points. It allows
achieving higher test coverage. This is especially
true of certain behavioral aspects which are difficult
to identify in the code. Another important
advantage of model-based testing is that when a code change occurs to fix a coding error, the test cases generated from the model need not change. As an example, changing the behavior of a single control in the user interface of the software makes all the test cases using that control outdated; in traditional testing scenarios, the tester has to manually search for the affected test cases and update them. Since even when the code changes, the changed code still conforms to the model, model-based test suite generation often overcomes this problem.
However MBT does have certain restrictions and
limitations. Needless to say, as with several other
approaches, to reap the most benefit from MBT,
substantial investment needs to be made. Skills,
time, and other resources need to be allocated for
making preparations, overcoming common
difficulties, and working around the major
drawbacks. Therefore, before embarking on an MBT endeavor, this overhead needs to be weighed against potential rewards in order to determine whether a model-based technique is suited to the task at hand.
MBT demands certain skills of testers. They need
to be familiar with the model and its underlying and
supporting mathematics and theories. In the case of
finite state models, this means a working
knowledge of the various forms of finite state
machines and a basic familiarity with formal
languages, automata theory, and perhaps graph
theory and elementary statistics. They need to
possess expertise in tools, scripts, and programming
languages necessary for various tasks. For example,
in order to simulate human user input, testers need
to write simulation scripts in a specialized
language.
In order to save resources at various stages of the
testing process, MBT requires sizeable initial effort.
Selecting the type of model, partitioning system
functionality into multiple parts of a model, and
finally building the model are all labor-intensive
tasks that can become prohibitive in magnitude
without a combination of careful planning, good
tools, and expert support. Finally, there are
drawbacks of models that cannot be completely
avoided, and workarounds need to be devised. The
most prominent problem for state models (and most
other similar models) is state space explosion.
Briefly, models of almost any non-trivial software
functionality can grow beyond management even
with tool support. State explosion propagates into
almost all other model-based tasks such as model
maintenance, checking and review, non-random test
case generation, and achieving coverage criteria.
The generated test cases may in many cases become irrelevant due to the disparity between a model and its corresponding code. MBT can never displace code-based testing, since models constructed during the development process lack several implementation details that are required to generate test cases.
Fortunately, many of these problems can be
resolved one way or the other with some basic skill
and organization. Alternative styles of testing need
to be considered where insurmountable problems
that prevent productivity are encountered.
5. MBT IN SOFTWARE ENGINEERING:
TODAY AND TOMORROW
Good software testers cannot avoid models. MBT
calls for explicit definition of the testing endeavor.
However, software testers of today have a difficult time planning such a modeling effort. They are victims of the ad hoc nature of the development process, where requirements change drastically and the rule of the day is constant ship mode. Today, the scene
seems to be changing. Modeling in general seems
to be gaining favor; particularly in domains where
quality is essential and less-than-adequate software
is not an option. When modeling occurs as a part of
the specification and design process, these models
can be leveraged to form the basis of MBT.
There is promising future for MBT as software
becomes even more ubiquitous and quality
becomes the only distinguishing factor between
brands. When all vendors have the same features,
the same ship schedules and the same
interoperability, the only reason to buy one product
over another is quality. MBT, of course, cannot and
will not guarantee or even assure quality. However,
its very nature, thinking through uses and test
scenarios in advance while still allowing for the
addition of new insights, makes it a natural choice
for testers concerned about completeness,
effectiveness and efficiency.
The real work that remains for the near future is
fitting specific models (finite state machines,
grammars or language-based models) to specific
application domains. Perhaps, special purpose
models will be made to satisfy very specific testing
requirements and models that are more general will
be composed from any number of pre-built special-
purpose models. However, to achieve these goals,
models must evolve from mental understanding to
artifacts formatted to achieve readability and
reusability. We must form an understanding of how
we are testing and be able to sufficiently
communicate that understanding so that testing
insight can be encapsulated as a model for any and
all to benefit from.
6. CONCLUSION
Good software testers cannot avoid models. MBT
has emerged as a useful and efficient testing
method for realizing adequate test coverage of
systems. The usage of MBT brings substantial benefit in terms of increased productivity and reduced development time and costs. On the other hand, MBT cannot replace code-based testing, since models are abstract, higher-level representations and lack several details present in the code. It is expected that in the future, models will be constructed by extracting relevant information from the design, which can automate the test case design process to a great extent.
Not surprisingly, there are no software models
today that fit all intents and purposes.
Consequently, for each situation decisions need to
be made as to what model (or collection of models)
is most suitable. There are some guidelines to be considered that are derived from earlier experiences. The choice of a model also depends on aspects of the system under test and on the skills of the user. However, there is little or no published data that conclusively suggests that one model outperforms others when more than one model is intuitively appropriate.
REFERENCES:
[1]. W. Prenninger, A. Pretschner, Abstractions for
Model-Based Testing, ENTCS 116 (2005) 59–
71.
[2]. A. Pretschner, J. Philipps, Methodological Issues in Model-Based Testing, in: [29], 2005, pp. 281–291.
[3]. J. Philipps, A. Pretschner, O. Slotosch, E. Aiglstorfer, S. Kriebel, K. Scholl, Model-based test case generation for smart cards, in: Proc. 8th Intl. Workshop on Formal Meth. for Industrial Critical Syst., 2003, pp. 168–192.
[4]. G. Walton, J. Poore, Generating transition probabilities to support model-based software testing, Software: Practice and Experience 30 (10) (2000) 1095–1106.
[5]. A. Pretschner, O. Slotosch, E. Aiglstorfer, S. Kriebel, Model-based testing for real: the in-house card case study, J. Software Tools for Technology Transfer 5 (2-3) (2004) 140–157.
[6]. A. Pretschner, W. Prenninger, S. Wagner, C. Kühnel, M. Baumgartner, B. Sostawa, R. Zölch, T. Stauner, One evaluation of model-based testing and its automation, in: Proc. ICSE'05, 2005, pp. 392–401.
[7]. E. Bernard, B. Legeard, X. Luck, F. Peureux, Generation of test sequences from formal specifications: GSM 11.11 standard case-study, SW Practice and Experience 34 (10) (2004) 915–948.
[8]. E. Farchi, A. Hartman, S. S. Pinter, Using a
model-based test generator to test for standard
conformance, IBM Systems Journal 41 (1)
(2002) 89–110.
[9]. D. Lee, M. Yannakakis, Principles and methods of testing finite state machines: a survey, Proceedings of the IEEE 84 (2) (1996) 1090–1126.
[10]. H. Zhu, P. Hall, J. May, Software Unit Test
Coverage and Adequacy, ACM Computing
Surveys 29 (4) (1997) 366–427.
[11]. B. Beizer, Black-Box Testing: Techniques for Functional Testing of Software and Systems, Wiley, 1995.
[12]. C. Gaston, D. Seifert, Evaluating Coverage-
Based Testing, in: [29], 2005, pp. 293–322.
[13]. A. Offutt, S. Liu, A. Abdurazik, P. Ammann, Generating test data from state-based specifications, J. Software Testing, Verification and Reliability 13 (1) (2003) 25–53.
[14]. A. Pretschner, Model-Based Testing in Practice, in: Proc. Formal Methods, Vol. 3582 of Springer LNCS, 2005, pp. 537–541.
[15]. R. V. Binder, Testing Object-Oriented Systems: Models, Patterns, and Tools, Addison-Wesley, 1999.
[16]. R. Helm, I. M. Holland, D. Gangopadhyay, Contracts: specifying behavioral compositions in object-oriented systems, in: Proceedings of the 5th Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '90), ACM SIGPLAN Notices 25 (10) (1990) 169–180.
[17]. R. Mall, Fundamentals of Software
Engineering, Second ed., Prentice-Hall,
Englewood Cliffs, NJ, 2003.
[18]. I. Gronau, A. Hartman, A. Kirshin, K. Nagin, S. Olvovsky, A Methodology and Architecture for Automated Software Testing, Technical Report, IBM Research Laboratory, MATAM Advanced Technology Center, Haifa 31905, Israel, 2000.
[19]. M. Heimdahl, D. George, Test Suite Reduction for Model-Based Tests: Effects on Test Quality and Implications for Testing, in: Proceedings of the 19th International Conference on Automated Software Engineering, 2004, pp. 176–185.
[20]. A. Baresel, M. Conrad, S. Sadeghipour, J. Wegener, The Interplay between Model Coverage and Code Coverage, in: EuroCAST, Dec. 2003.
[21]. D. Harel, Statecharts: A Visual Formalism for Complex Systems, Science of Computer Programming 8 (3) (1987) 231–274.
[22]. K. Agrawal, J. A. Whittaker, Experiences in Applying Statistical Testing to a Real-Time, Embedded Software System, in: Proceedings of the Pacific Northwest Software Quality Conference, October 1993.
[23]. A. Avritzer, B. Larson, Load Testing Software Using Deterministic State Testing, in: Proceedings of the 1993 International Symposium on Software Testing and Analysis (ISSTA 1993), ACM, Cambridge, MA, USA, 1993, pp. 82–88.
[24]. J. G. Kemeny, J. L. Snell, Finite Markov Chains, Springer-Verlag, New York, 1976.
BIOGRAPHY:
Santosh Kumar Swain is
presently working as
teaching faculty in School of
Computer Engineering,
KIIT University, KIIT,
Bhubaneswar, Orissa, India.
He has acquired his M.Tech
degree from Utkal
University, Bhubaneswar. He
has contributed more than four papers to Journals
and Proceedings. He has written one book on
“Fundamentals of Computer and Programming in
C”. He is a research student of KIIT University,
Bhubaneswar. His interests are in Software
Engineering, Object Oriented Systems, Sensor
Network and Compiler Design etc.
Dr. Durga Prasad Mohapatra
studied his M.Tech at National
Institute of Technology,
Rourkela, India. He has
received his Ph. D from Indian
Institute of Technology,
Kharagpur, India. Currently, he
is working as Associate
Professor at National Institute
of Technology, Rourkela. His special fields of interest include Software Engineering, Discrete Mathematical Structures, Slicing, Object-Oriented Programming, Real-Time Systems, and Distributed Computing.
Journal of Biomedical Informatics 43 (2010) 782–790
Contents lists available at ScienceDirect
Journal of Biomedical Informatics
Complementary methods of system usability evaluation: Surveys and observations
during software design and development cycles
Jan Horsky a,b,c,*, Kerry McColgan a, Justine E. Pang a, Andrea J. Melnikas a, Jeffrey A. Linder a,b,c,
Jeffrey L. Schnipper a,b,c, Blackford Middleton a,b,c
a Clinical Informatics Research and Development, Partners HealthCare, Boston, USA
b Division of General Medicine and Primary Care, Brigham and Women’s Hospital, Boston, USA
c Harvard Medical School, Boston, USA
Article history:
Received 11 December 2009
Available online 26 May 2010
Keywords:
Health information technology
Clinical information systems
Usability evaluations
Design and development
Adoption of HIT
1532-0464/$ - see front matter © 2010 Elsevier Inc. All rights reserved.
doi:10.1016/j.jbi.2010.05.010
* Corresponding author at: Clinical Informatics Research and Development, Partners HealthCare, 93 Worcester St., Wellesley, MA 02481, USA. Fax: +1 781 416 8771.
E-mail address: jhorsky@partners.org (J. Horsky).
Poor usability of clinical information systems delays their adoption by clinicians and limits potential
improvements to the efficiency and safety of care. Recurring usability evaluations are, therefore, integral
to the system design process. We compared four methods employed during the development of outpa-
tient clinical documentation software: clinician email response, online survey, observations and inter-
views. Results suggest that no single method identifies all or most problems. Rather, each approach is
optimal for evaluations at a different stage of design and characterizes a different usability aspect. Email
responses elicited from clinicians and surveys report mostly technical, biomedical, terminology and con-
trol problems and are most effective when a working prototype has been completed. Observations of clin-
ical work and interviews inform conceptual and workflow-related problems and are best performed early
in the cycle. Appropriate use of these methods consistently during development may significantly
improve system usability and contribute to higher adoption rates among clinicians and to improved qual-
ity of care.
© 2010 Elsevier Inc. All rights reserved.
1. Introduction
There is a broad consensus among healthcare researchers, prac-
titioners and administrators that although health information
technology has the potential to reduce the risk of serious injury
to patients in hospitals, significant differences remain among the
multitude of electronic health record (EHR) systems with respect
to their ability to achieve high safety, quality and effectiveness
benchmarks [1–4]. In many instances, the intrinsic potential of
EHRs for preventing and mitigating errors continues to be only par-
tially realized and some implementations may, paradoxically, ex-
pose clinicians to new risks or add extra time to many routine
interactions [5,6].
Research evidence and published reports on the successes, fail-
ures, best-practices, lessons learned and barriers overcome during
implementation efforts have had only limited effect so far on accel-
erating the adoption of electronic information systems [7]. Accord-
ing to conservative estimates, at least 40% of systems either are
abandoned or fail to meet business requirements, and fewer than
40% of large vendor systems meet their stated goals [8]. A recent
national study reported that only four percent of physicians used
a fully functional, advanced system and that 13% used systems
with only basic functions [9].
Transition from paper records to electronic means of informa-
tion management is an arduous process at large institutions and
private practices alike. It introduces new standards and reshapes
familiar practices often in ways unintended or unanticipated by
the stakeholders. Clinicians object to forced changes in established
workflows and familiar practices, long training times, and exces-
sive time spent serving the computer rather than providing care
[10,11].
Although the initial decline in efficiency generally improves
with increased skills and sufficient time to adjust to new routines
[12], systems themselves rarely evolve to better meet the demands
and requirements of the clinical processes they need to support. A
recent survey found an increase in the availability of EHRs over two
years in one state, but the researchers also reported that routine
use of ten core functions remained relatively low, with more than
one out of five physicians not using each available function regu-
larly [13]. An observational study of 88 primary care physicians
identified key information management goals, strategies, and tasks
in ambulatory practice and found that nearly half were not fully
supported by available information technology [14].
Developing highly functional, versatile clinical information sys-
tems that can be efficiently and conveniently used without exten-
sive training periods is predicated on incorporating rigorous and
frequent usability evaluations into the design process. Iterative
development methodology for graphical interfaces suggests evalu-
ating and revising successive prototypes in a cyclical fashion until
the product attains required characteristics. There are several com-
mon techniques that can be used to perform the evaluations that
are either carried out entirely by usability experts or involve the in-
put of intended users. Equally important is to see usability evalua-
tion as situated within the context of challenges imposed by
complex socio-technical systems [15] and within broader concep-
tual frameworks for design and evaluation such as those based on
the theory of distributed cognition and work-centered research
[16].
The broad objective of this study was to compare data gathered
by four usability evaluation methods and discuss their respective
utility at different stages of the software development process.
We hypothesized that no single method would be equally effective
in characterizing every aspect of the interface and human interac-
tion. Rather, an approach that employs a set of complementary
methods would increase their cumulative explanatory value by
applying them selectively for specific purposes. Our narrower goal
was to formulate recommendations for designers and evaluators of
health information systems on the effective use of common usabil-
ity inspection methods during the design and development cycle.
This report expands a brief discussion of methods used in the
design, pilot testing, and evaluation of the Smart Form in a previ-
ous publication [17].
2. Background
The reasons why one system may be preferred over another by
clinicians and perform closer to expectations are often complex,
vary with local conditions and almost always include financing,
leadership, prior experience and training. Among the core predic-
tors of quick adoption and successful implementation are the de-
sign quality of the graphical user interface and functionality,
along with socio-technical factors [7]. Usability has a strong, often
direct relationship with clinical productivity, error rate, user fati-
gue and user satisfaction that are critical for adoption. The system
must be fast and easy to use, and the user interface must behave
consistently in all situations [18]. At the same time, the system
must support all relevant clinical tasks well, so that a clinician
working with the computer can achieve a higher quality of care.
The Healthcare Information and Management Systems Society
(HIMSS) considers poor usability characteristics of current infor-
mation technology as one of the major factors, and ‘‘possibly the
most important factor” hindering its widespread adoption [19].
Historically, developers and designers have failed to tap the
experiential expertise of practicing clinicians [20]. The lack of a
systematic consideration of how clinical and computing tasks are
performed in the situational context of different clinical environ-
ments often results in designs that are off the intended mark and
fail to deliver improvements in safety and efficiency. For example,
in an experiment that examined the interactive behavior of clini-
cians entering a visit note, researchers compared the sequence
and flow of items on an electronic note form that was implied by
the designed structure to actual mouse movements and entry
sequences recorded by tracking software, and found substantial
differences between the observed behavior and prior assumptions by
the designers [21].
Existing usability studies mainly employ research designs such
as expert inspection, simulated experiments, and self-reported
user satisfaction surveys. Unfortunately, a large body of research
indicates that self-reports can be a highly unreliable source of data,
often context-dependent, and even minor changes in question
wording, format or order can profoundly affect the obtained results
[22].
While analyses that rely predominantly on a single method may
produce incomplete or unreliable results, there is considerable evi-
dence of the effectiveness of comprehensive approaches that com-
bine two or more methods, as important redesign ideas rarely
emerge as sudden insights but may evolve throughout the work
process [23,24]. For example, during the development of a decision
support system, designers employed field observations, structured
interviews, and document analyses to collect and analyze users’
workflow patterns, decision support goals, and preferences regard-
ing interactions with the system, performed think-aloud analyses
and used the technology acceptance model to direct evaluation
of users’ perceptions of the prototype [25]. A careful workflow
analysis could lead to the identification of potential breakdown
points, such as vulnerabilities in hand-offs, and communication
tasks deemed critical could be required to have a traceable elec-
tronic receipt acknowledgment [26]. The advantage of informing
the design from its conception with close insights into local needs
and actual practices the software will support is reflected in the
fact that ‘‘home-grown” systems show a higher relative risk reduc-
tion than commercial systems [1].
Iterative development of user interfaces involves the steady
refinement of the design based on user testing and other evalua-
tion methods [27]. The complexity and variability of clinical work
requires correspondingly complex information systems that are
virtually impossible to design without usability problems in a sin-
gle attempt. Experts need to create a situation in which clinicians
can instill their knowledge and concern into the design process
from the very beginning [28]. Changing or redesigning a software
system as complex as an EHR after it has been developed (or imple-
mented) is enormously difficult, error-prone, and expensive
[29,30]. Iterative evaluations early in the process allow larger con-
ceptual revisions and refinements to be done without excessive ef-
fort and resources [31].
The software developed, tested and deployed in a pilot program
in this study, the Coronary Artery Disease (CAD) and Diabetes Mel-
litus (DM) Smart Form (Fig. 1), was a prototype of an application
intended to assist clinicians with documenting and managing the
care of patients with chronic diseases [17]. Integrated within an
outpatient electronic record, it allowed direct access to laboratory
and other coded data for expedient entry into new visit notes. The
Smart Form also aggregated reviewing of prior notes and labora-
tory results to create disease-relevant context for the planning of
care, and provided actionable decision support and best-practices
recommendations. The anticipated benefit to clinicians includes
savings in time required to look up, collect, interpret and record
clinical data into a note, and an increase in the quality and com-
pleteness of documentation that may contribute to improved pa-
tient care.
In the planning stage of the development, two experts, includ-
ing a physician, conducted focus groups with approximately 25
physicians who described their usual workflows, methods for
acute and chronic disease management, attitudes towards decision
support, and their wants and needs, and summarized emerging
themes [17].
3. Methods
We have conducted four different studies of usability and hu-
man–computer interaction that were intended to collect two types
of data: comments elicited directly from clinicians working with
the Smart Form, and findings derived from formal evaluations by
usability experts.

Fig. 1. Screenshot of Smart Form.

We rigorously maintained distinctions between
direct, free-style comments made by clinicians and objective find-
ings by usability experts. Comments were always direct expres-
sions of clinicians that originated either spontaneously or in
response to a question, written or verbal. Findings, on the other
hand, were expert opinions and recommendations based on field
notes, interviews, focus groups and on direct observation of clini-
cians interacting with the Smart Form.
We chose to count and compare comments and findings, rather than
actual problems, because of the uncertainty in determining whether
any two or more user reports describe identical problems: comments
may be vague, too general or lack the context needed to match them
to unique problems. Since we could not differentiate all problems in
a consistent manner, we report the comments and findings themselves
as approximations to actual problems.
In the first study, clinicians sent their comments by email dur-
ing a 3-month pilot period in which they used the module for the
documentation of actual visits. Another set of comments, in the
second study, were entered in an online survey at the end of
the pilot. We also extracted direct quotes of clinicians from tran-
scripts of interviews and think-aloud protocols that were com-
pleted as parts of usability evaluation in the remaining two
studies. The findings, in contrast, were formulated entirely by
usability experts as the result of a series of evaluation studies
(third and fourth) and published in technical reports.
Each comment and finding was independently assigned to a usability
heuristic category by two researchers. The classification scheme was
specific to the healthcare domain; its development is described in
detail in a section below. The number of comments and findings in
each category was compared to assess the descriptive power of each
data collection method for specific usability characteristics. For
example, we contrasted the proportions of comments from each source
that contributed to the total number of observations in each category.
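The comparison described above is, in effect, a cross-tabulation of statements by source and heuristic category. A minimal Python sketch of that tally (the category and source names follow the paper; the individual records here are invented for illustration):

```python
from collections import Counter

# Each analyzed statement is reduced to a (source, category) pair.
# The records below are invented examples, not the study data.
records = [
    ("email", "Fault"),
    ("email", "Biomedical"),
    ("survey", "Customization"),
    ("evaluation", "Workflow"),
    ("interview", "Cognition"),
    ("email", "Fault"),
]

by_category = Counter(cat for _, cat in records)
by_source_and_category = Counter(records)
total = len(records)

# Share of all observations falling into each category.
for cat, n in by_category.most_common():
    print(f"{cat}: {n} ({100 * n / total:.0f}%)")

# Proportion contributed by one source within one category,
# mirroring the per-category comparison described in the text.
fault_total = by_category["Fault"]
print(by_source_and_category[("email", "Fault")] / fault_total)
```

With real data, each row of Table 1 corresponds to one `by_category` entry broken down by `by_source_and_category`.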
The four data collection methods are described in detail below.
Think-aloud studies were conducted by a usability expert at our
institution; walkthroughs and evaluations were conducted by
independent professional evaluators on a contract basis.
3.1. Email via an embedded link
The Smart Form was integrated within the outpatient clinical
records system and used by 18 clinicians for 3 months (March to
May 2006) in the course of their regular clinical work to write visit
notes for patients with coronary artery disease and diabetes. They
had the option of opening a free-text window on their desktops at
any time by clicking on a link embedded in the application and
typing in their comments. The messages were collected in a data-
base and logged with a timestamp and the sender’s name.
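The collection mechanism described above (a free-text window whose messages are stored in a database with a timestamp and the sender's name) can be sketched as follows; the schema and function name are illustrative assumptions, not the actual Partners implementation:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the paper only says messages were logged
# with a timestamp and the sender's name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " sender TEXT NOT NULL,"
    " logged_at TEXT NOT NULL,"
    " body TEXT NOT NULL)"
)

def log_comment(sender: str, body: str) -> None:
    """Store one free-text comment with a UTC timestamp."""
    conn.execute(
        "INSERT INTO comments (sender, logged_at, body) VALUES (?, ?, ?)",
        (sender, datetime.now(timezone.utc).isoformat(), body),
    )
    conn.commit()

log_comment("dr_example", "The Save action froze the screen.")
rows = conn.execute("SELECT sender, body FROM comments").fetchall()
print(rows)
```

Logging the sender makes it possible to follow up on vague reports, at the cost of the self-selection effects discussed later in the paper.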
3.2. Online survey
Fifteen participants received an email with a link to an online
survey in May 2006. Questions about satisfaction, frequency of
use and problems had multiple-choice responses and were accom-
panied by two open-ended questions, ‘‘What changes could be
made to the Smart Form that would make you more likely to use
it?” and ‘‘What improvements can be made to the Smart Form be-
fore you would recommend it to other clinicians?” Completion was
voluntary and rewarded with a $20 gift certificate.
3.3. Think-aloud study and observations
We recruited six primary care physicians and specialists (four
women) to participate in usability and interaction studies. Evalua-
tions were conducted in the clinicians’ offices at six different clinics
and lasted 30–45 min. Subjects were asked to complete a series of
interactive tasks described in a previously developed clinical sce-
nario. A researcher played the role of a patient during each session
to provide a realistic representation of an office visit. Medical his-
tory, current medications and the presence of diabetes and CAD
were included in a narrative paragraph that was accompanied by
supporting electronic documentation of prior visits, lab results, vi-
tals and demographic information in a simulated patient record.
Subjects were instructed to verbalize their thoughts (to think-
aloud) as they were completing the tasks and interacting with
the Smart Form. Video and audio recordings of each session were
made with Morae [32] usability evaluation software installed on
portable computers. The verbal content was transcribed for analy-
sis to be used together with the resulting screen captures. In a
debriefing period after completion, subjects were asked follow-
up questions to elaborate or elucidate their actions and reasoning.
The results of this study were compiled in a technical report.
3.4. Walkthroughs, expert evaluations and interviews
A team of professional health informatics consultants independently
carried out usability assessments and walkthroughs and conducted
interviews with six primary care physicians and specialists (two
women) whose experience with the application ranged
from novice to expert. The results of the evaluation were presented
in a technical report.
3.5. The development of heuristic usability assessment scheme
Four sets of usability heuristics with a substantial theoretical
overlap have been generally accepted and are widely used in pro-
fessional evaluations: Nielsen’s 10 usability heuristics [33] (de-
rived from the results of a factor analysis of about 250
problems), Shneiderman’s Eight Golden Rules of Interface Design
[34], Tognazzini’s First Principles of Interaction Design [35], and
a set of principles based on Edward Tufte’s visual display work
[36]. These approaches were recently integrated into a single Mul-
tiple Heuristics Evaluation Table by identifying overlaps and com-
bining conceptually related items [37].
These general heuristics sets have been used to evaluate health-
care-related applications [38–41] and consumer-health websites
A set of aggregated Nielsen's and Shneiderman's heuristics
was proposed by Zhang and colleagues [43] for HIT and applied
to the evaluation of an infusion pump [44] and a clinical web appli-
cation [45]. However, the categories and guidelines do not specifi-
cally address biomedical or clinical concepts. Our goal was to
formulate additional categories to increase their cumulative
explanatory power.
To this end we analyzed all 155 statements about usability
problems collected during the study to identify emergent themes
following the grounded theory principles [46]. Two researchers
then independently assigned the statements into heuristic catego-
ries, either general or modified according to newly identified
themes. Several iterative coding sessions and discussions ensued,
and as a result of extensive comparison and refinement, 12 heuris-
tic categories were formulated (Table 3).
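Independent coding by two researchers naturally invites a measure of inter-rater agreement. The paper does not report one, but Cohen's kappa over the two coders' label sequences would be the conventional choice; a generic sketch with invented label sequences:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of category labels.

    Assumes chance agreement is below 1 (i.e., the coders do not both
    use a single identical label throughout).
    """
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from the marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys()
    )
    return (observed - expected) / (1 - expected)

# Invented example labels, not the study's actual codings.
a = ["Workflow", "Fault", "Cognition", "Fault", "Control", "Workflow"]
b = ["Workflow", "Fault", "Cognition", "Control", "Control", "Workflow"]
print(round(cohens_kappa(a, b), 3))
```

Disagreements surfaced by such a measure would feed directly into the iterative coding sessions the authors describe.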
3.6. Participants

All data were collected from 45 clinicians within the Partners
Healthcare practice network who participated in either part of the
study (with a small overlap). Most were primary care physicians
(73%), about half were female (53%), and the mean age of the group
was 48 years.

Table 1
Comments by heuristic category and source.

Heuristic category | Email N (%) | Survey N (%) | Evaluation N (%) | Interview N (%) | Totals N (%)
Biomedical         | 21 (81)     | 0            | 1 (4)            | 4 (15)          | 26 (17)
Cognition          | 12 (46)     | 3 (12)       | 4 (15)           | 7 (27)          | 26 (17)
Control            | 17 (61)     | 4 (14)       | 5 (18)           | 2 (7)           | 28 (18)
Customization      | 7 (29)      | 5 (28)       | 1 (6)            | 5 (28)          | 18 (12)
Fault              | 16 (94)     | 1 (6)        | 0                | 0               | 17 (11)
Speed              | 3 (43)      | 3 (43)       | 1 (14)           | 0               | 7 (5)
Terminology        | 4 (100)     | 0            | 0                | 0               | 4 (3)
Transparency       | 4 (36)      | 1 (9)        | 6 (55)           | 0               | 11 (7)
Workflow           | 1 (6)       | 3 (17)       | 8 (44)           | 6 (33)          | 18 (12)
Totals             | 85 (55)     | 20 (13)      | 26 (17)          | 24 (15)         | 155 (100)
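The row and column totals and percentages in Table 1 can be rederived from the raw counts; a quick arithmetic check (counts transcribed from the table):

```python
# Raw counts per heuristic category, in source order:
# email, survey, evaluation, interview (from Table 1).
counts = {
    "Biomedical":    (21, 0, 1, 4),
    "Cognition":     (12, 3, 4, 7),
    "Control":       (17, 4, 5, 2),
    "Customization": (7, 5, 1, 5),
    "Fault":         (16, 1, 0, 0),
    "Speed":         (3, 3, 1, 0),
    "Terminology":   (4, 0, 0, 0),
    "Transparency":  (4, 1, 6, 0),
    "Workflow":      (1, 3, 8, 6),
}

row_totals = {cat: sum(row) for cat, row in counts.items()}
grand_total = sum(row_totals.values())
column_totals = [sum(col) for col in zip(*counts.values())]

print(grand_total)       # total number of comments
print(column_totals)     # per-source totals: email, survey, evaluation, interview
print(round(100 * column_totals[0] / grand_total))  # email share, in percent
```

The computed totals (155 comments overall; 85, 20, 26 and 24 per source) agree with the figures reported in Section 4.1.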
4. Results
Analyses were performed separately on comments by clinicians
and on findings by usability experts. Results are presented in the
following sections and contrasted.
4.1. Comments by clinicians
Results for comments are summarized in Table 1. There were
155 comments from 36 clinicians obtained either in the form of
written communication (email and survey) or transcribed from di-
rect verbal quotes (interview and evaluation). We received 85
emails from nine clinicians (reflecting a 50% response rate), and
20 free-text comments were entered in the online survey by 15 cli-
nicians (54% response). Six clinicians who participated in usability
evaluations made 26 comments and another six clinicians made 24
distinct comments during interviews.
Over half of all responses (55%) were emails, and roughly equal
numbers were obtained from the survey, evaluations and interviews
(13%, 17% and 15%, respectively). The most common form of response,
constituting about a third of the collected data (N = 54), was an
email classified in the Biomedical, Control or Fault category.
Comments from the other three sources were most
likely to be classified in the following categories: Customization
and Control for survey (N = 9, 45%), Transparency and Workflow
for evaluations (N = 14, 54%), and Cognition and Workflow for
interviews (N = 13, 54%). Overall, the Control, Cognition and Bio-
medical categories described about half of all data (52%), and
about a third (35%) fell into the Customization, Workflow
and Fault categories. There were no Consistency or Context
comments.
Although email was the most prevalent form of communication
in the set, its proportion differed within each heuristic category
(Fig. 2). For example, it accounted for 80% or more in three
categories (Terminology, Fault and Biomedical) and for a majority
(61%) in the Control category, but only one email was classified as
related to Workflow. Written response was more likely to be used for the
reporting of technical, biomedical and interaction problems (e.g.,
Fault, Biomedical, Terminology, Control), while verbal comments
often related to Workflow or Transparency difficulties. For exam-
ple, almost 90% of comments made during evaluations were clustered
in just four categories, and a similar distribution was found in
data from interviews.
4.2. Findings by usability evaluators
The results are summarized in Table 2. There were 47 findings
extracted from expert reports. Over two thirds were classified into
just three categories: Cognition, Customization and Workflow. In
contrast, none were in the Fault, Speed or Terminology categories
and only one was classified as Biomedical. Technical and biomed-
ical concepts were generally not represented in the evaluations.
4.3. Comments and findings comparison
We contrasted all 47 findings with a subset of 105 comments
that included only email and survey. Findings were derived from
reports of evaluation and interviews that already contained
reinterpreted verbal comments of the subjects. We therefore excluded
comments made during evaluations from the comparison.

Table 2
Findings by heuristic category and source.

Heuristic category | Evaluation N | Interview N | Total findings N (%)
Biomedical         | 1            | 0           | 1 (2)
Cognition          | 10           | 6           | 16 (34)
Control            | 2            | 4           | 6 (13)
Customization      | 2            | 7           | 9 (19)
Consistency        | 0            | 1           | 1 (2)
Context            | 1            | 1           | 2 (4)
Transparency       | 5            | 0           | 5 (11)
Workflow           | 7            | 0           | 7 (15)
Totals             | 28           | 19          | 47 (100)

Table 3
Description of heuristic evaluation categories.

Consistency: Hierarchy, grouping, dependencies and levels of
significance are visually conveyed by systematically applied
appearance characteristics, perceptual cues, spatial layout, text
formatting and pre-defined color sets. Behavior of controls is
predictable. Language in commands, labels and warnings is
standardized.

Transparency: The current state is apparent and possible future
states are predictable. Action effects, their closure and failure are
indicated.

Control: Interruption, resumption and non-linear or parallel task
completion are possible. Direct access to data across levels of
hierarchy, backtracking, recovery from unwanted states and reversal
of actions are possible.

Cognition: Content avoids extraneous information and excessive
density. Representational formats allow perceptual judgment and
unambiguous interpretation. Cognitive effort is reduced by
minimalistic design, formatting and use of color, allowing fast
visual searches. Recognition is preferred over recall. The conceptual
model corresponds to the work context and environment.

Context: Terms, labels, symbols and icons are meaningful and
unambiguous in different system states. Alerts and reminders
perceptually distinguish between general (disease, procedure,
guidelines) and patient-specific content.

Terminology: Medical language is meaningful to users in all contexts
of work, compatible with local variations and established terms.

Biomedical: Biomedical knowledge used in rules and decision support
is current and accurate, reflecting guidelines and standards. It is
evident how suggestions are derived from data and what decision
logic is followed.

Safety: Complex combinations of medication doses, frequencies, units
and time durations are disambiguated by appropriate representational
formats and language; entries are audited for allowed value limits.
Omissions are mitigated by goal and task completion summary views.
Errors are prevented from accumulating and propagating through the
system.

Customization: Preferred data views, organization, sorting,
filtering, defaults, basic screen layout and behavior are persistent
over use sessions and can be defined individually or according to
role.

Fault: Software failures and functional errors are minimal, do not
compromise safety and prevent the loss of data.

Speed: Minimal latency of screen loads and high perceived speed of
task completion.

Workflow: Navigation, data entry and retrieval do not impede clinical
task completion and the flow of events in the environment.
Comments and findings showed divergent trends in character-
izing usability aspects of the Smart Form (Fig. 3). Comments were
more likely to describe discrete, clearly manifested and highly spe-
cific problems and events, such as software failures or concerns
about medical logic or language (e.g., Control, Biomedical, Fault,
Terminology). Findings derived from usability evaluation, on the
other hand, tended to explain conceptual problems related to over-
all design and the suitability of the electronic tool to clinical work
(e.g., Consistency, Context, Workflow). Both methods contributed
about equally to the description of problems with human interac-
tion (e.g., Cognition, Customization).
4.4. Implementation of design changes to a revised prototype
Individual comments and findings most often referred to single,
discrete problems. Some problems were reported by several clini-
cians or were identified by multiple methods. The 155 analyzed
comments and findings reported 120 unique problems (77% ratio),
and 12 problems were simultaneously described by more than one
method (10% ratio). We have iteratively implemented design
changes into the prototype on the basis of 56 reported problems
(47%). Most of the problems that led to subsequent changes (34)
were reported by email.
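The ratios in this paragraph follow directly from the reported counts; a short arithmetic check:

```python
# Counts as stated in Section 4.4.
reports = 155          # analyzed comments and findings
unique_problems = 120  # distinct problems they described
multi_method = 12      # problems described by more than one method
changed = 56           # problems that led to design changes

pct_unique = round(100 * unique_problems / reports)        # 77
pct_multi = round(100 * multi_method / unique_problems)    # 10
pct_changed = round(100 * changed / unique_problems)       # 47
print(pct_unique, pct_multi, pct_changed)
```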
5. Discussion
Our data analysis has identified the relative strengths and
weaknesses of the four evaluation approaches, their distinct utility
and appropriateness for characterizing different usability concepts,
and their cumulative explanatory power as a set of complementary
methods used at specific points of the development lifecycle. The
large number of comments that clinicians provided was a rich
source of reports on software failures, slow performance and
potential conflicts and inconsistencies in biomedical content, while
usability experts generally gave comprehensive assessments of
problems related to human interaction and workflow, including
characterizations of problems with interface design and layout that
negatively affect cognitive and perceptual saliency of displayed
information. The core principles, attributes and expected results
for each method are summarized in Table 4 and discussed in depth
in the following sections.
5.1. Email
An email link embedded in the application is available to every-
one and at all times, allowing almost instantaneous reporting of
problems as they occur. Informaticians and computer technology
specialists can learn from these comments how the software per-
forms in authentic work conditions and how well it supports clini-
cians in complex scenarios that commonly arise from the
combination of personal workflows and preferences, unexpected
events, and unusual, idiosyncratic, unplanned or non-standard
interaction patterns. The wide range of conditions that affect
performance and contribute to errors and failures would be impossible
to anticipate and simulate in the laboratory. Performance
measures in actual settings also give evidence of the technical
and conceptual strengths of the design. Insights from these reports
give designers a unique opportunity to make the application more
robust and tolerant of atypical interaction, more effective in man-
aging and preventing errors, and more appropriate for the clinical
task it supports.
The large number and variety of email reports, and their often
fragmentary content, can make them hard to interpret. For example,
it is difficult for clinicians to accurately recall the relevant
descriptive details of errors made or problems encountered during
complex interactions with multi-step or interleaving tasks, and to
convey a meaningful description of the event. Informaticians,
however, may need details about the system state, work context or
preceding actions, which are often lacking in spontaneous and short
messages, to evaluate how a problem originated
Fig. 2. Proportions of comments by heuristic and source.

Fig. 3. Proportion of comments and findings by heuristic.
and its potential consequences. The usually large volume of emails
accumulated over time also contains repetitive, idiosyncratic and
inaccurate reports that may be of little value and need to be ex-
cluded. A self-selection bias among respondents (e.g., novice users
may be underrepresented) may accentuate marginal problems or
conceal more serious ones. Difficulties of more conceptual charac-
ter may be only rarely reported through comment messages, as
was evident from the analysis of our data (e.g., the distribution
of comments in heuristic categories, Fig. 2).
Among the most significant advantages of embedded email re-
sponse links are their inexpensive implementation, network-wide
availability, real-time response and continuous, active data collec-
tion. These characteristics make email an excellent data collection
method during pilot testing of release candidate versions and after
the release of full versions. There is a high probability of quickly
discovering technical problems, an opportunity to review medical
logic for decision support tools that may not have been tested in
complex scenarios (e.g., a patient with multiple comorbidities
and drug prescriptions), and a likelihood of finding inconsistencies
in terminology or ambiguities in language and expressions. For
example, of the 56 changes and corrections we implemented in
the prototype, 36 (64%) addressed problems reported and identified
in emails.
This method requires the software to be in the stage of a fully
functional prototype or in its final release form. It may therefore
be too laborious or expensive to make significant conceptual
changes in design at that point. However, our data suggest (e.g.,
the proportions of comments to specific concepts in Fig. 2) that
most of email-reported problems concern specific biomedical con-
tent, terminology and technical glitches that may be relatively easy
to correct without large-scale changes in the code and screen
layout.

Table 4
Comparison of clinician response and formal usability evaluation results.

Heuristic focus
- Email: Biomedical, Cognition, Control, Customization, Fault, Speed, Terminology
- Survey: Control, Customization, Speed
- Usability studies: Cognition, Context, Consistency, Control, Customization, Safety, Transparency, Workflow

Evaluated aspects
- Email: Software problems, medical logic, decision support, use of terms, perceived speed, interaction difficulties, desired functions
- Survey: Satisfaction, perceived speed of completion, qualitative assessments, desired functions, personal preferences, use context
- Usability studies: Design concepts, actual and anticipated errors, cognitive load, workflow fit, cognitive model, skilled and novice performance

When to perform
- Email: Pilot release, shortly after full release
- Survey: After pilot release, after full release, periodically
- Usability studies: Early in design cycle, iteratively during prototyping, planning stage before new design

Advantages
- Email: Can identify rare and complex use situations, immediate response when problem occurs, everyone can comment
- Survey: Allows comparison over time, broad reach, can be web-based, ongoing
- Usability studies: Describes human error, mental models, strategy, structured and reliable, rich detail, insights into workflow integration

Limitations
- Email: Often missing context, may not be intelligible, does not capture human error, self-selection bias
- Survey: Relatively low reliability, reflective, subjective, may be hard to interpret
- Usability studies: Laborious, expertise is required, describes only few use cases, needs expensive physician time

Source of data
- Email: Clinicians
- Survey: Clinicians
- Usability studies: Usability experts

Timeframe
- Email: Continuous
- Survey: Periodic
- Usability studies: Episodic

Sample quotes
- Email: "I saved the note, then tried to Sign, and the system just froze"; "There appears to be a problem with logic when the creatinine is too low"; "I found it challenging to find signature location. Spent extra time just looking at the screens"
- Survey: "Allow for easier management of insulin and titration"; "Needs a faster medication entry format"; "I find it cumbersome to my workflow"
- Usability studies: She did not notice the Save icon and searched for a Save button at the bottom of the window. He knew where to look for vitals, but had to enter new values manually. Hide non-essential icons. Create a dynamic right-click context menu.
5.2. Survey
The survey is another form of direct clinician response used in this
study. It shares several characteristics with email communication:
a potentially wide reach, economy of administration, a tendency
toward self-selection bias, a relatively low response rate and
brevity of form. Unlike email links, surveys are
structured and contain a pre-determined set of questions to elicit
responses and opinions on narrow topics of interest. They do not
allow reporting problems in real time, however, and require
respondents to recall and interpret past events at the time the sur-
vey is completed. This may be difficult, as our data suggest that
free-text answers to open-ended questions did not contain refer-
ences to specific and detailed biomedical and technical problems,
the most frequent categories represented in emails (see Table 1).
Rather, clinicians tended to describe more broadly defined difficul-
ties with screen control, navigation and customization.
The content in surveys, as in other direct forms of communica-
tion, is often subjective, reflecting personal opinion, and therefore,
of lower descriptive value and accuracy than data gathered in pro-
fessional evaluations [22]. A substantial period of time needs to be
allowed for potential survey respondents to work with a fully
working prototype or the completed application before they can
form meaningful opinions and gain a measure of proficiency.
Surveys can be administered periodically for comparisons over
time and can be timed to coincide with important events such as
technology or procedures updates that may affect the way the sys-
tem is used. They can also be targeted to specific groups, such as
primary care physicians, pediatricians and other specialists.
5.3. Usability evaluations and interviews
The most telling indicators of conceptual flaws in the design come from the observation of human interaction errors [47]. They can provide insights into discrepancies between expected and actual behavior and identify inappropriate and ambiguous representational formats of information on the screen that impair its accurate interpretation [48]. Errors are rarely reported directly in emails or in surveys, as the responders are not often aware of their own mistakes. For example, observation experts in our study reported that a clinician during a simulated task “could not tell whether the patient was taking Aspirin, assumed that urinalysis could only be ordered on paper and did not notice the save button,” an insight that would not be gained by introspection and recall.
Usability inspection methods in which experts alone evaluate the interface, such as the cognitive walkthrough and heuristic evaluation, provide predominantly normative assessments. In other words, they report how well the interface supports the completion of a standardized task that can reasonably be expected to be performed routinely, and measure the extent to which the design adheres to general usability standards. These methods produce reference models of interaction that can be compared to evidence from field observations.
Ethnographic and observational methods such as think-aloud studies, on the other hand, derive data from analyzing unscripted and natural interactions with the software by non-experts with various levels of computer and task-domain skills. They are therefore inherently descriptive and analytic and allow researchers to make inferences about the clarity and suitability of the design to the task from observed competencies and errors. Usability experts can integrate findings about interaction errors with interface evaluations, cognitive walkthroughs and heuristic evaluations into a comprehensive analysis and formulate optimal strategies for making modifications to the interface. Normative and descriptive methods together constitute a comprehensive evaluation of design in progress that can be repeated iteratively early in the process to refine data representation and interaction concepts in each successive version.
Findings from experts in this study have been clearly focused on conceptual and interaction-related aspects of the Smart Form (Table 2). The structured format of think-aloud studies follows pre-defined clinical scenarios that generally contain validated biomedical data and unambiguous terminology that do not represent potential problems to be reported in evaluations. Comments from clinicians working with the software in real settings, however, are more descriptive of specific factual, technical and biomedical errors that observational studies frequently do not capture. The relative proportions of expert findings and clinicians' comments in each heuristic category and their respective tendency to describe different aspects of the software are clearly evident in Fig. 3.
Experts can also more easily capture positive aspects of the design and confirm successful trends. For example, an evaluator reported that “the subject seemed comfortable navigating around and understands how to update medications in the system.” Email responses are often initiated at the time of a failure or when an error is encountered, but rarely when the system is working well. In effect, successful performance is characterized by uneventful and well-progressing work which is apparent to observers but not often reported back to designers by clinicians.
Interviews with clinicians are usually done in conjunction with observations to elucidate aspects of collected data that require proper context for interpretation, and also as “debriefings” at the end of think-aloud studies. The results of expert evaluations commonly incorporate insights and findings from interviews into comprehensive reports.
Expert evaluations are indispensable during the initial design stages when even significant corrections and reconceptualizations are still possible without incurring steep penalties in time and development effort.
6. Conclusion
This study has been conducted to characterize and compare four usability evaluation methods that were employed by the research team during the design and pilot testing of new clinical documentation software. We have also formulated a classification scheme of heuristic usability concepts that incorporates established principles and extends them for evaluations specific to the clinical software domain.
Our results suggest that no single method describes all or most usability problems better than the others, but rather that each is optimally suited for evaluations at different points of the design and deployment process, and that they all characterize different aspects of the interface and human interaction. The studies and assessments we have performed were embedded in the design process and spanned the entire development cycle.
Heuristic evaluations and ethnographic observations of actual clinical work by usability experts inform and guide conceptual and workflow-related changes and need to be performed iteratively early in the design cycle so that they can be incorporated without excessive effort and time. Responses elicited directly from clinicians and other users through email links and surveys report mostly technical, biomedical, terminology and control problems that may occur in a wide variety of workflows and idiosyncratic use patterns.
The evaluations were conducted on the relatively small scale of a pilot study. However, the smaller size may be typical of many software development efforts at large academic and healthcare centers. The findings and lessons learned in this study may, therefore, be of interest to information system designers, developers and research and development centers affiliated with hospitals and directly related to their experiences with the design and improvement of clinical information systems. We have outlined a methodological approach that is applicable to most development processes of software intended for healthcare information systems.
We plan to formally validate and possibly revise the set of heuristics we formulated and apply it to the evaluation of an information system in its entirety, which will also include judgments about safety that were not performed in this pilot study.
Health information technology is still in its nascent state today. Order entry systems, for example, still represented only a second-generation technology in 2006 and had many limitations that precluded their meaningful integration into the process of care [49]. Applications not appropriately matched to clinical tasks tend to be chronically underused and may be eventually abandoned [21].
Acknowledgments
The Smart Form research was supported by Grant 5R01HS015169-03 from the Agency for Healthcare Research and Quality. We wish to thank Alan Rose, Ruslana Tsurikova, Lynn Volk and Svetlana Turovsky for their contribution and expertise in data collection and initial interpretation, and all clinicians who participated in the four studies as subjects.
References
[1] Ammenwerth E, Schnell-Inderst P, Machan C, Siebert U. The effect of electronic
prescribing on medication errors and adverse drug events: a systematic
review. J Am Med Inform Assoc 2008;15:585–600.
[2] Linder JA, Ma J, Bates DW, Middleton B, Stafford RS. Electronic health record
use and the quality of ambulatory care in the United States. Arch Intern Med
2007;167:1400–5.
[3] Chaudhry B, Wang J, Wu S, Maglione M, Mojica W, Roth E, et al. Systematic
review: impact of health information technology on quality, efficiency, and
costs of medical care. Ann Intern Med 2006;144:742–52 [see comment].
[4] Kaushal R, Shojania KG, Bates DW. Effects of computerized physician order
entry and clinical decision support systems on medication safety: a systematic
review. Arch Intern Med 2003;163:1409–16.
[5] Koppel R, Metlay JP, Cohen A, Abaluck B, Localio AR, Kimmel SE, et al. Role of
computerized physician order entry systems in facilitating medication errors.
JAMA 2005;293:1197–203.
[6] Horsky J, Kuperman GJ, Patel VL. Comprehensive analysis of a medication
dosing error related to CPOE. J Am Med Inform Assoc 2005;12:377–82.
[7] Ludwick DA, Doucette J. Adopting electronic medical records in primary care:
lessons learned from health information systems implementation experience
in seven countries. Int J Med Inform 2009;78:22–31.
[8] Kaplan B, Harris-Salamone KD. White paper: Health IT project success and
failure: recommendations from literature and an AMIA workshop. J Am Med
Inform Assoc 2009;16:291–9.
[9] DesRoches CM, Campbell EG, Rao SR, Donelan K, Ferris TG, Jha AK, et al.
Electronic health records in ambulatory care: a national survey of physicians.
N Engl J Med 2008;359:50–60.
[10] Smelcer JB, Miller-Jacobs H, Kantrovich L. Usability of electronic medical
records. J Usability Stud 2009;4:70–84.
[11] Harrison MI, Koppel R, Bar-Lev S. Unintended consequences of information
technologies in health care: an interactive sociotechnical analysis. J Am Med
Inform Assoc 2007;14:542–9.
[12] Pizziferri L, Kittler AF, Volk LA, Honour MM, Gupta S, Wang SJ, et al. Primary
care physician time utilization before and after implementation of an
electronic health record: a time-motion study. J Biomed Inform
2005;38:176–88.
[13] Simon SR, Soran CS, Kaushal R, Jenter CA, Volk LA, Burdick E, et al. Physicians’
use of key functions in electronic health records from 2005 to 2007: a
statewide survey. J Am Med Inform Assoc 2009;16:465–70.
[14] Weir CR, Nebeker JJR, Hicken BL, Campo R, Drews F, LeBar B. A cognitive task
analysis of information management strategies in a computerized provider
order entry environment. J Am Med Inform Assoc 2007;14:65–75.
[15] Vicente KJ. Work domain analysis and task analysis: a difference that matters.
In: Schraagen JM, Chipman SF, editors. Cognitive task analysis. Mahwah,
NJ: Lawrence Erlbaum Associates, Inc.; 2000. p. 101–18.
[16] Zhang J, Butler K. UFuRT: A work-centered framework and process for design
and evaluation of information systems. HCI International Proceedings; 2007.
[17] Schnipper JL, Linder JA, Palchuk MB, Einbinder JS, Li Q, Postilnik A, et al. ‘‘Smart
Forms” in an electronic medical record: documentation-based clinical decision
support to improve disease management. J Am Med Inform Assoc
2008;15:513–23.
[18] Sittig DF, Stead WW. Computer-based physician order entry: the state of the
art. J Am Med Inform Assoc 1994;1:108–23.
[19] HIMSS EHR usability task force. Defining and testing EMR usability: principles
and proposed methods of EMR usability evaluation and rating. HIMSS; 2009.
[20] Ball MJ, Silva JS, Bierstock S, Douglas JV, Norcio AF, Chakraborty J, et al. Failure
to provide clinicians useful IT systems: opportunities to leapfrog current
technologies. Methods Inf Med 2008;47:4–7.
[21] Zheng K, Padman R, Johnson MP, Diamond HS. An interface-driven analysis of
user interactions with an electronic health records system. J Am Med Inform
Assoc 2009;16:228–37.
[22] Schwarz N, Oyserman D. Asking questions about behavior: cognition,
communication, and questionnaire construction. Am J Eval 2001;22:127.
[23] Jaspers MWM. A comparison of usability methods for testing interactive
health technologies: methodological aspects and empirical evidence. Int J Med
Inform 2009;78:340–53.
[24] Uldall-Espersen T, Frokjaer E, Hornbaek K. Tracing impact in a usability
improvement process. Interact Comput 2008;20:48–63.
[25] Peleg M, Shachak A, Wang D, Karnieli E. Using multi-perspective
methodologies to study users’ interactions with the prototype front end of a
guideline-based decision support system for diabetic foot care. Int J Med
Inform 2009;78:482–93.
[26] Sittig DF, Singh H. Eight rights of safe electronic health record use. JAMA
2009;302:1111–3.
[27] Nielsen J. Iterative user interface design. IEEE Comput 1993;26:32–41.
[28] Gould JD, Lewis C. Designing for usability: key principles and what designers
think. Commun. ACM 1985;28:300–11.
[29] Walker JM, Carayon P, Leveson N, Paulus RA, Tooker J, Chin H, et al. EHR safety:
the way forward to safe and effective systems. J Am Med Inform Assoc
2008;15:272–7.
[30] Leveson NG. Intent specifications: an approach to building human-centered
specifications. IEEE Trans Software Eng 2000;26:15–35.
[31] Wachter SB, Agutter J, Syroid N, Drews F, Weinger MB, Westenskow D. The
employment of an iterative design process to develop a pulmonary graphical
display. J Am Med Inform Assoc 2003;10:363–72.
[32] Morae. 3.1 ed., Okemos, MI: TechSmith Corporation; 2009.
[33] Nielsen J, Mack RL. Usability inspection methods. New York: John Wiley &
Sons; 1994.
[34] Shneiderman B. Designing the user interface. Strategies for effective human–
computer-interaction. 4th ed. Reading, MA: Addison Wesley Longman; 2004.
[35] Tognazzini B. Tog on interface. Reading, Mass.: Addison-Wesley; 1992.
[36] Tufte ER. The visual display of quantitative information. 2nd ed. Cheshire,
Conn.: Graphics Press; 2001.
[37] Atkinson BF, Bennet TO, Bahr GS, Nelson MM. Development of a multiple
heuristics evaluation table (MHET) to support software development and
usability analysis. In: Universal access in human–computer interaction: coping
with diversity. Berlin/Heidelberg: Springer; 2007.
[38] Thyvalikakath TP, Schleyer TK, Monaco V. Heuristic evaluation of clinical
functions in four practice management systems: a pilot study. J Am Dent Assoc
2007;138:209–10.
[39] Scandurra I, Hagglund M, Engstrom M, Koch S. Heuristic evaluation performed
by usability-educated clinicians: education and attitudes. Stud Health Technol
Inform 2007:205–16.
[40] Lai TY. Iterative refinement of a tailored system for self-care management of
depressive symptoms in people living with HIV/AIDS through heuristic
evaluation and end user testing. Int J Med Inform 2007;76:S317–24.
[41] Tang Z, Johnson TR, Tindall RD, Zhang J. Applying heuristic evaluation to improve
the usability of a telemedicine system. Telemed J E Health 2006;12:24–34.
[42] Choi J, Bakken S. Heuristic evaluation of a web-based educational resource for
low literacy NICU parents. Stud Health Technol Inform 2006:194–9.
[43] Zhang J, Johnson TR, Patel VL, Paige DL, Kubose TK. Using usability heuristics to
evaluate patient safety of medical devices. J Biomed Inform 2003;36:23–30.
[44] Graham MJ, Kubose TK, Jordan DA, Zhang J, Johnson TR, Patel VL. Heuristic
evaluation of infusion pumps: implications for patient safety in intensive care
units. Int J Med Inform 2004;73:771–9.
[45] Allen M, Currie LM, Bakken S, Patel VL, Cimino JJ. Heuristic evaluation of paper-
based web pages: a simplified inspection usability methodology. J Biomed
Inform 2006;39:412–23.
[46] Corbin JM, Strauss AL. Basics of qualitative research: techniques and
procedures for developing grounded theory. 3rd ed. Los Angeles, Calif.: Sage
Publications, Inc.; 2008.
[47] Hall JG, Silva A. A conceptual model for the analysis of mishaps in human-
operated safety-critical systems. Saf Sci 2008;46:22–37.
[48] Johnson CM, Turley JP. The significance of cognitive modeling in building
healthcare interfaces. Int J Med Inform 2006;75:163–72.
[49] Ford EW, McAlearney AS, Phillips MT, Menachemi N, Rudolph B. Predicting
computerized physician order entry system adoption in US hospitals: can the
federal mandate be met? Int J Med Inform 2008;77:539–45.
- Complementary methods of system usability evaluation: Surveys and observations during software design and development cycles
International Journal of Performability Engineering Vol. 6, No. 6, November 2010, pp. 531-546.
© RAMS Consultants
Printed in India
* Corresponding author's email: nschneid@nps.navy.mil
Successful Application of Software Reliability: A Case Study
NORMAN F. SCHNEIDEWIND
Fellow of the IEEE
2822 Raccoon Trail
Pebble Beach, California 93953 USA
(Received on July 30, 2009, revised on May 3, 2010)
Abstract: The purpose of this case study is to help readers implement or improve a software reliability program in their organizations, using a step-by-step approach based on the Institute of Electrical and Electronics Engineers (IEEE) and American Institute of Aeronautics and Astronautics (AIAA) Recommended Practice for Software Reliability, released in June 2008, supported by a case study from the NASA Space Shuttle.
This case study covers the major phases that the software engineering practitioner needs in planning and executing a software reliability engineering program. These phases require a number of steps for their implementation. These steps provide a structured approach to the software reliability process. Each step will be discussed to provide a good understanding of the entire software reliability process. Major topics covered are: data collection, reliability risk assessment, reliability prediction, reliability prediction interpretation, testing, reliability decisions, and lessons learned from the NASA Space Shuttle software reliability engineering program.
Keywords: software reliability program, IEEE/AIAA Recommended Practice for Software Reliability, NASA Space Shuttle application
1. Introduction
The IEEE/AIAA recommended practice provides a foundation on which practitioners and researchers can build consistent methods [1]. This case study will describe the SRE process and show that it is important for an organization to have a disciplined process if it is to produce high-reliability software. To accomplish this purpose, an overview is presented of existing practice in software reliability, as represented by the recommended practice [1]. This will provide the reader with the foundation to understand the basic process of Software Reliability Engineering (SRE). The Space Shuttle Primary Avionics Software Subsystem will be used to illustrate the SRE process.
The reliability prediction models that will be used are based on some key definitions and assumptions, as follows:
Definitions
Interval: an integer time unit t of constant or variable length, defined on the range (t−1, t], t > 0; failures are counted in intervals.
Number of Intervals: the number of contiguous integer time units t of constant or variable length, represented by a positive real number.
Operational Increment (OI): a software system comprised of modules and configured from a series of builds to meet Shuttle mission functional requirements.
Time: continuous CPU execution time over an interval range.
Assumptions
1. Faults that cause failures are removed.
2. As more failures occur and more faults are corrected, remaining failures will be reduced.
3. The remaining failures are "zero" for those OI's that were executed for extremely long times (years) with no additional failure reports; correspondingly, for these OI's, maximum failures equals total observed failures.
1.1 Space Shuttle Flight Software Application
The Shuttle software represents a successful integration of many of the computer industry's most advanced software engineering practices and approaches. Beginning in the late 1970's, this software development and maintenance project has evolved one of the world's most mature software processes, applying the principles of the highest levels of the Software Engineering Institute's (SEI) Capability Maturity Model (the software is rated Level 5 on the SEI scale) and ISO 9001 Standards [2]. This software process includes state-of-the-practice software reliability engineering (SRE) methodologies.
The goals of the recommended practice are to: interpret software reliability predictions, support verification and validation of the software, assess the risk of deploying the software, predict the reliability of the software, develop test strategies to bring the software into conformance with reliability specifications, and make reliability decisions regarding deployment of the software.
Reliability predictions are used by the developer to add confidence to a formal software certification process comprised of requirements risk analysis, design and code inspections, testing, and independent verification and validation. This case study uses the experience obtained from the application of SRE on the Shuttle project, because this application is judged by NASA and the developer to be a successful application of SRE [6]. These SRE techniques and concepts should be of value for other software systems.
1.2 Reliability Measurements and Predictions
There are a number of measurements and predictions that can be made of reliability to verify and validate the software. Among these are remaining failures, maximum failures, total test time required to attain a given fraction of remaining failures, and time to next failure. These have been shown to be useful measurements and predictions for: 1) providing confidence that the software has achieved reliability goals; 2) rationalizing how long to test a software component (e.g., testing sufficiently long to verify that the measured reliability conforms to design specifications); and 3) analyzing the risk of not achieving remaining failures and time to next failure goals [6]. Having predictions of the extent to which the software is not fault free (remaining failures) and whether a failure is likely to occur during a mission (time to next failure) provides criteria for assessing the risk of deploying the software. Furthermore, fraction of remaining failures can be used both as an operational quality goal in predicting total test time requirements and, conversely, as an indicator of operational quality as a function of total test time expended [6].
The various software reliability measurements and predictions can be divided into the following two categories, to be used in combination to assist in assuring the desired level of reliability of the software in mission critical systems like the Shuttle. The two categories are: 1) measurements and predictions that are associated with residual software faults and failures, and 2) measurements and predictions that are associated with the ability of the software to complete a mission without experiencing a failure of a specified severity. In the first category are: remaining failures, maximum failures, fraction of remaining failures, and total test time required to attain a given fraction of remaining failures. In the second category are: time to next failure and total test time required to attain a given time to next failure. In addition, there is the risk associated with not attaining the required remaining failures and time to next failure goals. Lastly, there is operational quality, which is derived from fraction of remaining failures. With this type of information, a software manager can determine whether more testing is warranted or whether the software is sufficiently tested to allow its release or unrestricted use. These predictions provide a quantitative basis for achieving reliability goals [2].
1.3 Interpretations and Credibility
The two most critical factors in establishing credibility in software reliability predictions are the validation method and the way the predictions are interpreted. For example, a "conservative" prediction can be interpreted as providing an "additional margin of confidence" in the software reliability, if that predicted reliability already exceeds an established "acceptable level" or requirement. It may not be possible to validate predictions of the reliability of software precisely, but it is possible with "high confidence" to predict a lower bound on the reliability of that software within a specified environment.
If historical failure data were available for a series of previous dates (and there is actual data for the failure history following those dates), it would be possible to compare the predictions to the actual reliability and evaluate the performance of the model. Taking this approach will significantly enhance the credibility of predictions among those who must make software deployment decisions based on the predictions [9].
1.4 Verification and Validation
Software reliability measurement and prediction are useful approaches to verify and validate software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example the failure rate during operation. Measurement also provides the failure data that are used to estimate the parameters of reliability models (i.e., make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the preceding phase (e.g., reliability measurements of mission critical software components obtained during test conform to reliability specifications made during design) [5]. Validation ensures that the software product, as it exists in a given project phase, which could be the end of the project, satisfies requirements (e.g., software reliability predictions obtained during test correspond to the reliability specified in the requirements) [5].
Another way to interpret verification and validation is that they build confidence that the software is ready to be released for operational use. The release decision is crucial for systems in which software failures could endanger the safety of the mission and crew (i.e., mission critical software). To assist in making an informed decision, software risk analysis and reliability prediction are integrated and provide stopping rules for testing. This approach is applicable to all mission critical software. Improvements in the reliability of software, where the reliability measurements and predictions are directly related to mission and safety, contribute to system safety.
2. Implementing a Software Reliability Engineering Program
In broad terms, implementing a software reliability program is a two-phased process. It consists of (1) identifying the reliability goals and (2) testing the software to see whether it conforms to the goals. The reliability goals can be ideal (e.g., zero defects) but should have some basis in reality, based on tradeoffs between reliability and cost. The testing phase is more complex because it involves collecting raw defect data and using it for assessment and prediction.
The following are the major SRE steps in the recommended practice, keyed to the phases of the software development life cycle (not necessarily in chronological order):
2.1 State the Reliability Criteria (requirements analysis phase)
This might be stated, for example, as "no failure that would result in loss of life or mission".
2.2 Collect Fault and Failure Data (testing and operations phase)
For each system, there should be a brief description of its purpose and functions and the fault and failure data, as shown below. Days # could be hours or minutes, as appropriate. Code the Problem Report Identification to indicate Software (S) failure, Hardware (H) failure, or People (P) failure.
• System Identification
• Purpose
• Functions
• Days # (since start of test)
• Problem Report Identification
• Problem Severity
• Failure Date
• Module with Fault
• Description of Problem
2.3 Establish Problem Severity Levels (requirements analysis phase)
Use a problem severity classification, such as the following:
1. Loss of life, loss of mission, abort mission.
2. Degradation in performance.
3. Operator annoyance.
4. System OK, but documentation in error.
5. Error in classifying a problem (i.e., no problem existed in the first place).
Note: Not all problems result in failures.
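The record layout from step 2.2 and the severity scale above can be captured in a small data structure. This is a sketch, not a prescribed schema: the field and class names here are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    """Problem severity levels from the classification above."""
    LOSS_OF_LIFE_OR_MISSION = 1
    PERFORMANCE_DEGRADATION = 2
    OPERATOR_ANNOYANCE = 3
    DOCUMENTATION_ERROR = 4     # system OK, documentation in error
    MISCLASSIFIED = 5           # no problem existed in the first place

@dataclass
class ProblemReport:
    """One fault/failure record per step 2.2; days_since_test_start
    corresponds to the 'Days #' field."""
    system_id: str
    purpose: str
    functions: str
    days_since_test_start: int
    report_id: str              # coded S/H/P for software/hardware/people
    severity: Severity
    failure_date: str
    module_with_fault: str
    description: str

    def is_failure(self) -> bool:
        # Not all problems result in failures: documentation errors and
        # misclassified reports do not count as failures.
        return self.severity <= Severity.OPERATOR_ANNOYANCE

# Hypothetical example record (not actual Shuttle data).
r = ProblemReport("PASS", "flight control", "guidance", 12, "S-0042",
                  Severity.PERFORMANCE_DEGRADATION, "1998-07-14",
                  "nav_filter", "stale state vector after mode transition")
print(r.is_failure())  # → True
```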
Successful Application of Software Reliability: Case Study
53 5
2.4 Develop Reliability Assurance Criteria (requirements analysis phase)
Two criteria for software reliability levels will be defined. Then these criteria will be applied to the risk analysis of mission critical software. In the case of the Shuttle example, the "risk" represents the degree to which the occurrence of failures does not meet required reliability levels, regardless of how insignificant the failures may be. Although it may be counterintuitive to include minor failures in reliability assessments, in reality, doing so provides a conservative lower bound on assessment. That is, the actual reliability is highly unlikely to be lower than the assessment.
Next, a variety of equations that are used in reliability prediction and risk analysis will be defined and derived, including the relationship between time to next failure and reduction in remaining failures. Then it is shown how the prediction equations can be used to integrate testing with reliability and quality. An example is shown of how the risk analysis and reliability predictions can be used to make decisions about whether the software is ready to deploy. Note that these equations are based on the model in [9] because this model is used on the Shuttle and is one of the models recommended in the recommended practice [1]. Other models could be used, such as those in [9].
If the reliability goal is the reduction of failures of a specified severity to an acceptable level of risk [7], then for software to be ready to deploy, after having been tested for time t, it must satisfy the following criteria:
1) predicted mean number of remaining failures r(t) < rc, (1)
where rc is a specified critical value, and
2) predicted mean time to next failure TF(t) > tm, (2)
where tm is mission duration.
For systems that are tested and operated continuously like the Shuttle, t, TF(t), and tm are measured in execution time. Note that, as with any methodology for assuring software reliability, there is no guarantee that the expected level will be achieved. Rather, with these criteria, the objective is to reduce the risk of deploying the software to a "desired" level.
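Criteria (1) and (2) together amount to a simple release gate. A minimal sketch, with variable names and example numbers assumed for illustration:

```python
def ready_to_deploy(r_t: float, r_c: float,
                    tf_t: float, t_m: float) -> bool:
    """Release gate from criteria (1) and (2): predicted remaining
    failures r(t) must fall below the critical value rc AND predicted
    time to next failure TF(t) must exceed the mission duration tm."""
    return r_t < r_c and tf_t > t_m

# Hypothetical predictions: 0.6 remaining failures against a threshold
# of 1.0, next failure expected in 400 h against a 192 h mission.
print(ready_to_deploy(0.6, 1.0, 400.0, 192.0))  # → True
print(ready_to_deploy(1.4, 1.0, 400.0, 192.0))  # → False: keep testing
```

Both conditions must hold; failing either one leads to the continue-testing scenarios described in sections 2.5 and 2.6.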
2.5 Apply the Remaining Failures Criterion (testing phase)
Criterion (1) sets the threshold on remaining failures that must be satisfied in order to deploy the software (i.e., no more than a specified number of remaining failures).
If it is predicted that r(t) ≥ rc, the process is to continue to test for a time t' > t that is predicted to achieve r(t') < rc, using assumptions 1 and 2 that more failures will be experienced and more faults will be corrected, so that the remaining failures will be reduced by the quantity r(t) − r(t'). If the developer does not have the resources to satisfy the criterion or is unable to satisfy the criterion through additional testing, the risk of deploying the software prematurely should be assessed. It is known that it is impossible to demonstrate the absence of faults [3]; however, the risk of failures occurring can be reduced to an acceptable level, as represented by rc. This scenario is shown in Figure 1. In case A, r(t) < rc is predicted and the mission begins at t. In case B, r(t) ≥ rc is predicted, and the mission would be postponed until the software is tested for time t' when r(t') < rc is predicted. In both cases criterion 2) would also be required for the mission to begin.
536 Norman F. Schneidewind
Figure 1: Remaining Failures Criterion Scenario
2.6 Apply the Time to Next Failure Criterion (testing phase)
Criterion 2 specifies that the software must survive for a time greater than the
duration of the mission. If TF (t) ≤ tm is predicted, the software is tested for a time t' that
is predicted to achieve TF (t') > tm, using assumptions 1 and 2 that more failures will be
experienced and faults corrected, so that the mean time to next failure will be increased by
the quantity TF (t’) -TF (t). Again, if it is infeasible for the developer to satisfy the criterion
for lack of resources or failure to achieve test objectives, the risk of deploying the software
prematurely should be assessed. This scenario is shown in Figure 2.
Figure 2: Time to Next Failure Criterion Scenario
Successful Application of Software Reliability: Case Study 537
In case A, TF (t) > tm is predicted and the mission begins at t. In case B, TF (t) ≤ tm is
predicted, and in this case the mission would be postponed until the software is tested for
time tt’ when TF (t’) > tm is predicted. In both cases criterion 1) would also be required for
the mission to begin. If neither criterion is satisfied, the software is subjected to additional
inspection and testing, to remove more faults, until the desired level of risk is achieved.
2.7 Make a Risk Assessment (pre deployment or launch phase)
Reliability Risk pertains to executing the software of a mission critical system where
there is the chance of injury (e.g., astronaut injury or fatality), damage (e.g., destruction of
the Shuttle), or loss (e.g., loss of the mission) if a serious software failure occurs during a
mission. In the case of the Shuttle, where the occurrence of even trivial failures is rare, the
fraction of those failures that pose any reliability risk is too small to be statistically
significant. As a result, in order to have an adequate sample size for analysis, all failures
(of any severity) over the entire 20-year life of the project have been included in the failure
history database for this analysis. Therefore, the risk criterion metrics to be discussed for
the Shuttle quantify the degree of risk associated with the occurrence of any software
failure, no matter how insignificant it may be. As mentioned previously, this approach
provides a conservative lower bound to reliability predictions.
As an example, the Schneidewind Software Reliability Model (other software
reliability models could be used as well) is used to compute a parameter: fraction of
remaining failures as a function of the archived failure history during test and operation
[6]. The prediction methodology uses this parameter and other reliability quantities to
provide bounds on total test time, remaining failures, operational quality, and time to next
failure that are necessary to meet defined Shuttle software reliability levels.
The test time t can be considered a measure of the degree to which software
reliability goals have been achieved. This is particularly the case for systems like the
Shuttle where the software is subjected to continuous and rigorous testing for several years
in multiple facilities, using a variety of operational and training scenarios (e.g., by the
contractor in Houston, by NASA in Houston for astronaut training, and by NASA at Cape
Canaveral). In Figure 3, t is interpreted as an input to a risk reduction process, and r (t)
and TF (t) as the outputs, with rc and tm as risk thresholds of reliability that control the
process.
Figure 3: Risk Reduction Process
While it must be recognized that test time is not the only consideration in developing
test strategies and that there are other important factors, such as the consequences for
reliability and cost in selecting test cases [11], nevertheless, for the foregoing reasons, test
time has been found to be strongly positively correlated with reliability growth for the
Shuttle [9].
2.8 Evaluate Remaining Failures Risk (pre deployment or launch phase)
To obtain the mean value of the risk criterion metric (RCM) in equation (4), first,
the mean remaining failures must be predicted in equation (3).
r(t) = (α/β) exp[–β(t – s + 1)] (3)
Then, the mean value of the risk criterion metric (RCM) for criterion 1 is formulated
as follows: RCM r(t) = (r(t) – rc) / rc = (r(t) / rc) – 1 (4)
Equation (4) is plotted in Figure 4 as a function of t for rc = 1, for the Shuttle software
release OID, a software system comprised of modules and configured from a series of
builds to meet Shuttle mission functional requirements, where positive, zero, and negative
values correspond to r(t) > rc, r(t) = rc, and r(t) < rc, respectively.
Figure 4: RCM for Remaining Failures, (rc = 1), OID
In Figure 4, these values correspond to the following regions: above the X-axis
predicted remaining failures are greater than the specified value; on the X-axis predicted
remaining failures are equal to the specified value; and below the X-axis predicted
remaining failures are less than the specified value, which could represent a “safe”
threshold or in the Shuttle example, an “error-free” condition boundary. In the example it
can be seen that at t = 80 the risk transitions from the high risk region to the low risk
region.
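The sign convention just described can be sketched in a few lines (the sample values are hypothetical):

```python
# Risk criterion metric for remaining failures, equation (4): RCM = r(t)/rc - 1.
# Positive -> above the X-axis (r(t) > rc, critical region); zero -> on the axis;
# negative -> below the axis (r(t) < rc, desired region).

def rcm_remaining_failures(r_t: float, r_c: float = 1.0) -> float:
    return (r_t / r_c) - 1.0

for r_t in (3.0, 1.0, 0.4):
    rcm = rcm_remaining_failures(r_t)
    region = "critical" if rcm > 0 else ("boundary" if rcm == 0 else "desired")
    print(r_t, round(rcm, 2), region)
```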
2.9 Evaluate Time to Next Failure Risk (pre deployment or launch phase)
The mean value of the risk criterion metric (RCM) for criterion 2 is formulated as
follows: RCM TF (t) = (tm – TF (t)) / tm = 1 – (TF (t) / tm) (5)
Equation (5) is plotted in Figure 5 as a function of test time t for tm = 8 thirty-day
intervals, for OID, where there is high risk for TF(t) < tm. Once TF(t) > tm, the risk is low.
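The criterion-2 metric admits the same kind of sketch (hypothetical values; tm is in the same time units as TF(t)):

```python
# Risk criterion metric for time to next failure: RCM = 1 - TF(t)/tm.
# Positive (TF(t) < tm): the software is not yet predicted to survive the
# mission (critical region); negative (TF(t) > tm): low risk (desired region).

def rcm_time_to_next_failure(tf_t: float, t_m: float) -> float:
    return 1.0 - (tf_t / t_m)

print(rcm_time_to_next_failure(4.0, 8.0))    # positive: critical region
print(rcm_time_to_next_failure(16.0, 8.0))   # negative: desired region
```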
Figure 5: RCM for Time to Next Failure (tm = 8 days) OIC
3. Make Reliability Predictions (test and operations phases)
In order to support the reliability goal and to assess the risk of deploying the
software, various reliability and quality predictions are made during the test phase to
validate that the software meets requirements. For example, suppose the software
reliability requirements state the following: 1) ideally, after testing the software for time t,
the mean predicted remaining failures shall be less than one; 2) if the ideal of 1) cannot be
achieved due to cost and schedule constraints, mean time to next failure, predicted after
testing for time t, shall exceed the mission duration; and 3) the risk of not meeting 1) and
2) shall be assessed.
3.1 Additional Risk Evaluation (test and operations phases)
In addition to remaining failures and time to failure risk, which have already been
discussed, various other predictions are made in order to provide a comprehensive
assessment of risk. These predictions are based on the Schneidewind Software Reliability
Model [1, 8, 9, 10]. Again, other models recommended in the Recommended Practice for
Software Reliability [1] could be used. The Statistical Modeling and Estimation of
Reliability Functions for Software (SMERFS) [4] tool is used to support predictions.
In the following equations, parameter α is the failure rate at the beginning of
interval s; β is the negative of the derivative of the failure rate divided by the failure rate
(i.e., the relative failure rate); t is test time or the last interval of observed failure data; s is
the starting interval for using observed failure data in parameter estimation that provides
the best estimates of α and β and the most accurate predictions [8]; and Xs-1 is the
observed failure count in the range [1, s-1].
Cumulative Failures: When estimates are obtained for the parameters α and β, with s as
the starting interval for using observed failure data, the predicted failure count in the range
[1,t] is obtained (i.e., cumulative failures) [6]:
F (t)=(α/β)[1-exp (-β ((t-s+1)))]+Xs-1 (6)
Figure 6 provides risk reduction in the sense that the predicted cumulative failures
provide an upper bound on the actual failures (i.e., there is assurance that the actual
failures will not exceed the predicted values). In addition, risk is mitigated by the fact that
the predictions increase at an increasing rate. Also shown in this figure is the mean relative
error (MRE) between actual and predicted values. The MRE is high because the
predictions are consistently higher than the actual values.
Figure 6: Total Test Time and Remaining Failures vs. Fraction Remaining Failures, OIA
Maximum Failures: Let t→∞ in equation (6) and obtain the predicted failure count
in the range [1,∞] (i.e., maximum failures over the life of the software):
F (∞) = α/β+Xs-1 (7)
Applying equation (7), the predicted maximum failures = 18.4706. Thus, there is low
risk that the actual cumulative failures will exceed this value.
Fraction of Remaining Failures: If equation (3) is divided by equation (7), fraction of
remaining failures, predicted at time t is obtained:
p(t)= r(t) /F(∞) (8)
According to the manager of Shuttle software development, equation (8) is an
excellent management tool for providing confidence that the software is ready to deploy,
as the fraction remaining failures becomes miniscule, with increasing testing, as Figure 7
attests [5].
Figure 7: Operational Quality (Fraction Fault Removal) vs. Total Test Time, OIA
Operational Quality: The operational quality of software is the complement of p(t). It is
the degree to which software is free of remaining faults (failures), using the assumption 1
that the faults that cause failures are removed. It is predicted at time t as follows:
Q (t) = 1-p (t) (9)
This risk metric is useful because some software engineers and managers would
prefer to see things in a positive light — quality growth. Figure 7 demonstrates that after t =
100 the improvement in quality becomes miniscule, and the cost to remove additional
faults would be significant. Thus this figure provides metrics for risk assessment and a
stopping rule for when to terminate testing.
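The prediction quantities used in this section can be tied together in a short sketch. The parameter values below are hypothetical, for illustration only; in practice α and β are estimated from the observed failure history (e.g., with the SMERFS tool):

```python
import math

# Sketch of the prediction equations (3), (6), (7), (8), (9).
alpha, beta = 2.0, 0.1   # initial failure rate and relative failure rate (hypothetical)
s, X_s1 = 5, 3.0         # starting interval and observed failures in [1, s-1] (hypothetical)

def remaining_failures(t):            # r(t), equation (3)
    return (alpha / beta) * math.exp(-beta * (t - s + 1))

def cumulative_failures(t):           # F(t), equation (6)
    return (alpha / beta) * (1 - math.exp(-beta * (t - s + 1))) + X_s1

max_failures = alpha / beta + X_s1    # F(inf), equation (7)

def fraction_remaining(t):            # p(t), equation (8)
    return remaining_failures(t) / max_failures

def operational_quality(t):           # Q(t), equation (9)
    return 1 - fraction_remaining(t)

# Self-consistency: remaining failures = maximum failures - cumulative failures.
t = 40
assert abs(remaining_failures(t) - (max_failures - cumulative_failures(t))) < 1e-9
print(round(remaining_failures(t), 3), round(operational_quality(t), 3))
```

As testing time t grows, r(t) decays toward zero and Q(t) approaches 1, which is the quality-growth view Figure 7 illustrates.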
Total Test Time to Achieve Specified Remaining Failures. The predicted test time
required to achieve a specified number of remaining failures at t, r (t), is obtained from
equation (3) by solving for t:
t = [β(s – 1) – log(β r(t)/α)] / β (10)
Equation (10) is another risk reduction metric based on the concept that the
predicted test time to achieve a specified number of remaining failures reveals how much
test time and effort would be required to achieve various levels of risk, as represented by
specified remaining failures, as shown in Figure 8, where, naturally, the test time and cost
becomes significantly high in order to achieve significant reductions in risk.
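Equation (10) can be recovered by solving the remaining-failures expression of equation (3) for t; a short check of the algebra, using the same symbols:

```latex
r(t) = \frac{\alpha}{\beta}\, e^{-\beta (t - s + 1)}
\;\Rightarrow\;
e^{-\beta (t - s + 1)} = \frac{\beta\, r(t)}{\alpha}
\;\Rightarrow\;
-\beta (t - s + 1) = \log\!\left(\frac{\beta\, r(t)}{\alpha}\right)
\;\Rightarrow\;
t = \frac{\beta (s - 1) - \log\!\left(\beta\, r(t)/\alpha\right)}{\beta}
```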
3.2 Interpret Software Reliability Predictions (pre deployment or launch phase)
Successful use of statistical modeling in predicting the reliability of a software
system requires a thorough understanding of precisely how the resulting predictions are to
be interpreted and applied [9]. The Shuttle software (430 KLOC) is frequently modified,
at the request of NASA, to add or change capabilities using a constantly improving
process.
Figure 8: Launch Decision: Remaining Failures vs. Total Test Time, OIA
Each of these successive versions constitutes an upgrade to the preceding software
version. Each new version of the software (designated as an Operational Increment, OI)
contains software code that has been carried forward from each of the previous versions
(“previous-version subset”) as well as new code generated for that new version (“new-
version subset”). We have found that by applying a reliability model independently to the
code subsets we can obtain satisfactory composite predictions for the total version [9].
It is essential to recognize that this approach requires a very accurate code change
history so that every failure can be uniquely attributed to the version in which the defective
line(s) of code were first introduced. In this way, it is possible to build a separate failure
history for the new code in each release. To apply SRE to a software system, it should be
broken down into smaller elements to which a reliability model can be more
accurately applied. This approach has been successfully applied to predict the reliability of
the Shuttle software for NASA [9].
3.3 Use Software Reliability Tools (test and operations phases)
It is infeasible to do large-scale reliability prediction by hand. Therefore, there are
software reliability tools available to make the model predictions easier to achieve. The
Statistical Modeling and Estimation of Reliability Functions for Software (SMERFS) is a
software package available for this purpose [4]. However, it is important for the user to
understand the capabilities, applicability, and limitations of such tools.
4. Lessons Learned
Several important lessons have been learned from the experience of twenty years in
developing and maintaining the Shuttle software, which you could consider for adoption in
your SRE process:
1) No one SRE process method is the “silver bullet” for achieving high reliability.
Various methods, including formal inspections, failure modes analysis, verification
and validation, testing, statistical process control, risk analysis, and reliability
modeling and prediction must be integrated and applied.
2) The process must be continually improved and upgraded. For example, recent
experiments with software metrics have demonstrated the potential of using metrics as
early indicators of future reliability problems. This approach, combined with
inspections, allows many reliability problems to be identified and resolved before
testing.
3) The process must have feedback loops so that information about reliability
problems discovered during inspection and testing is fed back not only to
requirements analysis and design for the purpose of improving the reliability of future
products but also to the requirements analysis, design, inspection and testing
processes themselves. In other words, the feedback is designed to improve not only
the product but also the processes that produce the product.
4) Given the current state-of-the-practice in software reliability modeling and
prediction, practitioners should not view reliability models as having the ability to
make highly accurate predictions of future software reliability. Rather, software
managers should interpret these predictions in two significant ways: a) providing
increased confidence, when used as part of an integrated SRE process, that the
software is safe to deploy; and b) providing bounds on the reliability of the deployed
software (e.g., high confidence that in operation the time to next failure will exceed
the predicted value and the predicted value will safely exceed the mission duration).
5. Conclusions
We showed how software reliability predictions can increase confidence in the
reliability of mission critical software such as the NASA Space Shuttle Primary Avionics
Software System. These results are applicable to other mission critical software.
Remaining failures, maximum failures, total test time required to attain a given fraction of
remaining failures, and time to next failure were shown to be useful reliability
measurements and predictions for: 1) providing confidence that the software has achieved
reliability goals; 2) rationalizing how long to test a piece of software; and 3) analyzing the
risk of not achieving remaining failure and time to next failure goals. Having predictions
of the extent to which the software is not fault free (remaining failures) and whether it is
likely to survive a mission (time to next failure) provides criteria for assessing the risk of
deploying the software. Furthermore, fraction of remaining failures can be used both as an
operational quality goal in predicting total test time requirements and, conversely, as an
indicator of operational quality as a function of total test time expended.
Software reliability engineering is a tool that software managers can use to provide
confidence that the software meets reliability goals.
References
[1]. IEEE/AIAA P1633™, Recommended Practice on Software Reliability, June 2008.
[2]. Billings C., J. Clifton, B. Kolkhorst, E. Lee, and W.B. Wingert. Journey to a Mature
Software Process. IBM Systems Journal 1994; 33 (1): 46-61.
[3]. Dijkstra E. Structured Programming, Software Engineering Techniques. eds. J. N.
Buxton and B. Randell, NATO Scientific Affairs Division, Brussels 39, Belgium April
1970 : 84-88.
[4]. Farr W. and O. Smith. Statistical Modeling and Estimation of Reliability Functions for
Software (SMERFS) Users Guide. NAVSWC TR-84-373, Revision 3, Naval Surface
Weapons Center, Revised September 1993.
[5]. IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12-1990.
The Institute of Electrical and Electronics Engineers, New York, New York, March 30, 1990.
[6]. Keller T., N. Schneidewind, and P. Thornton. Predictions for Increasing Confidence in
the Reliability of the Space Shuttle Flight Software. Proceedings of the AIAA
Computing in Aerospace 10, San Antonio, TX, March 28, 1995: 1-8.
[7]. Schneidewind N. Reliability Modeling for Safety Critical Software, IEEE Transactions
on Reliability March 1997; 46(1):88-98.
[8]. Schneidewind N. Software Reliability Model with Optimal Selection of Failure Data.
IEEE Transactions on Software Engineering November 1993;19(11):1095-1104.
[9]. Schneidewind N. and T. Keller. Application of Reliability Models to the Space Shuttle.
IEEE Software July 1992; 9(4): 28-33.
[10]. Schneidewind N. Analysis of Error Processes in Computer Software. Proceedings of the
International Conference on Reliable Software, IEEE Computer Society, 21-23 April
1975:337-346.
[11]. Weyuker E. Using the Consequences of Failures for Testing and Reliability Assessment,
Proceedings of the Third ACM SIGSOFT Symposium on the Foundations of Software
Engineering, Washington, D.C., October 10-13, 1995:81-91.
Bibliography
1. Boehm B. Software Risk Management: Principles and Practices. IEEE Software
January 1991; 8(1): 32-41.
2. Dalal S. and A. McIntosh. When to Stop Testing for Large Software Systems with
Changing Code. IEEE Transactions on Software Engineering April 1994; 20(4):
318-323.
3. Dalal S. and A. McIntosh. Some Graphical Aids for Deciding When to Stop
Testing. IEEE Journal on Selected Areas in Communications February 1990;
8(2):169-175.
4. Ehrlich W., B. Prasanna, John Stampfel, and Jar Wu. Determining the Cost of a
Stop-Test Decision. IEEE Software March 1993; 10(2): 33-42.
5. Keller T. and N. Schneidewind. A Successful Application of Software Reliability
Engineering for the NASA Space Shuttle. Software Reliability Engineering Case
Studies. International Symposium on Software Reliability Engineering,
Albuquerque, New Mexico, November 4, 1997: 71-82.
6. Leveson N. Software Safety: What, Why, and How. ACM Computing Surveys
June 1986; 18(2):125-163.
7. Lyu M. (Editor-in-Chief), Handbook of Software Reliability Engineering.
Computer Society Press, Los Alamitos, CA and McGraw-Hill, New York, NY,
1995.
8. Musa J. and A. Ackerman. Quantifying Software Validation: When to Stop
Testing? IEEE Software May 1989; 6(3):19-27.
9. Musa John D., Anthony Iannino, and Kazuhira Okumoto. Software Reliability:
Measurement, Prediction, and Applications. McGraw-Hill, New York 1987.
10. Nikora A., N. Schneidewind, and J. Munson. Practical Issues In Estimating Fault
Content And Location In Software Systems. Proceedings of the AIAA Space
Technology Conference and Exposition, Albuquerque, NM, Sep 29-30, 1999.
11. Nikora A., N. Schneidewind, and J. Munson. IV&V Issues in Achieving High
Reliability and Safety in Critical Control Software. Final Report, Volume 1 –
Measuring and Evaluating the Software Maintenance Process and Metrics-Based
Software Quality Control, Volume 2 – Measuring Defect Insertion Rates and
Risk of Exposure to Residual Defects in Evolving Software Systems, and Volume
3 – Appendices, Jet Propulsion Laboratory, National Aeronautics and Space
Administration, Pasadena, California, January 19, 1998.
12. A. Nikora, N. Schneidewind, and J. Munson. IV&V Issues in Achieving High
Reliability and Safety in Critical Control System Software. Proceedings of the
Third International Society of Science and Applied Technologies Conference on
Quality in Design, Anaheim, California, March 12-14, 1997: 25-30.
13. Schneidewind N. Measuring and Evaluating Maintenance Process Using
Reliability, Risk, and Test Metrics. IEEE Transactions on Software Engineering
November/December 1999; 25(6): 768-781.
14. Schneidewind N. Software Validation for Reliability. Wiley Encyclopedia of
Electrical and Electronics Engineering, John G. Webster, editor, John Wiley &
Sons, Inc., 1999;19: 607-618.
15. Schneidewind N. Reliability Modeling for Safety Critical Software. IEEE
Transactions on Reliability March 1997; 46(1):88-98.
16. Singpurwalla N. Determining an Optimal Time Interval for Testing and
Debugging Software. IEEE Transactions on Software Engineering April 1991;
17(4): 313-319.
17. Voas J. and K. Miller. Software Testability: The New Verification. IEEE
Software May 1995; 12(3):17-28.
Norman F. Schneidewind, Ph.D., is Professor Emeritus of Information Sciences in the
Department of Information Sciences and the Software Engineering Group at the Naval
Postgraduate School. He is now doing research and publishing in software reliability and
metrics with his consulting company Computer Research. Dr. Schneidewind is a Fellow of
the IEEE, elected in 1992 “for contributions to software measurement models in reliability
and metrics, and for leadership in advancing the field of software maintenance”. In
2001, he received the IEEE Reliability Engineer of the Year award from the IEEE
Reliability Society. In 1993 and 1999, he received awards for Outstanding Research
Achievement by the Naval Postgraduate School.
Dr. Schneidewind was selected for an IEEE USA Congressional Fellowship for
2005 and worked with the Committee on Homeland Security and Government Affairs,
United States Senate, focusing on homeland security, cyber security, and privacy. In
March, 2006, he received the IEEE Computer Society Outstanding Contribution Award
for “outstanding technical and leadership contributions as the Chair of the Working Group
revising IEEE Standard 982.1”.
He is the developer of the Schneidewind software reliability model that was used by
NASA to assist in the prediction of software reliability of the Space Shuttle, by the Naval
Surface Warfare Center for Tomahawk cruise missile launch and Trident software
reliability prediction, and by the Marine Corps Tactical Systems Support Activity for
distributed system software reliability assessment and prediction. This model is
recommended by the IEEE and the American Institute of Aeronautics and Astronautics
Recommended Practice for Software Reliability. In addition, the model is implemented in
the Statistical Modeling and Estimation of Reliability Functions for Software (SMERFS),
software reliability-modeling tool. 54 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
practice
I A
G
E Y I E S A V L A T H E H E T E R O G E N E I T Y, C O M P L E X I T Y, and scale of cloud Unfortunately, the search space of distinct fault and Jepsen experts, must study the sys- This article presents a call to arms The Future Is Disorder Abstracting D O I : 1 0 . 1 1 4 5 / 3 1 5 2 4 8 3
Article development led by Ordinary users need tools that automate the BY PETER ALVARO AND SEVERINE TYMON http://dx.doi.org/10.1145/3152483 J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 55 56 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
practice up the stack and frustrate any attempts The Old Guard. The modern myth: Unfortunately, this, too, is a pipe Finally, even if you assume that spec- The Vanguard. The emerging ethos: approaches that combine testing with Here, we describe the underlying The Old Gods. The ancient myth: This has been a reasonable dream. Unfortunately, these approaches In a distributed system—that is, a While the state J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 57
practice are simply too large, too heteroge- Two giants have recently emerged Both approaches are pragmatic and Unfortunately, both techniques understanding of the idiosyncrasies Jepsen is in principle a framework A human in the loop is the kiss of We Don’t Need Another Hero We present our vision of automated We argue the best way to automate the The order is rapidly fadin.’ For large- What hope is there of understand- Regarding testing distributed systems. —Commentator on HackerRumor 58 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
practice of properties that are either maintained By viewing distributed systems in Step 3: Formulate experiments that Carrying out the experiments turns Step 4. Profit! This process can be ef- Away from the experts. While this The first step to understanding how Step 1: Observe the system in action. A Chaos Engineer will, after study- services.25 To understand the high- A Jepsen superuser typically begins The first step to understanding what Step 2. Build a mental model of how Fault tolerance is redundancy. Giv- Figure 1. Our vision of automated failure explanations of fault
injection
Figure 2. Fault injection and fault-tolerant code.
APP1 APP1 APP2 APP2 fault callee
API API API API API J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 59
practice ability infrastructure and fault injec- A Blast from the Past At its heart, LDFI reapplies well- The idea seems far-fetched, but the (manually) identified by Kingsbury.30 Rumors from the Future Don’t overthink fault injection. In the Consider Figure 2: The diagram on The common effect of all faults, from Explanations everywhere. If we can The rapid evolution 60 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
practice to embrace (rather than abstracting Distributed systems are probabi- Turning the explanations inside Ideally, explanations should play a This line of research can be pushed of redundancy. Unfortunately, a bar- Moreover, the container revolution We are also interested in the pos- Toward better models. The LDFI A shortcoming of the LDFI approach The container J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 61
practice 36. Matloff, N., Salzman, P.J. The Art of Debugging with 37. Meliou, A., Suciu, D. Tiresias: The database oracle for 38. Microsoft Azure Documentation. Introduction to the 39. Musuvathi, M. et al. CMC: A pragmatic approach to 40. Musuvathi, M. et al. Finding and reproducing 41. Newcombe, C. et al. Use of formal methods at 42. Olston, C., Reed, B. Inspector Gadget: A framework 43. OpenTracing. 2016; http://opentracing.io/. CamFlow: Managed data-sharing for cloud services, 45. Patterson, D.A., Gibson, G., Katz, R.H. A case for 46. Ramasubramanian, K. et al. Growing a protocol. In 47. Reinhold, E. Rewriting Uber engineering: The 48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end 49. Sandberg, R. The Sun network file system: design, 50. Shkuro, Y. Jaeger: Uber’s distributed tracing system. 51. Sigelman, B.H. et al. Dapper, a large-scale distributed 52. Shenoy, A. A deep dive into Simoorg: Our open source 53. Yang, J. et al.L., Zhou, L. MODIST: Transparent 54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ 55. Zhao, X. et al. Lprof: A non-intrusive request flow Peter Alvaro is an assistant professor of computer Severine Tymon is a technical writer who has written Copyright held by owners/authors. comes,10 then the root cause of the dis- Conclusion To address this critical shortcom- Related articles Fault Injection in Production The Verification of a Distributed System Injecting Errors for Fun and Profit References at Internet scale. In Proceedings of the 7th ACM 2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven 3. Andrus, K. Personal communication, 2016. Twitter Engineering; https://blog.twitter.com/2012/ 5. Barth, D. Inject failure to make your systems more 6. Basiri, A. et al. Chaos Engineering. IEEE Software 33, 3 7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site 8. Birrell, A.D., Nelson, B.J. Implementing remote 9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest 10. Chen, A. et al. The good, the bad, and the differences: 11. 
Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. 12. Chow, M. et al. The Mystery Machine: End-to-end 13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of 14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: A 15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility 16. Fisman, D., Kupferman, O., Lustig, Y. On verifying 17. Gopalani, N., Andrus, K., Schmaus, B. FIT: Failure 18. Gray, J. Why do computers stop and what can 19. Gunawi, H.S. et al. FATE and DESTINI: A framework 20. Holzmann, G. The SPIN Model Checker: Primer and 21. Honeycomb. 2016; https://honeycomb.io/. Spark. In Proceedings of the VLDB Endowment 9, 33 23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. 24. Jepsen. Distributed systems safety research, 2016; 25. Jones, N. Personal communication, 2016. org/08/documentation.html. A flexible software-based fault and error injection 28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note 29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, 30. Kingsbury, K. Call me maybe: Kafka, 2013; http:// 31. Kingsbury, K. Personal communication, 2016. Gremlin Inc., 2017; https://blog.gremlininc.com/the- 33. Lampson, B.W. Atomic transactions. In Distributed 34. LightStep. 2016; http://lightstep.com/. general library-level fault injector. In IEEE/IFIP Copyright of Communications of the ACM is the property of Association for Computing contributed articles
142 Communications of the ACM | January 2010 | Vol. 53 | No. 1
doi:10.1145/1629175.1629209
By Paul D. Witman and Terry Ryan

Many organizations are successful with software reuse at fine to medium granularities.

Table 1. Selected reuse results

Project                                                    Reused in business units
System Infrastructure – Consumer Internet banking          All users of BTC's legacy Internet banking
System Infrastructure – Internet banking, Small Business   Approximately 4 business units worldwide
Internet banking – Europe                                  > 15 business units
Internet banking – Asia                                    > 10 business units
Internet banking – Latin America                           > 6 business units
Internet banking – North America                           > 4 business units
Internet banking north america > 4 business units contributed articles 144 c o m m u n i c at i o n s o f t h e a c m | J a n u a r y 2 0 1 0 | v o l . 5 3 | n o . 1
ularities led to a culture that promoted Starting in late 2002, BTC developed The JBT infrastructure and appli- any other organization with the track The requirements for JBT called Each of these components was de- Such variability was planned for in JBT’s initial high-level requirements
documents included requirements One of BigFinancial’s regional tech- From an economic viewpoint, BigFi- All core banking functionality is BTC implemented JBT on principles figure 1. Java banking toolkit architecture overview
table 2. Jbt reuse results
region business units
Europe > 18 business units
Asia > 14 business units
Latin America > 9 business units
North America > 5 business units J a n u a r y 2 0 1 0 | v o l . 5 3 | n o . 1 | c o m m u n i c at i o n s o f t h e a c m 145
Figure 2. Reuse expectations and outcomes

Figure 3. Reuse cost savings ranges
Paul D. Witman (pwitman@callutheran.edu) is an … Terry Ryan (Terry.Ryan@cgu.edu) is an Associate …
© 2010 ACM 0001-0782/10/0100 $10.00

References
1. … programs fail? IEEE Software 11, 5, 114–115.
2. … Lines: Practices and Patterns. Addison-Wesley.
3. Gallivan, M.J. Organizational adoption and assimilation …
4. Griss, M.L. Software reuse: From library to factory. …
5. Karlsson, E.-A. Software Reuse: A Holistic Approach. …
6. Krueger, C.W. New methods in software product line …
7. Malan, R. and Wentzel, K. Economics of Software …
8. Morisio, M., Ezran, M. and Tully, C. Success and failure …
9. Ramachandran, M. and Fleischer, W. Design for large …
10. Ring, P.S. and Van de Ven, A.H. Developmental …
11. Sabherwal, R. The Role of Trust in Outsourced IS …
12. Szyperski, C., Gruntz, D. and Murer, S. Component …
13. Witman, P. and Ryan, T. Innovation in large-grained …

Copyright of Communications of the ACM is the property of the Association for Computing Machinery and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.

54 Communications of the ACM | November 2009 | Vol. 52 | No. 11
practice
You Don't Know Jack about Software Maintenance
doi:10.1145/1592761.1592777
Article development led by …
Long considered an afterthought, software …
By Paul Stachour and David Collier-Brown

Everyone knows maintenance is difficult and boring, and therefore avoids doing it. It doesn't help that:
"No one needs to do maintenance—that's a waste of …"
"Get the software out now; we can decide what its …"
"Do the hardware first, without thinking about the …"
"Don't allow any room or facility for expansion. You …"
These statements are a fair description of development during the last boom, and not too far …

What is software maintenance?
Four approaches to maintenance are discussed: traditional (or "everyone's first …"), never, discrete, and continuous change.

Real-world structure for managing interface changes:

struct item_loc_t {
    …
};

Who does it right? With Multics, the developers did …; all the parameters were versioned … This meant that many different … An example of a structure used by …: char country_code[4] … To identify this, the company incremented … In a more modern language, you …

1. Update the server.
2. … the three border-state warehouses.
3. Deploy updated clients to those …
4. Update all of the U.S.-based clients.

Using this approach, there is never … Once the client updates have occurred, … scheduled around a business's convenience …

Modern examples: … networking …; object languages can also support …; with AJAX, a reasonably small client … would need only a simple version- … An elegant modern form of continuous … Another elegant mechanism is a …

Maintenance isn't hard, it's easy. Even when all the programs were on … Of course, the team did have to …; their management had to manage … The first time the team avoided a … Maintenance really is easy.

Related articles:
The Meaning of Maintenance
The Long Road to 64 Bits
A Conversation with David Brown

Paul Stachour is a software engineer equally at home …
David Collier-Brown is an author and systems …
© 2009 ACM 0001-0782/09/1100 $10.00
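The article's versioned-parameter technique (increment a version number when a field such as country_code changes shape, and let the server understand every version still in the field) can be sketched with a version-tagged wire format. The field layout below is our guess at the article's example, not its actual code: we assume a v1 record carried a 3-character country code plus a NUL pad, and v2 widened the field to 4 characters.

```python
import struct

# Hypothetical wire format: one version byte, then that version's payload.
# "=" forces standard sizes with no alignment padding.
V1 = struct.Struct("=B3sx")   # version, country_code[3], one pad byte
V2 = struct.Struct("=B4s")    # version, country_code[4]

def pack_v2(country):
    """New clients send the widened v2 layout."""
    return V2.pack(2, country.encode("ascii"))

def unpack(buf):
    """A server that understands both versions keeps old clients working
    while new clients use the wider field."""
    version = buf[0]
    if version == 1:
        _, code = V1.unpack(buf[:V1.size])
    elif version == 2:
        _, code = V2.unpack(buf[:V2.size])
    else:
        raise ValueError(f"unknown version {version}")
    return version, code.rstrip(b"\x00").decode("ascii")
```

Because every record declares its version, the staged deployment above never requires a flag day: the server is updated first, and old and new clients can coexist until the last one is migrated.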
[Flowchart: End Test, End, Continue.]
[Figure: estimated failure intensity r(t) versus total test time (30-day intervals). Caption: interval s; parameter β is the negative of the derivative of failure rate divided by failure …; … the observed failure count in the range [1, s−1]; Xs,t is the observed failure count in the range [s, t]; and Xt = Xs−1 + Xs,t. Failures are counted against operational increments (OIs).]
applications make verification of their fault tolerance
properties challenging. Companies are moving away
from formal methods and toward large-scale testing
in which components are deliberately compromised
to identify weaknesses in the software. For example,
techniques such as Jepsen apply fault-injection testing
to distributed data stores, and Chaos Engineering
performs fault injection experiments on production
systems, often on live traffic. Both approaches have
captured the attention of industry and academia alike.
combinations that an infrastructure can test is
intractable. Existing failure-testing solutions require
skilled and intelligent users who can supply the faults
to inject. These superusers, known as Chaos Engineers, study the sys-
tems under test, observe system execu-
tions, and then formulate hypotheses
about which faults are most likely to
expose real system-design flaws. This
approach is fundamentally unscal-
able and unprincipled. It relies on the
superuser’s ability to interpret how
a distributed system employs redun-
dancy to mask or ameliorate faults
and, moreover, the ability to recognize
the insufficiencies in those redundan-
cies—in other words, human genius.
This article is a call to arms for the distributed systems research
community to improve the state of
the art in fault tolerance testing.
Ordinary users need tools that au-
tomate the selection of custom-tai-
lored faults to inject. We conjecture
that the process by which superusers
select experiments—observing execu-
tions, constructing models of system
redundancy, and identifying weak-
nesses in the models—can be effec-
tively modeled in software. The ar-
ticle describes a prototype validating
this conjecture, presents early results
from the lab and the field, and identi-
fies new research directions that can
make this vision a reality.
Providing an “always-on” experience
for users and customers means that
distributed software must be fault tol-
erant—that is to say, it must be writ-
ten to anticipate, detect, and either
mask or gracefully handle the effects
of fault events such as hardware fail-
ures and network partitions. Writing
fault-tolerant software—whether for
distributed data management systems
involving the interaction of a handful
of physical machines, or for Web ap-
plications involving the cooperation of
tens of thousands—remains extremely
difficult. While the state of the art in
verification and program analysis con-
tinues to evolve in the academic world,
the industry is moving very much in
the opposite direction: away from for-
mal methods (however, with some
noteworthy exceptions,41) and toward
Abstracting the Geniuses Away from Failure Testing
queue.acm.org
Formally verified distributed compo-
nents. If we cannot rely on geniuses to
hide the specter of partial failure, the
next best hope is to face it head on,
armed with tools. Until quite recently,
many of us (academics in particular)
looked to formal methods such as
model checking16,20,29,39,40,53,54 to assist
“mere mortal” programmers in writ-
ing distributed code that upholds its
guarantees despite pervasive uncer-
tainty in distributed executions. It is
not reasonable to exhaustively search
the state space of large-scale systems
(one cannot, for example, model
check Netflix), but the hope is that
modularity and composition (the next
best tools for conquering complexity)
can be brought to bear. If individual
distributed components could be
formally verified and combined into
systems in a way that preserved their
guarantees, then global fault toler-
ance could be obtained via composi-
tion of local fault tolerance.
dream. Most model checkers require
a formal specification; most real-world
systems have none (or have not had one
since the design phase, many versions
ago). Software model checkers and oth-
er program-analysis tools require the
source code of the system under study.
The accessibility of source code is also
an increasingly tenuous assumption.
Many of the data stores targeted by
tools such as Jepsen are closed source;
large-scale architectures, while typical-
ly built from open source components,
are increasingly polyglot (written in a
wide variety of languages).
ifications or source code are available,
techniques such as model checking are
not a viable strategy for ensuring that
applications are fault tolerant because,
as mentioned, in the context of time-
outs, fault tolerance itself is an end-to-
end property that does not necessarily
hold under composition. Even if you
are lucky enough to build a system out
of individually verified components, it
does not follow the system is fault toler-
ant—you may have made a critical error
in the glue that binds them.
YOLO. Modern distributed systems
fault injection.
causes of this trend, why it has been
successful so far, and why it is doomed
to fail in its current practice.
Leave it to the experts. Once upon a
time, distributed systems researchers
and practitioners were confident that
the responsibility for addressing the
problem of fault tolerance could be
relegated to a small priesthood of ex-
perts. Protocols for failure detection,
recovery, reliable communication,
consensus, and replication could be
implemented once and hidden away
in libraries, ready for use by the layfolk.
After all, abstraction is the best tool
for overcoming complexity in com-
puter science, and composing reliable
systems from unreliable components
is fundamental to classical system
design.33 Reliability techniques such
as process pairs18 and RAID45 dem-
onstrate that partial failure can, in
certain cases, be handled at the low-
est levels of a system and successfully
masked from applications.
rely on failure detection. Perfect failure
detectors are impossible to implement
in a distributed system,9,15 in which it
is impossible to distinguish between
delay and failure. Attempts to mask
the fundamental uncertainty arising
from partial failure in a distributed
system—for example, RPC (remote
procedure calls8) and NFS (network file
system49)—have met (famously) with
difficulties. Despite the broad consen-
sus that these attempts are failed ab-
stractions,28 in the absence of better
abstractions, people continue to rely
on them to the consternation of devel-
opers, operators, and users.
system of loosely coupled components
interacting via messages—the failure
of a component is only ever manifested
as the absence of a message. The only
way to detect the absence of a message
is via a timeout, an ambiguous signal
that means either the message will nev-
er come or that it merely has not come
yet. Timeouts are an end-to-end con-
cern28,48 that must ultimately be man-
aged by the application. Hence, partial
failures in distributed systems bubble
neous, and too dynamic for these
classic approaches to software qual-
ity to take root. In reaction, practitio-
ners increasingly rely on resiliency
techniques based on testing and fault
injection.6,14,19,23,27,35 These “black box”
approaches (which perturb and ob-
serve the complete system, rather
than its components) are (arguably)
better suited for testing an end-to-
end property such as fault tolerance.
Instead of deriving guarantees from
understanding how a system works
on the inside, testers of the system
observe its behavior from the outside,
building confidence that it functions
correctly under stress.
in this space: Chaos Engineering6 and
Jepsen testing.24 Chaos Engineering,
the practice of actively perturbing pro-
duction systems to increase overall site
resiliency, was pioneered by Netflix,6
but since then LinkedIn,52 Microsoft,38
Uber,47 and PagerDuty5 have developed
Chaos-based infrastructures. Jepsen
performs black box testing and fault
injection on unmodified distributed
data management systems, in search
of correctness violations (for example,
counterexamples that show an execu-
tion was not linearizable).
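Jepsen's real checker is far more sophisticated, but the core idea of "searching for a correctness violation" can be sketched in a few lines: a history of reads and writes on a single register is linearizable if some total order of its operations both respects real-time precedence and obeys register semantics. The names below are illustrative, not Jepsen's API, and the brute-force search is exponential, suitable only for tiny histories.

```python
from itertools import permutations

def legal_sequential(order):
    # Register semantics: a read must return the most recent write (None if no write yet).
    current = None
    for (_proc, kind, value, _inv, _res) in order:
        if kind == "write":
            current = value
        elif value != current:
            return False
    return True

def respects_real_time(order):
    # If op a completed before op b was invoked, a must precede b in the order.
    return all(not (b[4] < a[3])
               for i, a in enumerate(order)
               for b in order[i + 1:])

def linearizable(history):
    """Each op is (process, kind, value, invoke_time, response_time).
    True if some total order explains the observed history."""
    return any(respects_real_time(order) and legal_sequential(order)
               for order in permutations(history))
```

For example, a read that returns a stale value after a write has already completed admits no valid order, so the checker reports the history as not linearizable; a read concurrent with a write may legally linearize on either side of it.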
empirical. Each builds an understand-
ing of how a system operates under
faults by running the system and observ-
ing its behavior. Both approaches offer
a pay-as-you-go method to resiliency:
the initial cost of integration is low,
and the more experiments that are
performed, the higher the confidence
that the system under test is robust.
Because these approaches represent
a straightforward enrichment of exist-
ing best practices in testing with well-
understood fault injection techniques,
they are easy to adopt. Finally, and
perhaps most importantly, both ap-
proaches have been shown to be effec-
tive at identifying bugs.
also have a fatal flaw: they are manual
processes that require an extremely
sophisticated operator. Chaos Engi-
neers are a highly specialized subclass
of site reliability engineers. To devise
a custom fault injection strategy, a
Chaos Engineer typically meets with
different service teams to build an
of various components and their in-
teractions. The Chaos Engineer then
targets those services and interactions
that seem likely to have latent fault tol-
erance weaknesses. Not only is this ap-
proach difficult to scale since it must
be repeated for every new composition
of services, but its critical currency—
a mental model of the system under
study—is hidden away in a person’s
brain. These points are reminiscent
of a bigger (and more worrying) trend
in industry toward reliability priest-
hoods,7 complete with icons (dash-
boards) and rituals (playbooks).
that anyone can use, but to the best of
our knowledge all of the reported bugs
discovered by Jepsen to date were dis-
covered by its inventor, Kyle Kingsbury,
who currently operates a “distributed
systems safety research” consultancy.24
Applying Jepsen to a storage system
requires that the superuser carefully read
the system documentation, generate
workloads, and observe the externally
visible behaviors of the system under
test. It is then up to the operator to
choose—from the massive combina-
torial space of “nemeses,” including
machine crashes and network parti-
tions—those fault schedules that are
likely to drive the system into returning
incorrect responses.
death for systems that need to keep up
with software evolution. Human atten-
tion should always be targeted at tasks
that computers cannot do! Moreover,
the specialists that Chaos and Jepsen
testing require are expensive and rare.
Here, we show how geniuses can be ab-
stracted away from the process of fail-
ure testing.
Rapidly changing assumptions about
our visibility into distributed system
internals have made obsolete many
if not all of the classic approaches to
software quality, while emerging “cha-
os-based” approaches are fragile and
unscalable because of their genius-in-
the-loop requirement.
failure testing by looking at how the
same changing environments that has-
tened the demise of time-tested resil-
iency techniques can enable new ones.
experts out of the failure-testing loop is
to imitate their best practices in soft-
ware and show how the emergence of
sophisticated observability infrastruc-
ture makes this possible.
scale distributed systems, the three
fundamental assumptions of tradi-
tional approaches to software quality
are quickly fading in the rearview mir-
ror. The first to go was the belief that
you could rely on experts to solve the
hardest problems in the domain. Sec-
ond was the assumption that a formal
specification of the system is available.
Finally, any program analysis (broadly
defined) that requires that source code
is available must be taken off the ta-
ble. The erosion of these assumptions
helps explain the move away from clas-
sic academic approaches to resiliency
in favor of the black box approaches
described earlier.
ing the behavior of complex systems
in this new reality? Luckily, the fact
that it is more difficult than ever to
understand distributed systems from
the inside has led to the rapid evolu-
tion of tools that allow us to under-
stand them from the outside. Call-
graph logging was first described by
Google;51 similar systems are in use
at Twitter,4 Netflix,1 and Uber,50 and
the technique has since been stan-
dardized.43 It is reasonable to assume
that a modern microservice-based
Internet enterprise will already have
instrumented its systems to collect
call-graph traces. A number of start-
ups that focus on observability have
recently emerged.21,34 Meanwhile,
provenance collection techniques
for data processing systems11,22,42 are
becoming mature, as are operating
system-level provenance tools.44 Re-
cent work12,55 has attempted to infer
causal and communication structure
of distributed computations from
raw logs, bringing high-level explana-
tions of outcomes within reach even
for uninstrumented systems.
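A call-graph trace of this kind can be represented minimally as parent-linked spans. The record shape below is a simplified stand-in, not the schema of Dapper, Zipkin, or any particular tracer, which also carry trace IDs, timestamps, and annotations.

```python
from collections import defaultdict

def build_call_graph(spans):
    """Reconstruct one request's service call graph from span records.
    Each span is (span_id, parent_span_id, service); the root has parent None.
    Returns {service: set of services it called}."""
    by_id = {span_id: service for span_id, _parent, service in spans}
    graph = defaultdict(set)
    for _span_id, parent, service in spans:
        if parent is not None:
            graph[by_id[parent]].add(service)
    return dict(graph)
```

Even this toy form captures what matters for failure testing: which services stand between a request and its outcome, and therefore where faults could plausibly be masked by redundancy.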
"Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests."
throughout the system’s execution (for
example, system invariants or safety
properties) or established during execu-
tion (for example, liveness properties).
Most distributed systems with which
we interact, though their executions
may be unbounded, nevertheless pro-
vide finite, bounded interactions that
have outcomes. For example, a broad-
cast protocol may run “forever” in a re-
active system, but each broadcast deliv-
ered to all group members constitutes
a successful execution.
this way, we can revise the definition:
A system is fault tolerant if it provides
sufficient mechanisms to achieve its
successful outcomes despite the given
class of faults.
target weaknesses in the façade. If we
could understand all of the ways in
which a system can obtain its good
outcomes, we could understand which
faults it can tolerate (or which faults it
could be sensitive to). We assert that
(whether they realize it or not!) the
process by which Chaos Engineers
and Jepsen superusers determine, on
a system-by-system basis, which faults
to inject uses precisely this kind of rea-
soning. A target experiment should
exercise a combination of faults that
knocks out all of the supports for an ex-
pected outcome.
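One way to read "knocks out all of the supports" is as a hitting-set computation: if each alternative way of producing the expected outcome is a set of resources, a worthwhile experiment disables at least one resource in every alternative. The brute-force sketch below is our illustration of that framing; the function and variable names are ours, not from any tool, and the enumeration is exponential in the number of resources.

```python
from itertools import chain, combinations

def experiments(supports, max_faults=3):
    """Yield minimal fault sets that intersect every support set.
    supports: list of sets of resources, each an independent way
    for the system to produce its expected outcome."""
    resources = sorted(set(chain.from_iterable(supports)))
    found = []
    for k in range(1, max_faults + 1):
        for faults in combinations(resources, k):
            fs = set(faults)
            if any(fs >= f for f in found):
                continue  # skip supersets of minimal sets already found
            if all(fs & s for s in supports):
                found.append(fs)
                yield fs
```

For an outcome reachable via (A and B) or via C alone, no single fault suffices, but {A, C} and {B, C} each knock out both alternatives, so those are the experiments worth running.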
out to be the easy part. Fault injection
infrastructure, much like observability
infrastructure, has evolved rapidly in
recent years. In contrast to random,
coarse-grained approaches to distrib-
uted fault injection such as Chaos
Monkey,23 approaches such as FIT
(failure injection testing)17 and Grem-
lin32 allow faults to be injected at the
granularity of individual requests with
high precision.
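Request-granularity injection is typically implemented by carrying a fault plan in the request context and perturbing only matching (request, call-site) pairs, so production traffic outside the experiment is untouched. The sketch below is our illustration of that pattern, not the actual FIT or Gremlin API; every name in it is hypothetical.

```python
class FaultPlan:
    """A fault plan that travels with a single tagged request."""
    def __init__(self, request_id, failures):
        self.request_id = request_id
        self.failures = failures  # e.g. {"catalog": "error", "db": "delay"}

def call(service, handler, ctx):
    """Boundary wrapper: perturb the call only if this request's plan targets it."""
    plan = ctx.get("fault_plan")
    if plan and ctx["request_id"] == plan.request_id:
        action = plan.failures.get(service)
        if action == "error":
            raise RuntimeError(f"injected fault in {service}")
        if action == "delay":
            # Stand-in for a real sleep long enough to trip the caller's timeout.
            ctx.setdefault("injected_delays", []).append(service)
    return handler()
```

Untagged requests, and tagged requests calling services outside the plan, pass through unmodified, which is what makes this safe to run against live traffic.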
fectively automated. The emergence of
sophisticated tracing tools described
earlier makes it easier than ever to
build redundancy models even from
the executions of black box systems.
The rapid evolution of fault injection
infrastructure makes it easier than
ever to test fault hypotheses on large-
scale systems. Figure 1 illustrates how
the automation described here
fits neatly between existing observ-
While this quote is anecdotal, it is difficult to
imagine a better example of the fun-
damental unscalability of the current
state of the art. A single person can-
not possibly keep pace with the ex-
plosion of distributed system imple-
mentations. If we can take the human
out of this critical loop, we must; if we
cannot, we should probably throw in
the towel.
to automate any process is to compre-
hend the human component that we
would like to abstract away. How do
Chaos Engineers and Jepsen superus-
ers apply their unique genius in prac-
tice? Here is the three-step recipe com-
mon to both approaches.
The human element of the Chaos and
Jepsen processes begins with princi-
pled observation, broadly defined.
ing the external API of services rel-
evant to a given class of interactions,
meet with the engineering teams to
better understand the details of the
implementations of the individual
level interactions among services, the
engineer will then peruse call-graph
traces in a trace repository.3
by reviewing the product documenta-
tion, both to determine the guarantees
that the system should uphold and to
learn something about the mecha-
nisms by which it does so. From there,
the superuser builds a model of the
behavior of the system based on inter-
action with the system’s external API.
Since the systems under study are typ-
ically data management and storage,
these interactions involve generating
histories of reads and writes.31
can go wrong in a distributed system is
watching things go right: observing the
system in the common case.
the system tolerates faults. The com-
mon next step in both approaches is
the most subtle and subjective. Once
there is a mental model of how a dis-
tributed system behaves (at least in the
common case), how is it used to help
choose the appropriate faults to inject?
At this point we are forced to dabble in
conjecture: bear with us.
Given some fixed set of faults, we say that
a system is “fault tolerant” exactly if it
operates correctly in all executions in
which those faults occur. What does it
mean to “operate correctly”? Correct-
ness is a system-specific notion, but,
broadly speaking, is expressed in terms
tion infrastructure, consuming the
former, maintaining a model of system
redundancy, and using it to param-
eterize the latter. Explanations of sys-
tem outcomes and fault injection in-
frastructures are already available. In
the current state of the art, the puzzle
piece that fits them together (models of
redundancy) is a manual process. LDFI
(as we will explain) shows that automa-
tion of this component is possible.
In previous work, we introduced a bug-
finding tool called LDFI (lineage-driven
fault injection).2 LDFI uses data prove-
nance collected during simulations of
distributed executions to build deriva-
tion graphs for system outcomes. These
graphs function much like the models
of system redundancy described ear-
lier. LDFI then converts the derivation
graphs into a Boolean formula whose
satisfying assignments correspond to
combinations of faults that invalidate
all derivations of the outcome. An ex-
periment targeting those faults will
then either expose a bug (that is, the ex-
pected outcome fails to occur) or reveal
additional derivations (for example, af-
ter a timeout, the system fails over to a
backup) that can be used to enrich the
model and constrain future solutions.
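The loop just described (solve the current model for a fault combination, run the experiment, then either report a bug or enrich the model with the newly revealed derivation) can be caricatured in a few lines. This is our paraphrase of the published idea, not LDFI's implementation; `run_system` is a hypothetical harness that injects the given faults and reports which derivations the system actually used.

```python
from itertools import chain, combinations

def next_experiment(supports, max_faults):
    """Smallest fault set that intersects every known support, or None."""
    resources = sorted(set(chain.from_iterable(supports)))
    for k in range(1, max_faults + 1):
        for faults in combinations(resources, k):
            if all(set(faults) & s for s in supports):
                return set(faults)
    return None

def ldfi_loop(run_system, initial_supports, max_faults=3, max_rounds=10):
    """Caricature of lineage-driven fault injection.
    run_system(faults) -> (outcome_ok, supports_observed).
    Returns fault sets that broke the expected outcome (candidate bugs)."""
    supports = list(initial_supports)
    bugs = []
    for _ in range(max_rounds):
        candidate = next_experiment(supports, max_faults)
        if candidate is None:
            break  # outcome survives every combination up to max_faults
        ok, observed = run_system(candidate)
        if not ok:
            bugs.append(candidate)  # expected outcome failed to occur
            break
        new = [s for s in observed if s not in supports]
        if not new:
            break  # model cannot explain the survival; a real tool refines further
        supports.extend(new)  # e.g., a failover path revealed after a timeout
    return bugs
```

With a system whose outcome is supported by primary A and, once A is down, by backup B, the loop first injects {A}, learns about the backup, and then discovers that {A, B} together break the outcome.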
understood techniques from data
management systems, treating fault
tolerance as a materialized view main-
tenance problem.2,13 It models a dis-
tributed system as a query, its expect-
ed outcomes as query outcomes, and
critical facts such as “replica A is up at
time t” and “there is connectivity be-
tween nodes X and Y during the inter-
val i . . . j” as base facts. It can then ask
a how-to query:37 What changes to base
data will cause changes to the derived
data in the view? The answers to this
query are the faults that could, accord-
ing to the current model, invalidate the
expected outcomes.
LDFI approach shows a great deal of
promise. The initial prototype demon-
strated the efficacy of the approach at
the level of protocols, identifying bugs
in replication, broadcast, and commit
protocols.2,46 Notably, LDFI reproduced
a bug in the replication protocol used by
the Kafka distributed log26 that was first
A later iteration of LDFI is deployed at
Netflix,1 where (much like the illustra-
tion in Figure 1) it was implemented
as a microservice that consumes traces
from a call-graph repository service and
provides inputs for a fault injection ser-
vice. Since its deployment, LDFI has
identified 11 critical bugs in user-fac-
ing applications at Netflix.1
The prior research presented earlier is
only the tip of the iceberg. Much work
still needs to be undertaken to realize
the vision of fully automated failure
testing for distributed systems. Here,
we highlight nascent research that
shows promise and identifies new di-
rections that will help realize our vision.
context of resiliency testing for distribut-
ed systems, attempting to enumerate
and faithfully simulate every possible
kind of fault is a tempting but dis-
tracting path. The problem of under-
standing all the causes of faults is not
directly relevant to the target, which
is to ensure that code (along with its
configuration) intended to detect and
mitigate faults performs as expected.
the left shows a microservice-based
architecture; arrows represent calls
generated by a client request. The
right-hand side zooms in on a pair of
interacting services. The shaded box
in the caller service represents the
fault tolerance logic that is intended
to detect and handle faults of the cal-
lee. Failure testing targets bugs in this
logic. The fault tolerance logic targeted
in this bug search is represented as the
shaded box in the caller service, while
the injected faults affect the callee.
the perspective of the caller, is explicit
error returns, corrupted responses,
and (possibly infinite) delay. Of these
manifestations, the first two can be ad-
equately tested with unit tests. The last
is difficult to test, leading to branches
of code that are infrequently executed.
If we inject only delay, and only at com-
ponent boundaries, we conjecture that
we can address the majority of bugs re-
lated to fault tolerance.
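The conjecture above implies a very small injection surface: wrap calls at component boundaries and add a delay that exceeds the caller's timeout, so the rarely executed timeout-handling branches actually run. A minimal sketch follows; the wrapper and its parameters are hypothetical, not from any fault-injection framework.

```python
import time

def with_injected_delay(call, delay_s, should_inject):
    """Wrap a cross-component call; inject delay only at the boundary.
    should_inject() decides per call; delay_s should exceed the caller's
    timeout so that timeout-handling code paths are exercised."""
    def wrapped(*args, **kwargs):
        if should_inject():
            time.sleep(delay_s)
        return call(*args, **kwargs)
    return wrapped
```

In practice `should_inject` would consult the request-scoped fault plan, and `delay_s` would be chosen per call site from the caller's configured timeout.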
provide better explanations of system
outcomes, we can build better models
away) this uncertainty.
listic by nature and are arguably bet-
ter modeled probabilistically. Future
directions of work include the proba-
bilistic representation of system re-
dundancy and an exploration of how
this representation can be exploited to
guide the search of fault experiments.
We encourage the research community
to join in exploring alternative internal
representations of system redundancy.
out. Most of the classic work on data
provenance in database research has
focused on aspects related to human-
computer interaction. Explanations of
why a query returned a particular result
can be used to debug both the query
and the initial database—given an un-
expected result, what changes could be
made to the query or the database to fix
it? By contrast, in the class of systems
we envision (and for LDFI concretely),
explanations are part of the internal
language of the reasoner, used to con-
struct models of redundancy in order
to drive the search through faults.
role in both worlds. After all, when a
bug-finding tool such as LDFI identi-
fies a counterexample to a correctness
property, the job of the programmers
has only just begun—now they must un-
dertake the onerous job of distributed
debugging. Tooling around debugging
has not kept up with the explosive pace
of distributed systems development.
We continue to use tools that were de-
signed for a single site, a uniform mem-
ory, and a single clock. While we are not
certain what an ideal distributed debug-
ger should look like, we are quite certain
that it does not look like GDB (GNU Proj-
ect debugger).36 The derivation graphs
used by LDFI show how provenance can
also serve a role in debugging by provid-
ing a concise, visual explanation of how
the system reached a bad state.
further. To understand the root causes
of a bug in LDFI, a human operator
must review the provenance graphs of
the good and bad executions and then
examine the ways in which they differ.
Intuitively, if you could abstractly
subtract the (incomplete by assump-
tion) explanations of the bad outcomes
from the explanations of the good out-
rier to entry for systems such as LDFI
is the unwillingness of software de-
velopers and operators to instrument
their systems for tracing or provenance
collection. Fortunately, operating sys-
tem-level provenance-collection tech-
niques are mature and can be applied
to uninstrumented systems.
makes simulating distributed execu-
tions of black box software within a
single hypervisor easier than ever. We
are actively exploring the collection
of system call-level provenance from
unmodified distributed software in
order to select a custom-tailored fault
injection schedule. Doing so requires
extrapolating application-level causal
structure from low-level traces, iden-
tifying appropriate cut points in an
observed execution, and finally syn-
chronizing the execution with fault
injection actions.
sibility of inferring high-level explana-
tions from even noisier signals, such as
raw logs. This would allow us to relax
the assumption that the systems un-
der study have been instrumented to
collect execution traces. While this is
a difficult problem, work such as the
Mystery Machine12 developed at Face-
book shows great promise.
The LDFI system represents system redundancy
using derivation graphs and treats the
task of identifying possible bugs as a
materialized-view maintenance prob-
lem. LDFI was hence able to exploit
well-understood theory and mecha-
nisms from the history of data man-
agement systems research. But this is
just one of many ways to represent how
a system provides alternative computa-
tions to achieve its expected outcomes.
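One concrete reading of "alternative computations to achieve expected outcomes" is the hitting-set formulation often associated with LDFI: if each known alternative is modeled as the set of components it depends on, a promising fault-injection experiment disables at least one component in every alternative. A minimal sketch, with hypothetical component names:

```python
from itertools import chain, combinations

# Each set is one known alternative way the system can produce its
# expected outcome (component names are invented for illustration).
alternatives = [{"replica1", "net_a"},
                {"replica2", "net_a"},
                {"replica1", "net_b"}]

def minimal_fault_sets(alts):
    """Return the smallest fault sets that break every known alternative."""
    universe = sorted(set(chain.from_iterable(alts)))
    for r in range(1, len(universe) + 1):
        hits = [set(c) for c in combinations(universe, r)
                if all(set(c) & a for a in alts)]
        if hits:
            return hits
    return []

print(minimal_fault_sets(alternatives))
```

Each returned set is a candidate experiment: if the system still succeeds under those faults, it has revealed a previously unknown alternative; if it fails, it has revealed a bug.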
A weakness of LDFI is its reliance on assumptions of determinism. In particular, it assumes
that if it has witnessed a computation
that, under a particular contingency
(that is, given certain inputs and in the
presence of certain faults), produces
a successful outcome, then any future
computation under that contingency
will produce the same outcome. That
is to say, it ignores the uncertainty in
timing that is fundamental to distributed systems. A more appropriate way to model system redundancy would be one that accounts for this uncertainty.
GDB, DDD, and Eclipse. No Starch Press, 2008.
how-to queries. Proceedings of the ACM SIGMOD
International Conference on the Management of Data
(2012), 337-348.
fault analysis service, 2016; https://azure.microsoft.
com/en-us/documentation/articles/service-fabric-
testability-overview/.
model checking real code. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (2002), 75–88.
Heisenbugs in concurrent programs. In Proceedings
of the 8th Usenix Conference on Operating Systems
Design and Implementation (2008), 267–280.
Amazon Web Services. Technical Report, 2014; http://
lamport.azurewebsites.net/tla/formal-methods-amazon.
for custom monitoring and debugging of distributed
data flows. In Proceedings of the ACM SIGMOD
International Conference on the Management of Data
(2011), 1221–1224.
44. Pasquier, T.F.J.-M., Singh, J., Eyers, D.M., Bacon, J.
2015; https://arxiv.org/pdf/1506.04391.
redundant arrays of inexpensive disks (RAID). In
Proceedings of the 1988 ACM SIGMOD International
Conference on Management of Data, 109–116;
http://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.
Proceedings of the 9th Usenix Workshop on Hot Topics
in Cloud Computing (2017).
opportunities microservices provide. Uber Engineering,
2016; https://eng.uber.com/building-tincup/.
arguments in system design. ACM Trans. Computing
Systems 2, 4 (1984): 277–288.
implementation and experience. Technical report, Sun
Microsystems. In Proceedings of the Summer 1986
Usenix Technical Conference and Exhibition.
Uber Engineering, 2017; https://uber.github.io/jaeger/.
systems tracing infrastructure. Technical report.
Research at Google, 2010; https://research.google.
com/pubs/pub36356.html.
failure induction framework. Linkedin Engineering,
2016; https://engineering.linkedin.com/blog/2016/03/
deep-dive-Simoorg-open-source-failure-induction-
framework.
model checking of unmodified distributed systems.
In Proceedings of the 6th Usenix Symposium on
Networked Systems Design and Implementation
(2009), 213–228.
specifications. In Proceedings of the 10th IFIP WG
10.5 Advanced Research Working Conference on
Correct Hardware Design and Verification Methods
(1999), 54–66.
profiler for distributed systems. In Proceedings of the
11th Usenix Conference on Operating Systems Design
and Implementation (2014), 629–644.
science at the University of California Santa Cruz,
where he leads the Disorderly Labs research group
(disorderlylabs.github.io).
documentation for both internal and external users
of enterprise and open source software, including for
Microsoft, CNET, VMware, and Oracle.
A sea change is occurring in the tech-
niques used to determine whether
distributed systems are fault tolerant.
The emergence of fault injection ap-
proaches such as Chaos Engineering
and Jepsen is a reaction to the erosion
of the availability of expert program-
mers, formal specifications, and uni-
form source code. For all of their prom-
ise, these new approaches are crippled
by their reliance on superusers who
decide which faults to inject.
Here, we propose a way of modeling and ultimately automating the process carried out by these superusers. The
enabling technologies for this vision
are the rapidly improving observabil-
ity and fault injection infrastructures
that are becoming commonplace in
the industry. While LDFI provides con-
structive proof that this approach is
possible and profitable, it is only the
beginning. Much work remains to be
done in targeting faults at a finer grain,
constructing more accurate models of
system redundancy, and providing bet-
ter explanations to end users of exactly
what went wrong when bugs are identi-
fied. The distributed systems research
community is invited to join in explor-
ing this new and promising domain.
1. Alvaro, P. et al. Automating failure-testing research at Internet scale. In Proceedings of the ACM Symposium on Cloud Computing (2016), 17–28.
fault injection. In Proceedings of the ACM SIGMOD
International Conference on Management of Data
(2015), 331–346.
4. Aniszczyk, C. Distributed systems tracing with Zipkin.
distributed-systems-tracing-with-zipkin.
reliable. DevOps.com; http://devops.com/2014/06/03/
inject-failure/.
(2016), 35–41.
Reliability Engineering. O’Reilly, 2016.
procedure calls. ACM Trans. Computer Systems 2, 1
(1984), 39–59.
failure detector for solving consensus. J. ACM 43, 4
(1996), 685–722.
better network diagnostics with differential
provenance. In Proceedings of the ACM SIGCOMM
Conference (2016), 115–128.
Explaining outputs in modern data analytics. In
Proceedings of the VLDB Endowment 9, 12 (2016):
1137–1148.
performance analysis of large-scale Internet services.
In Proceedings of the 11th Usenix Conference on
Operating Systems Design and Implementation
(2014), 217–231.
view data in a warehousing environment. ACM Trans.
Database Systems 25, 2 (2000), 179–227.
Fault Injection Environment for Distributed Systems.
In Proceedings of the 26th International Symposium
on Fault-tolerant Computing, (1996).
of distributed consensus with one faulty process.
J. ACM 32, 2 (1985): 374–382; https://groups.csail.mit.
edu/tds/papers/Lynch/jacm85.
fault tolerance of distributed protocols. In Tools
and Algorithms for the Construction and Analysis of
Systems, Lecture Notes in Computer Science 4963,
Springer Verlag (2008), 315–331.
injection testing. Netflix Technology Blog; http://
techblog.netflix.com/2014/10/fit-failure-injection-
testing.html.
be done about it? Tandem Technical Report 85.7
(1985); http://www.hpl.hp.com/techreports/
tandem/TR-85.7.
for cloud recovery testing. In Proceedings of the 8th
Usenix Conference on Networked Systems Design
and Implementation (2011), 238–252; http://db.cs.
berkeley.edu/papers/nsdi11-fate-destini.
Reference Manual. Addison-Wesley Professional, 2003.
22. Interlandi, M. et al. Titian: Data provenance support in
(2015), 216–227.
Netflix Technology Blog; http://techblog.netflix.
com/2011/07/ netflix-simian-army.html.
http://jepsen.io/.
26. Kafka 0.8.0. Apache, 2013; https://kafka.apache.
27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. Ferrari:
system. IEEE Trans. Computers 44, 2 (1995): 248–260.
on distributed computing. Technical Report, 1994. Sun
Microsystems Laboratories.
death, and the critical transition: Finding liveness
bugs in systems code. Networked System Design and
Implementation, (2007); 243–256.
aphyr.com/posts/293-call-me-maybe-kafka.
32. Lafeldt, M. The discipline of Chaos Engineering.
discipline-of-chaos-engineering-e39d2383c459.
Systems—Architecture and Implementation, An Advanced Course (1980), 246–265; https://link.
springer.com/chapter/10.1007%2F3-540-10571-9_11.
35. Marinescu, P.D., Candea, G. LFI: A practical and
International Conference on Dependable Systems and
Networks (2009).
Much has been published about reuse of objects, subroutines, and components through software product lines. However, relatively little has
been published on very large-grained reuse. One
example of this type of large-grained reuse might be
that of an entire Internet banking system (applications
and infrastructure) reused in business units all over
the world. In contrast, “large scale” software reuse
in current research generally refers to systems that
reuse a large number of smaller components, or that
perhaps reuse subsystems.9 In this article, we explore a
case of an organization with an internal development
group that has been very successful with large-grained
software reuse.
BigFinancial, and its Technology Center (BTC) in particular, have created a number of
software systems that have been reused in multiple
businesses and in multiple countries. BigFinancial
and BTC thus provided a rich source of data for
case studies to look at the characteristics of those
projects and why they have been successful, as well
as to look at projects that have been less successful
and to understand what has caused those results and
what might be done differently to prevent issues in
the future. The research is focused on technology,
process, and organizational elements of the
development process, rather than on specific product
features and functions.
Reuse at a large-grained level may help to alleviate some of the
issues that occur in more traditional
reuse programs, which tend to be finer-
grained. In particular, because BigFi-
nancial was trying to gain commonal-
ity in business processes and operating
models, reuse of large-grained compo-
nents was more closely aligned with its
business goals. This same effect may
well not have happened with finer-
grained reuse, due to the continued
ability of business units to more readily
pick and choose components for reuse.
BTC is a technology unit of BigFinancial, with operations
in both the eastern and western US. Ap-
proximately 500 people are employed
by BTC, reporting ultimately through a
single line manager responsible to the
Global Retail Business unit head of Big-
Financial. BTC is organized to deliver
both products and infrastructure com-
ponents to BigFinancial, and its prod-
uct line has through the years included
consumer Internet banking services,
teller systems, ATM software, and net-
work management tools. BigFinancial
has its U.S. operations headquartered
in the eastern U.S., and employs more
than 8,000 technologists worldwide.
The authors selected three cases for further study from a pool
of about 25. These cases were the Java
Banking Toolkit (JBT) and its related ap-
plication systems, the Worldwide Single
Signon (WSSO) subsystem, and the Big-
Financial Message Switch (BMS).
Reuse and BigFinancial
Various definitions appear in the lit-
erature for software reuse. Karlsson de-
fines software reuse as “the process of
creating software systems from existing
software assets, rather than building
software systems from scratch.” One
taxonomy of the approaches to software
reuse includes notions of the scope of
reuse, the target of the reuse, and the
granularity of the reuse.5 The notion of
granularity is a key differentiator of the
type of software reuse practiced at BigFinancial, as BigFinancial has demonstrated reuse at a very large grain. The JBT infrastructure is already reused for multiple applications.
To some extent, these multiple appli-
cations could be studied as subcases,
though they have thus far tended to be
deployed as a group. In addition, the
online banking, portal services, and
alerts functions are themselves reused
at the application level across multiple
business units globally.
Several current and recent projects showed
significant reuse across independent
business units that could have made
alternative technology development
decisions. The results are summarized
in Table 1.
Required to support multiple languages and
business-specific functional variabili-
ty, BTC found that it was able to accom-
modate these requirements by design-
ing its products to be rule-based, and by
designing its user interface to separate
content from language. In this manner,
business rules drove the behavior of
the Internet banking applications, and
language- and format-definition tools
drove the details of application behav-
ior, while maintaining a consistent set
of underlying application code.
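The separation described above can be illustrated with a toy sketch; all rule values, business names, and phrases here are invented, not BTC's actual design:

```python
# Business rules and language catalogs are data; the application
# code below stays identical across business units.
RULES = {"us_retail": {"min_transfer": 1},
         "eu_retail": {"min_transfer": 5}}
PHRASES = {"en": {"too_small": "Transfer amount is below the minimum."},
           "de": {"too_small": "Der Betrag liegt unter dem Minimum."}}

def validate_transfer(amount, business, lang):
    # Behavior is driven by the business rule; wording by the catalog.
    if amount < RULES[business]["min_transfer"]:
        return PHRASES[lang]["too_small"]
    return "OK"

print(validate_transfer(3, "eu_retail", "de"))
print(validate_transfer(3, "us_retail", "en"))
```

Deploying to a new business unit then means supplying new rule and phrase tables rather than forking the application code.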
BTC was also responsible for creation of system infrastructure components, built on top of industry-standard commercial operating
the banking functionality required
by its customers within BigFinancial.
The functions of these infrastructure
components included systems man-
agement, high-reliability logging pro-
cesses, high-availability mechanisms,
and other features not readily available
in commercial products at the time
that the components were created. The
same infrastructure was used to support consumer Internet banking as well – building a system once and reusing it in multiple businesses.
Software product-line approaches such as that proposed by Griss4 and further expanded upon by Clements and Northrop2 and by Krueger6 suggest that
software components can be treated
similarly to the notions used in manu-
facturing – reusable parts that contrib-
ute to consistency across a product line
as well as to improved efficiencies in
manufacturing. Benefits of such reuse
include the high levels of commonal-
ity of such features as user interfaces,7
which increases switching costs and
customer loyalty in some domains.
This could logically extend to banking
systems in the form of common func-
tionality and user interfaces across
systems within a business, and across
business units.
We first examined instances of successful, large-grained reuse projects. We identified projects that
have been successfully reused across a
wide range of business environments
or business domains, resulting in sig-
nificant benefit to BigFinancial. These
included the JBT platform and its re-
lated application packages, as well as
the Worldwide SSO product. These
projects demonstrated broad success,
and the authors evaluated these for evi-
dence to identify what contributed to,
and what may have worked against, the
success of each project.
We also examined a project that has been successfully reused across a relatively narrow range of
business environments. This project,
the BigFinancial Message Switch (BMS)
was designed for a region-wide level of
reuse, and had succeeded at that level.
As such, it appears to have invested ap-
propriately in features and capabilities
needed for its client base, and did not
appear to have over-invested.
JBT and Related Services
We focused on BTC’s multi-use Java
Banking Toolkit (JBT) as a model of
a successful project. The Toolkit is
in wide use across multiple business
units, and represents reuse both at the
largest-grained levels as well as reuse
of large-scale infrastructure components. JBT supports three application sets today: online banking, portal services, and alerts. BTC's earlier consumer Internet banking services will be identified here as the Legacy Internet Banking product (LIB).
LIB's Internet transaction services were accomplished via another instance of reuse. Taking its pre-Internet banking
components, BTC was able to “scrape”
the content from the pages displayed
in that product, and wrap HTML code
around them for display on a Web
browser. Other components were re-
sponsible for modifying the input and
menuing functions for the Internet.
The goal of this approach was to deliver a product to the Internet more rapidly, without modification of the legacy business
logic, thereby reducing risk as well. In
what amounted to an early separation
of business and presentation logic, the
pre-Internet business logic remained
in place, and the presentation layer
re-mapped its content for the browser
environment.
BTC recognized two key issues that needed to
be addressed. The platform for their
legacy Internet Banking application
was nearing end of life (having been
first deployed in 1996), and there were
too many disparate platforms for its
consumer Internet offerings. BTC’s
Internet banking, alerts, and portal
functions each required separate hard-
ware and operating environments.
BTC planned its activities such that the
costs of the new development could
fit within the existing annual mainte-
nance and new development costs al-
ready being paid by its clients.
trust in BTC’s organization as a key to
allowing BTC the opportunity to devel-
op the JBT product. In addition, BTC’s
prior success with reusing software
components at fine and medium gran-
automated Teller Machines
components – >35 businesses worldwide
reuse as a best practice.
BTC proposed an integrated platform and application
set for a range of consumer Internet
functions. The infrastructure package,
named the Java Banking Toolkit (JBT),
was based on Java 2 Enterprise Edition
(J2EE) standards and was intended
to allow BigFinancial to centralize its
server infrastructure for consumer
Internet functions. The authors con-
ducted detailed interviews with several
BTC managers and architects, and re-
viewed several hundred documents.
Current deployment statistics for JBT
are shown in Table 2.
The JBT applications were designed and built by
BTC and its regional partners, with in-
put from its clients around the world.
BTC’s experience had shown that con-
sumer banking applications were not
fundamentally different from one an-
other across the business units, and
BTC proposed and received funding
for creation of a consolidated applica-
tion set for Internet banking. A market
evaluation determined that there were
no suitable, globally reusable, complete applications on the market, nor any vendor with the record of success required for confidence in the delivery. Final funding
approval came from BigFinancial tech-
nology and business executives.
Requirements were defined for several major functional elements.
The requirements were broken out
among the infrastructural elements
supporting the various planned appli-
cation packages, and the applications
themselves. The applications delivered
with the initial release of JBT included
a consumer Internet banking applica-
tion set, an account activity and bal-
ance alerting function, and a portal
content toolset.
Each JBT application was designed to be reused intact in each business unit around the world, requiring
only changes to business rules and
language phrases that may be unique
to a business. One of the fundamental
requirements for each of the JBT appli-
cations was to include capabilities that
were designed to be common to and
shared by as many business units as
possible, while allowing for all neces-
sary business-specific variability.
The requirements process built on the LIB infrastructure and applications, as well as the legacy portal and
alerts services that were already in pro-
duction. Examples of the region- and
business-specific variability include
language variations, compliance with
local regulatory requirements, and
functionality based on local and re-
gional competitive requirements.
Top-level requirements were organized across a range of categories. These
categories included technology, opera-
tions, deployment, development, and
tools. These requirements were in-
tended to form the foundation for ini-
tial discussion and agreement with the
stakeholders, and to support division of
the upcoming tasks to define the archi-
tecture. Nine additional, more detailed,
requirements documents were created
to flesh out the details referenced in
the top-level requirements. Additional
topics addressed by the detailed docu-
ments included language, business
rules, host messaging, logging, portal
services, and system management.
One of BigFinancial's technology leaders reported that JBT has been much easier to integrate than the
legacy product, given its larger applica-
tion base and ability to readily add ap-
plications to it. Notably, he indicated
that JBT’s design had taken into ac-
count the lessons learned from prior
products, including improvements in
performance, stability, and total cost
of ownership. This resulted in a “win/
win/win for businesses, technology
groups, and customers.”
Data from BigFinancial indicates that the cost savings
for first-time business unit implemen-
tations of products already deployed to
other business units averaged between
20 and 40%, relative to the cost of new de-
velopment. Further, the cost savings for
subsequent deployments of updated re-
leases to a group of business units result-
ed in cost savings of 50% – 75% relative to
the cost of maintaining the software for
each business unit independently.
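As a worked illustration of those reported ranges (the dollar figures are invented for the example; only the 20–40% and 50–75% ranges come from the article):

```python
# Hypothetical base costs; only the savings percentages are from the text.
new_dev_cost = 10_000_000        # building the product from scratch
first_deploy = [new_dev_cost * (1 - s) for s in (0.20, 0.40)]

independent_maint = 2_000_000    # maintaining one unit's fork per release
shared_release = [independent_maint * (1 - s) for s in (0.50, 0.75)]

print(first_deploy)    # first-time implementation cost range
print(shared_release)  # per-unit cost range for shared releases
```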
Not every function is supported by a single global application set. There remain, in some cases,
functions required only by a specific
business or region. The JBT architec-
ture allows for those region-specific
applications to be developed by the
regional technology unit as required.
An overview of the JBT architecture is
shown in Figure 1.
The JBT design makes use of a layered architecture,12 focusing on interoperability and modularity. For
example, the application components
interact only with the application body
section of the page; all other elements
of navigation and branding are handled
by the common and portal services.
For new deployments, savings are reduced by the cost of training, development and testing of business rules, and ramp-up of operational processes.
In contrast, ongoing maintenance sav-
ings are generally larger, due to the
commonality across the code base for
numerous business units. This com-
monality enables bug fixes, security
patches, and other maintenance activi-
ties to be performed on one code base,
rather than one for each business unit.
BigFinancial has demonstrated that it is possible for a large organization,
building software for its own internal
use, to move beyond the more common
models of software reuse. In so doing,
BigFinancial has achieved significant
economies of scale across its many
business units, and has shortened the
time to market for new deployments of
its products.
Several elements contributed to the success of the reuse projects. These included elements expected from the
more traditional reuse literature, in-
cluding organizational structure, tech-
nological foundations, and economic
factors. In addition, several new ele-
ments have been identified. These in-
clude the notions of trust and culture,
the concepts of a track record of large-
and fine-grained reuse success, and the
virtuous (and potentially vicious) cycle
of corporate mandates. Conversely,
organizational barriers prove to be the
greatest inhibitor to successful reuse.13
BTC worked over the course of many years to create and strengthen its culture of reuse. Across numerous
product lines, reuse of components and
infrastructure packages was strongly
encouraged. Reuse of large-grained
elements was the next logical step,
working with a group of business units
within a single regional organization.
This supported the necessary business
alignment to enable large-grained re-
use. In addition, due to its position as a global technology provider to BigFinancial, BTC was able to leverage its knowledge of requirements across business units, to explicitly design products to be readily reusable, and to drive commonality of requirements in support of that reuse.
to reuse, BTC’s results have provided
empirical evidence regarding the use
of various technologies and patterns
Host messaging is isolated from the application via a message abstraction layer, so
that unique messaging models can be
used in each region, if necessary.
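A minimal sketch of such a message abstraction layer; the interface and regional implementations below are hypothetical, not BTC's actual code:

```python
from abc import ABC, abstractmethod

class HostMessenger(ABC):
    """Abstract messaging interface the application codes against."""
    @abstractmethod
    def send(self, payload: dict) -> str: ...

class QueueMessenger(HostMessenger):
    """A region whose hosts communicate over a message queue."""
    def send(self, payload):
        return f"queued:{payload['txn']}"

class SocketMessenger(HostMessenger):
    """A region that uses direct socket messaging."""
    def send(self, payload):
        return f"sent:{payload['txn']}"

def post_transaction(messenger: HostMessenger, txn_id: str) -> str:
    # Application logic is identical regardless of the regional transport.
    return messenger.send({"txn": txn_id})

print(post_transaction(QueueMessenger(), "T42"))
print(post_transaction(SocketMessenger(), "T42"))
```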
JBT provides infrastructure and applications components for a range of banking functionality. The
infrastructure and applications com-
ponents are defined as independently
changeable releases, but are currently
packaged as a group to simplify the de-
ployment process.
JBT projects are coordinated through BTC,
with significant participation from the
business units. Business units have the
opportunity to choose other vendors
for their technology needs, though the
corporate technology strategy limited
that option as the JBT project gained
wider rollout status. Business units
participate in a semi-annual in-person
planning exercise to evaluate enhance-
ment requests and prioritize new busi-
ness deployments.
The authors examined a total of six dif-
ferent cases of software reuse. Three of
these were subcases of the Java Banking Toolkit (online banking, portal services, and alerts), along with the reuse of the JBT platform itself. The others were the Worldwide SSO product,
and the BigFinancial Message Switch.
There were a variety of reuse success
levels, and a variety of levels of evidence
of anticipated supports and barriers to
reuse. The range of outcomes is repre-
sented as a two dimensional graph, as
shown in Figure 2.
BigFinancial measures reuse success in a very pragmatic, straightforward fashion. Rather than measuring reused modules, lines of code, or
function points, BigFinancial instead
simply measures total deployments
of compatible code sets. Due to on-
going enhancements, the code base
continues to evolve over time, but in a
backwards-compatible fashion, so that
older versions can be and are readily
upgraded to the latest version as busi-
ness needs dictate.
BigFinancial does not maintain formal economic measures of cost savings. However, its estimates of the range of cost savings are shown in Figure 3.
Cost savings are smaller for new deployments due to the significant effort required to map business unit requirements.
Examples of these technologies and patterns are platform-independent interfaces, business rule structures, rigorous isolation
of concerns across software layers, and
versioning of interfaces to allow phased
migration of components to updated
interfaces. These techniques, among
others, are commonly recognized as
good architectural approaches for de-
signing systems, and have been exam-
ined more closely for their contribution
to the success of the reuse activities. In
this examination, they have been found
to contribute highly to the technologi-
cal elements required for success of
large-grained reuse projects.
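Interface versioning for phased migration, for instance, can be sketched as follows (the function names and payload shapes are invented for illustration):

```python
# v2 introduces a richer response; v1 is kept as a thin adapter so
# existing callers can migrate on their own schedule.
def get_balance_v2(account: str) -> dict:
    return {"account": account, "balance": 100, "currency": "USD"}

def get_balance_v1(account: str) -> int:
    return get_balance_v2(account)["balance"]

print(get_balance_v1("ACCT-1"))   # legacy callers still get a bare number
print(get_balance_v2("ACCT-1"))   # migrated callers get the v2 shape
```

Because both versions remain callable, components can move to v2 one at a time rather than in a single coordinated cutover.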
Commercial organizations such as application service providers routinely conduct this type of development and
reuse, though with different motiva-
tions. (Application service providers
are now often referred to as providers
of Software as a Service.) As commer-
cial providers, they are more likely to be
market-driven, often with sales of Pro-
fessional Services for customization. In
contrast, the motivations in evidence
at BigFinancial seemed more aimed
at achieving the best combinations of
functionality, time to market, and cost.
This research provided an opportunity to examine, in depth, the various forms of reuse practiced on three projects, and three subprojects, inside BigFinancial. Some of those forms include
design reuse, code reuse, pattern reuse,
and test case reuse. The authors have found, based on documents and reports from participants, that the active
practice of systematic, finer-grained re-
use contributed to successful reuse of
systems at larger levels of granularity.
The cases also revealed a range of management structures and leadership styles, and an opportunity to examine how those contribute to, or work against, successful reuse. Much has
been captured about IT governance in
general, and about organizational con-
structs to support reuse in various situ-
ations at BigFinancial/BTC. Leadership
of both BTC and BigFinancial was cited
as contributing to the success of the re-
use efforts, and indeed also was cited
as a prerequisite for even launching
a project that intends to accomplish
such large-grained reuse.
Prior research has examined trust in outsourced IS relationships, where the participants in projects may not know one another before a project, and may not work together after the project. As such, the establishment
and maintenance of trust is critical in
that environment. This is not entirely
applicable to BTC, as it is a peer organi-
zation to its client’s technology groups,
and its members often have long-stand-
ing relationships with their peers. Ring
and Van de Ven examine the broader
notions of cooperative inter-organiza-
tional relationships (IOR’s), and note
that trust is a fundamental part of an
IOR. Trust is used to serve to mitigate
the risks inherent in a relationship,
and at both a personal and organiza-
tional level is itself mitigated by the po-
tential overriding forces of the legal or
organizational systems.10 This element
does seem to be applicable to BTC’s en-
vironment, in that trust is reported to
have been foundational to the assign-
ment of the creation of JBT to BTC.
Culture is an element of the organizational structure that can impede reuse. A culture that
fears loss of creativity, lacks trust, or
doesn’t know how to effectively reuse
software will not be as successful as an
organization that doesn’t have these
impediments.4 The converse is likely
then also reasonable – that a culture
that focuses on and implicitly welcomes
reuse will likely be more successful.
BTC’s long history of reuse, its lack of
explicit incentives and metrics around
more traditional reuse, and its position
as a global provider of technology to its
business partners make it likely that its culture is, indeed, a strong supporter of its reuse success.
Others have commented on the impact of organizational culture on reuse. Morisio et al.8 refer in
passing to cultural factors, primarily as
potential inhibitors to reuse. Card and
Comer1 examine four cultural aspects
that can contribute to reuse adoption:
training, incentives, measurement,
and management. In addition, Card
and Comer’s work focuses generally on
cultural barriers, and how to overcome
them. In BTC’s case, however, there is
a solid cultural bias for reuse, and one
that, for example, no longer requires
incentives to promote reuse.
One interviewee had a strong opinion to offer in relation to fine- vs. coarse-grained reuse. The lead
architect for JBT was explicitly and vig-
orously opposed to a definition of reuse
– of objects and components at a fine-
grained level. This person’s opinion
was that while reuse at this granularity
was possible (indeed, BTC demonstrat-
ed success at this level), fine-grained
reuse was very difficult to achieve in a
distributed development project. The
lead architect further believed that
the leverage it provides was not nearly
as great as the leverage from a large-
grained reuse program. The integrators
of such larger-grained components can
then have more confidence that the
component has been used in a similar
environment, tested under appropri-
ate loads, and so on – relieving the risk
that a fine-grained component built for
one domain may get misused in a new
domain or at a new scale, and be unsuc-
cessful in that environment.
While JBT does, to some extent, work as part of a software
product line (supporting its three ma-
jor applications), JBT’s real reuse does
not come in the form of developing
more instances from a common set of
core assets. Rather, it appears that JBT
is itself reused, intact, to support the
needs of each of the various businesses
in a highly configurable fashion.
Organizational resistance appeared, at least in part, to contribute to the lack of broad deployment of the BigFinancial Message Switch. Gallivan3 defined
a model for technology innovation as-
similation and adoption, which includ-
ed the notion that even in the face of
management directive, some employ-
ees and organizations might not adopt
and assimilate a particular technology
or innovation. This concept might part-
ly explain the results with BMS, that it
was possible for some business units
and technology groups to resist its in-
troduction on a variety of grounds, in-
cluding business case, even with a de-
cision by a global steering committee
to proceed with deployment.
The authors also observed the impact of inter-organizational barriers
on reuse adoption, particularly in the
BMS case. This was particularly evident
in that the organization that created
BMS, and was in large part responsible
for “selling” it to other business units,
was positioned at a regional rather than
global technology level. This organizational location, along with the organization's more limited experience with large-grained reuse, contributed to the difficulty in accomplishing broader reuse of that product.
While BTC's results and BigFinancial's specific business needs may be somewhat unusual, it is likely that the business and technology practices supporting reuse may be generalizable to other banks and other technology users. Good system architecture, supporting reuse, and an established business case that identifies the business value of the reuse were fundamental to establishing the global reuse accomplished by BTC, and should be readily scalable to smaller and less global environments.
The keys to a successful project will be a solid technology foundation, experience building and maintaining reusable software, and a financial and organizational structure that supports and promotes reuse. In addition, the organization will need to actively build a culture of large-grained reuse and establish trust with its business partners. Establishing that trust will be vital to even having the opportunity to propose a large-grained reusable project.
1. Card, D. and Comer, E. Why do so many reuse programs fail? IEEE Software (1994).
2. Clements, P. and Northrop, L. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2002.
3. Gallivan, M.J. Organizational adoption and assimilation of complex technological innovations: Development and application of a new framework. The DATA BASE for Advances in Information Systems 32, 3 (2001), 51-85.
4. Griss, M.L. Software reuse: From library to factory. IBM Systems Journal 32, 4 (1993), 548-566.
5. … John Wiley & Sons, West Sussex, England, 1995.
6. … practice. Comm. ACM 49, 12 (Dec. 2006), 37-40.
7. … Reuse Revisited. Hewlett-Packard Software Technology Laboratory, Irvine, CA, 1993, 19.
8. Morisio, M., Ezran, M., and Tully, C. Success and failure factors in software reuse. IEEE Transactions on Software Engineering 28, 4 (2002), 340-357.
9. … scale software reuse: An industrial case study. In Proceedings of the International Conference on Software Reuse (Orlando, FL, 1996), 104-111.
10. Ring, P.S. and Van de Ven, A.H. Developmental processes of cooperative interorganizational relationships. Academy of Management Review 19, 1 (1994), 90-118.
11. … Development Projects. Comm. ACM 42, 2 (Feb. 1999), 80-86.
12. Szyperski, C. Component Software: Beyond Object-Oriented Programming. ACM Press, New York, 2002.
13. … software reuse: A case from banking. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, HI, 2007), IEEE Computer Society.
It doesn't help that many pointy-haired bosses (PHBs) say things like:
"… time."
"… real function is later."
"… software."
"… can decide later how to sandwich the changes in."
… from what many of us are doing today. This is not a good thing: when you hit the first bug, all the time you may have "saved" by ignoring the need to do maintenance will be gone.
General Electric designed a mainframe that it claimed would be sufficient for all the computer uses in Boston, and would never need to be shut down for repair or for software tweaks. The machine it eventually built wasn't nearly big enough, but it did succeed at running continuously without need for hardware or software changes.
Today, we have a network of computers provided by thousands of businesses, sufficient for everyone in at least North America, if not the world. Still, we must keep shutting down individual parts of the network to repair or change the software. We do so because we've forgotten how to do software maintenance.
Software maintenance is not like hardware maintenance, which is the return of the item to its original state. Software maintenance involves moving an item away from its original state. It encompasses all activities associated with the process of changing software. That includes everything associated with "bug fixes," functional and performance enhancements, providing backward compatibility, updating its algorithm, covering up hardware errors, creating user-interface access methods, and other cosmetic changes.
… an automobile expressway to a railroad bridge is considered maintenance—and it would be particularly valuable if you could do it without stopping the train traffic.
Is software something that can be maintained in this way? Yes, it is. So, why don't we?
Software maintenance is easiest and most effective when built into a system from the ground up.

The Four Horsemen of the Apocalypse

There are four approaches to software maintenance: traditional, never, discrete, and continuous—or, perhaps, war, famine, plague, and death. In any case, 3.5 of them are terrible ideas.

The traditional approach ("… project"). This one is easy: don't even think about the possibility of maintenance. Hard-code constants, avoid subroutines, use all global variables, use short and non-meaningful variable names. In other words, make it difficult to change any one thing without changing everything. Everyone knows examples of this approach—and the PHBs who thoughtlessly push you into it, usually because of schedule pressures.

Maintaining such software is like fighting a war. The enemy fights back! It particularly fights back when you have to change interfaces, and you find you've only changed some of the copies.

In the never approach, you decide upfront that maintenance will never occur. You simply write wonderful programs right from the start. This is actually credible in some embedded systems, which will be burned to ROM and never changed. Toasters, video games, and cruise missiles come to mind.

In theory, you have perfect specifications and interfaces, and never change them. Change only the implementation, and then only for bug fixes before the product is released. The code quality is wildly better than it is for the traditional approach, but never quite good enough to avoid change completely.

The specifications aren't quite good enough, so in practice the specification is frozen while it's still faulty. This is often because it cannot be validated, so you can't tell if it's faulty until too late. Then the specification is not adhered to when code is written, so you can't prove the program follows the specification, much less prove it's correct. So, you test until the program is late, and then ship. Some months later you replace it as a complete entity, by sending out new ROMs. This is the typical history of video games, washing machines, and embedded systems from the U.S. Department of Defense.

The discrete approach is the current state of practice: define hard-and-fast, highly configuration-controlled interfaces to elements of software, and regularly carry out massive all-at-once changes. Next, ship an entire new copy of the program, or a "patch" that silently replaces entire executables and libraries. (As we write this, a new copy of Open Office is asking us please to download it.)

This approach accepts (reluctantly) the fact of change, keeps a parts list and tools list on every item, allows only preauthorized changes under strict configuration control, and forces all servers'/users' changes to take place in one discrete step. In practice, the program is running in multiple places, and each must kick off its users, do the upgrade, and then let them back on again. Change happens more often and in more places than predicted, all the components of an …
The discrete approach stays alive (and, unfortunately, thriving) because of the time lag for authorization and the rebuild time for the system.
Even when official interfaces are controlled, unofficial interfaces proliferate; and with C and older languages, data structures are so available that even when change is desired, too many functions "know" that the structure has a particular layout. When you change the data structure, some program or library that you didn't even know existed starts to crash or return ENOTSUP. A mismatch between an older Linux kernel and a newer glibc once had getuid returning "Operation not supported," much to the surprise of the recipients.
In a networked world, it is completely unrealistic to expect that all users to whom an interface is visible will be able to change at the same time. The result is that single-step changes cannot happen: multiple change interrelationships conflict, networks mean multiple versions are simultaneously current, and owners/users want to control change dates.
The changes look like single steps, but they actually spread through a population of computers in a wave over time. This is often likened to a plague, and is every bit as popular.
Some organizations use the "never" approach to software maintenance against the vendors of these plagues: they build a known working configuration, then "freeze and forget." When an update is required, they build a completely new system from the ground up and freeze it. This works unless you get an urgent security patch, at which time you either ignore it or start a large unscheduled rebuild project.
The continuous approach to maintenance sounds like just running new code willy-nilly and watching what happens. We know at least one company that does just that: a newly logged-on user will unknowingly be running different code from everyone else. If it doesn't work, the user's system will either crash or be kicked off by the sysadmin, and the user will then have to log back on and repeat the work using the previous version.
    struct {
        unsigned short major; /* = 1 */
        unsigned short minor; /* = 0 */
    } version;
    unsigned part_no;
    unsigned quantity;
    struct location_t {
        char state[4];
        char city[8];
        unsigned warehouse;
        short area;
        short pigeonhole;
    } location;
    …
That is not our meaning of continuous. The real continuous approach comes from Multics, the machine that was never supposed to shut down and that used controlled, transparent change. The developers understood that the only constant is change, and that migration of hardware, software, and function during system operation is necessary. Therefore, the ability to change was designed in from the very beginning.
Multics programs were written to evolve as changes happen, using a weakly typed high-level language and, in older programs, a good macro assembler. No direct references are allowed to anything if they can be avoided. Every data structure is designed for expansion and is self-identifying as to version. Every code segment is made self-identifying by the compiler or other construction procedure. Code and data are changeable on a per-command, per-process, or per-system basis, and as few copies as possible of anything are kept, so that single copies can be dynamically updated as necessary.
The hard part is managing interface changes. Even in the Multics days, it was easy to forget to change every single instance of an interface. Today, with distributed programs, changing all possible copies of an interface at once is going to be insanely difficult, if not flat-out impossible.
BBN Technologies was the first company to perform continuous controlled change when they built the ARPANET backbone in 1969. They placed a 1-bit version number in every packet. If it changed from 0 to 1, it meant that the IMP (router) was to switch to a new version of its software and set the bit to 1 on every outgoing packet. This allowed the entire ARPANET to switch easily to new versions of the software without interrupting its operation. That was very important to the pre-TCP Internet, as it was quite experimental and suffered a considerable amount of change.
Multics did all of these good things, the most important of which was the discipline used with data structures: if an interface took more than one parameter, the parameters were placed in a structure with a version number. The caller set the version, and the recipient checked it. If the version was completely obsolete, the request was flatly rejected. If it was not quite current, it was processed differently, by being upgraded on input and probably downgraded on return.
This meant that multiple versions of a program or kernel module could exist simultaneously, while upgrades took place at the user's convenience. It also meant that upgrades could happen automatically and that multiple sites, multiple suppliers, and networks didn't cause problems.
One such data structure, from a U.S.-based warehousing company (translated to C from Multics PL/1), is illustrated in the accompanying box. The company bought a Canadian competitor and needed to add inter-country transfers, initially from three of its warehouses in border cities. This, in turn, required the state field to be split into two parts:

    char state_province[4];
The company incremented the version number from 1.0 to 2.0 and arranged for the server to support both types. New clients used version 2.0 structures and were able to ship to Canada. Old ones continued to use version 1.0 structures. When the server received a type 1 structure, it used an "updater" subroutine that copied the data into a type 2 structure and set the country code to U.S.
In an object-oriented language, you would add a new subclass with a constructor that supports a country code, and update your new clients to use it. The process is this:
1. …
2. Change the clients that run in … Now they can move items from U.S. to Canadian warehouses.
3. … Canadian locations needing to move stock.
4. Change all the other clients at their leisure.
There is never a need to stop the whole system, only the individual copies, and that can be done at the user's convenience. The change can be immediate, or can wait for a suitable time.
Once such a change has occurred, we simultaneously add a check to produce a server error message for anyone who accidentally uses an outdated U.S.-only version of the client. This check is a bit like the "can't happen" case in an else-if: it's done to identify impossibly out-of-date calls. It fails conspicuously, and the system administrators can then hunt down and replace the ancient version of the program. This also discourages the unwise from permanently deferring fixes to their programs, much like the coarse version numbers on entire programs in present practice.
This kind of fine-grain versioning is sometimes seen in more recent programs. Linkers are an example, as they read files containing numbered records, each of which identifies a particular kind of code or data. For example, a record number 7 might contain the information needed to link a subroutine call, containing items such as the name of the function to call and a space for an address. If the linker uses record types 1 through 34, and later needs to extend type 7 for a new compiler, it can create a type 35, use it for the new compiler, and schedule changes from type 7 to type 35 in all the other compilers, typically by announcing the date on which type 7 records will no longer be accepted.
The same is true of network protocols such as IBM SMB (Server Message Block), used for Windows networking. It has both protocol versions and packet types that can be used exactly the same way as the record types of a linker.
Object-oriented languages can do continuous controlled maintenance by creating new versions as subclasses of the same parent. This is a slightly odd use of a subclass, as the variations you create aren't necessarily meant to persist, but you can go back and clean out unneeded variants later, after they're no longer in use.
Where the entire client can be downloaded every time the program is run, change is possible without versioning at all. A larger client would need only a coarse versioning scheme, enough to allow it to be downloaded whenever it was out of date.
A variant of continuous maintenance exists in relational databases: one can always add columns to a relation, and there is a well-known value called null that stands for "no data." If the programs that use the database understand that any calculation with a null yields a null, then a new column can be added, programs changed to use it over some period of time, and the old column(s) filled with nulls. Once all the users of the old column are gone, as indicated by the column being null for some time, the old column can be dropped.
The same is true of a markup language such as SGML or XML, which can add or subtract attributes of a type at will. If you're careful to change the attribute name when the type changes, and if your XML processor understands that adding 3 to a null value is still null, you have an easy way to transfer and store mutating data.
During the last boom, (author) Collier-Brown's team needed to create a single front end to multiple back ends, under the usual insane time pressures. The front end passed a few parameters and a C structure to the back ends, and the structure repeatedly needed to be changed for one or another of the back ends as they were developed.
Although everything ran on the same machine, the team couldn't change the front and back ends simultaneously, because they would have been forced to stop everything they were doing and apply a structure change. Therefore, the team started using version numbers.
If a back end needed version 2.6 of the structure, it told the front end, which handed it the new one. If it could use only version 2.5, that's what it asked for. The team never had a "flag day" when all work stopped to apply an interface change. They could make those changes when they could schedule them.
The team did have to make the changes eventually, and they did just that, but they were able to make the changes when doing so wouldn't destroy their schedule. In an early precursor to test-directed design, they had a regression test that checked whether all the version numbers were up to date and warned them if updates were needed. The first time they avoided a flag day, they gained back the few hours expended preparing for change. By the 12th time, they were winning big.
Most importantly, investing time to prepare for change can save you and your management time in the most frantic of projects.
Related articles on queue.acm.org:
Kode Vicious: http://queue.acm.org/detail.cfm?id=1594861
John Mashey: http://queue.acm.org/detail.cfm?id=1165766
http://queue.acm.org/detail.cfm?id=1165764
Paul Stachour is a software engineer with experience in development, quality assurance, and process. One of his focal areas is how to create correct, reliable, functional software in effective and efficient ways in many programming languages. Most of his work has been with life-, safety-, and security-critical applications from his home base in the Twin Cities of Minnesota.

David Collier-Brown is a programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home in Toronto.