ACM Home Page
Please provide us with feedback. Feedback
Theoretical and empirical studies on using program mutation to test the functional correctness of programs
Full text PdfPdf (1.39 MB)
Source Annual Symposium on Principles of Programming Languages archive
Proceedings of the 7th ACM SIGPLAN-SIGACT symposium on Principles of programming languages table of contents
Las Vegas, Nevada
Pages: 220 - 233  
Year of Publication: 1980
ISBN:0-89791-011-7
Authors
Sponsors
SIGPLAN: ACM Special Interest Group on Programming Languages
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 55,   Citation Count: 19
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/567446.567468
What is a DOI?

ABSTRACT

In testing for program correctness, the standard approaches [11,13,21,22,23,24,34] have centered on finding data D, a finite subset of all possible inputs to program P, such that

1) if for all x in D, P(x) = f(x), then P* = f

where f is a partial recursive function that specifies the intended behavior of the program and P* is the function actually computed by program P. A major stumbling block in such formalizations has been that the conclusion of (1) is so strong that, except for trivial classes of programs, (1) is bound to be formally undecidable [23].

There is an undeniable tendency among practitioners to consider program testing an ad hoc human technique: one creates test data that intuitively seems to capture some aspect of the program, observes the program in execution on it, and then draws conclusions on the program's correctness based on the observations. To augment this undisciplined strategy, techniques have been proposed that yield quantitative information on the degree to which a program has been tested. (See Goodenough [14] for a recent survey.) Thus the tester is given an inductive basis for confidence that (1) holds for the particular application. Paralleling the undecidability of deductive testing methods, the inductive methods all have had trivial examples of failure [14,18,22,23].

These deductive and inductive approaches have had a common theme: all have aimed at the strong conclusion of (1). Program mutation [1,7,9,27], on the other hand, is a testing technique that aims at drawing a weaker, yet quite realistic, conclusion of the following nature:

(2) if for all x in D, P(x) = f(x), then P* = f OR P is "pathological."

To paraphrase,

3) if P is not pathological and P(x) = f(x) for all x in D then P* = f.

Below we will make precise what is meant by "P is pathological"; for now it suffices to say that P not pathological means that P was written by a competent programmer who had a good understanding of the task to be performed. Therefore if P does not realize f it is "close" to doing so. This underlying hypothesis of program mutation has become known as the competent programmer hypothesis: either P* = f or some program Q "close" to P has the property Q* = f.

To be more specific, program mutation is a testing method that proposes the following version of correctness testing:

Given that P was written by a competent programmer, find test data D for which P(D) = f(D) implies P* = f.

Our method of developing D, assuming either P or some program close to P is correct, is by eliminating the alternatives. Let &phis; be the set of programs close to P. We restate the method as follows:

Find test data D such that:

i) for all x in D P(x) = f(x) and

ii) for all Q in &phis; either Q* = P* or for some x in D, Q(x) ≠ P(x).

If test data D can be developed having properties (i) and (ii), then we say that D differentiates P from &phis;, alternatively P passes the &phis; mutant test.

The goal of this paper is to study, from both theoretical and experimental viewpoints, two basic questions:

Question 1: If P is written by a competent programmer and if P passes the &phis; mutant test with test data D, does P* = f?

Note that, after formally defining &phis; for P in a fixed programming language L, an affirmative answer to question 1 reduces to showing that the competent programmer hypothesis holds for this L and &phis;.

We have observed that under many natural definitions of &phis; there is often a strong coupling between members of &phis; and a small subset µ. That is, often one can reduce the problem of finding test data that differentiates P from &phis; to that of finding test data that differentiates P from µ. We will call this subset µ the mutants of P and the second question we will study involves the so-called coupling effect [9]:

Question 2 (Coupling Effect): If P passes the µ mutant test with data D, does P pass the &phis; mutant test with data D?

Intuitively, one can think of µ as representing the programs that are "very close" to P.

In the next section we will present two types of theoretical results concerning the two questions above: general results expressed in terms of properties of the language class L, and specific results for a class of decision table programs and for a subset of LISP. Portions of the work on decision tables and LISP have appeared elsewhere [5,6], but the presentations given here are both simpler and more unified. In the final section we present a system for applying program mutation to FORTRAN and we introduce a new type of software experiment, called a "beat the system" experiment, for evaluating how well our system approximates an affirmative response to the program mutation questions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. T. Acree, R. A. DeMillo, T. A. Budd, R. J. Lipton, and F. G. Sayward. "Mutation analysis." Technical Report GIT-ICS-79/08, Georgia Institute of Technology, 1979.
2
3
4
 
5
Timothy A. Budd and Richard J. Lipton. "Mutation analysis of decision table programs." Proceedings of the 1978 Conference on Information Sciences and Systems, pages 346-349. The Johns Hopkins University, 1978.
 
6
Timothy A. Budd and Richard J. Lipton. "Proving LISP programs using test data." Digest for the workshop on Software Testing and Test Documentation, pages 374-403, 1978.
 
7
 
8
H. Y. Chang. Fault Diagnosis of Digital Systems. Wiley-Interscience, 1970.
 
9
Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward. "Hints on test data selection: Help for the practicing programmer." Computer 11(4):34-43, April 1978.
 
10
K. Foster. "Error sensitive test cases." Digest for the Workshop on Software Testing and Test Documentation, pages 206-225, 1978.
11
 
12
Susan L. Gerhart and Lawrence Yelowitz. "Observations of fallibility in applications of modern programming methodologies." IEEE Transactions on Software Engineering SE-2(3) : 195-207, September 1976.
 
13
John B. Goodenough and S. L. Gerhart. "Towards a theory of test data selection." IEEE Transactions on Software Engineering SE-1 (2):156-173, June 1975.
 
14
J. Goodenough. "A survey of program testing issues." In P. Wegner, editor, Research Directions in Software Technology, pages 316-340. MIT Press, 1979.
 
15
John D. Gould and Paul Drongowski. "An exploratory study of computer program debugging." Human Factors 16(3):258-277, May 1974.
 
16
Richard Hamlet. "Critique of reliability theory. Digest for the Workshop on Software Testing and Test Documentation, pages 57-69, 1978.
 
17
Steven Hardy. "Synthesis of LISP programs from examples." Proceedings of the Fourth International Joint Conference on Artificial Intelligence, pages 240-245. Held in Tbilisi, Georgia, USSR, 1975.
 
18
P. Henderson and R. Snowden. "An experiment in structured programming." BIT 12:38-53, 1972.
19
 
20
 
21
William E. Howden. "Methodology for the generation of program test data." IEEE Transactions on Computers C-24(5):554-560, May 1975.
 
22
William E. Howden. "An evaluation of the effectiveness of symbolic testing." Software: Practice and Experience 8:381-397, 1978.
 
23
William E. Howden. "Reliability of the path analysis testing strategy." IEEE Transactions on Software Engineering SE-2(3):208-214, September 1976.
24
 
25
International Business Machines. System/360 Scientific Subroutine Package. IBM Application Program H20-0205-3, 1966.
 
26
H. Ledgart. "The case for structured programming." BIT 13:45-57, 1973.
 
27
R. J. Lipton and F. G. Sayward. "The status of research on program mutation." Digest for the Workshop on Software Testing and Test Documentation, pages 355-378, 1978.
 
28
M. Montalbano. Decision Tables. Science Research Associates, 1974.
 
29
P. Naur. "Programming by action clusters." BIT 9:250-258, 1969.
 
30
L. J. Osterweil and L. D. Fosdick. "Experiences with DAVE -- A FORTRAN program analyzer." Proceedings of the 1978 AFIP National Computer Conference, pages 909-915, 1978.
 
31
S. L. Pollack, H. T. Hicks, and W. J. Harrison. Decision Tables: Theory and Practice. John Wiley and Sons, 1971.
 
32
D. E. Shaw, W. K. Swartout, and C. C. Green. "Inferring LISP programs from examples." Proceedings of the Fourth International Joint Conference on Artificial Intelligence, pages 260-267. Held in Tbilisi, Georgia, USSR, 1975.
 
33
Philip Dale Summers. Program Construction from Examples. PhD thesis, Yale University, 1975.
 
34
L. J. White, E. I. Cohen, and B. Chandrasekaran. A Domain Strategy for Computer Program Testing. Technical Report OSU-CISRC-TR-78-4, Ohio State University, 1978.
35

CITED BY  19
Collaborative Colleagues:
Timothy A. Budd: colleagues
Richard A. DeMillo: colleagues
Richard J. Lipton: colleagues
Frederick G. Sayward: colleagues