|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ABSTRACT
In testing for program correctness, the standard approaches [11,13,21,22,23,24,34] have centered on finding data D, a finite subset of all possible inputs to program P, such that 1) if for all x in D, P(x) = f(x), then P* = f where f is a partial recursive function that specifies the intended behavior of the program and P* is the function actually computed by program P. A major stumbling block in such formalizations has been that the conclusion of (1) is so strong that, except for trivial classes of programs, (1) is bound to be formally undecidable [23]. There is an undeniable tendency among practitioners to consider program testing an ad hoc human technique: one creates test data that intuitively seems to capture some aspect of the program, observes the program in execution on it, and then draws conclusions on the program's correctness based on the observations. To augment this undisciplined strategy, techniques have been proposed that yield quantitative information on the degree to which a program has been tested. (See Goodenough [14] for a recent survey.) Thus the tester is given an inductive basis for confidence that (1) holds for the particular application. Paralleling the undecidability of deductive testing methods, the inductive methods all have had trivial examples of failure [14,18,22,23]. These deductive and inductive approaches have had a common theme: all have aimed at the strong conclusion of (1). Program mutation [1,7,9,27], on the other hand, is a testing technique that aims at drawing a weaker, yet quite realistic, conclusion of the following nature: (2) if for all x in D, P(x) = f(x), then P* = f OR P is "pathological." To paraphrase, 3) if P is not pathological and P(x) = f(x) for all x in D then P* = f. Below we will make precise what is meant by "P is pathological"; for now it suffices to say that P not pathological means that P was written by a competent programmer who had a good understanding of the task to be performed. Therefore if P does not realize f it is "close" to doing so. This underlying hypothesis of program mutation has become known as the competent programmer hypothesis: either P* = f or some program Q "close" to P has the property Q* = f. To be more specific, program mutation is a testing method that proposes the following version of correctness testing: Given that P was written by a competent programmer, find test data D for which P(D) = f(D) implies P* = f. Our method of developing D, assuming either P or some program close to P is correct, is by eliminating the alternatives. Let &phis; be the set of programs close to P. We restate the method as follows: Find test data D such that: i) for all x in D P(x) = f(x) and ii) for all Q in &phis; either Q* = P* or for some x in D, Q(x) ≠ P(x). If test data D can be developed having properties (i) and (ii), then we say that D differentiates P from &phis;, alternatively P passes the &phis; mutant test. The goal of this paper is to study, from both theoretical and experimental viewpoints, two basic questions: Question 1: If P is written by a competent programmer and if P passes the &phis; mutant test with test data D, does P* = f? Note that, after formally defining &phis; for P in a fixed programming language L, an affirmative answer to question 1 reduces to showing that the competent programmer hypothesis holds for this L and &phis;. We have observed that under many natural definitions of &phis; there is often a strong coupling between members of &phis; and a small subset µ. That is, often one can reduce the problem of finding test data that differentiates P from &phis; to that of finding test data that differentiates P from µ. We will call this subset µ the mutants of P and the second question we will study involves the so-called coupling effect [9]: Question 2 (Coupling Effect): If P passes the µ mutant test with data D, does P pass the &phis; mutant test with data D? Intuitively, one can think of µ as representing the programs that are "very close" to P. In the next section we will present two types of theoretical results concerning the two questions above: general results expressed in terms of properties of the language class L, and specific results for a class of decision table programs and for a subset of LISP. Portions of the work on decision tables and LISP have appeared elsewhere [5,6], but the presentations given here are both simpler and more unified. In the final section we present a system for applying program mutation to FORTRAN and we introduce a new type of software experiment, called a "beat the system" experiment, for evaluating how well our system approximates an affirmative response to the program mutation questions. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
Collaborative Colleagues:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||