Notes taken from Glenford Myers's The Art of Software Testing.

One of the most difficult questions to answer when testing a program is determining when to stop, since there is no way of knowing if the error just detected is the last remaining error. In fact, in anything but a small program, it is unreasonable to expect that all errors will eventually be detected. Given this dilemma, and given the fact that economics dictate that testing must eventually terminate, one might wonder if the question has to be answered in a purely arbitrary way, or if there are some useful stopping criteria.

The completion criteria typically used in practice are both meaningless and counterproductive. The two most common criteria are:

  1. Stop when the scheduled time for testing expires.
  2. Stop when all the test cases execute without detecting errors (i.e., when the test cases are unsuccessful).

The first criterion is useless because it can be satisfied by doing absolutely nothing (i.e., it does not measure the quality of the testing). The second criterion is equally useless because it also is independent of the quality of the test cases. Furthermore, it is counterproductive because it subconsciously encourages one to write test cases that have a low probability of detecting errors. As discussed in Chapter 2, human beings are highly goal-oriented; if one is told that he has finished a task when his test cases are unsuccessful, he will subconsciously write test cases that lead him to this goal, avoiding the useful, high-yield, destructive test cases.

There are three categories of more-useful criteria. The first category, but not the best, is to base completion on the use of specific test-case-design methodologies. For instance, one might define the completion of module testing as the following:

The test cases are derived from (1) satisfying the multicondition-coverage criterion and (2) a boundary-value analysis of the module interface specification, and all resultant test cases are eventually unsuccessful.

One might define the function test as being complete when the following conditions are satisfied:

The test cases are derived from (1) cause-effect graphing, (2) boundary-value analysis, and (3) error guessing, and all resultant test cases are eventually unsuccessful.

Although this type of criterion is superior to the two mentioned earlier, it has three problems. First, it is not helpful in a test phase in which specific methodologies are not available, such as the system-test phase. Second, it is a subjective measurement, since there is no way to guarantee that a person has used a particular methodology (e.g., boundary-value analysis) properly and rigorously. Third, rather than giving one a goal and letting him choose the most appropriate way of achieving it, it does the opposite; the test-case-design methodologies are dictated, but no goal is given. Hence this type of criterion is useful sometimes for some testing phases, but it should only be applied when the tester has proven his or her abilities in the past in applying the test-case-design methodologies successfully.

The second category of criteria, perhaps the most valuable one, is to state the completion requirements in positive terms. Since the goal of testing is to find errors, why not make the completion criterion be the detection of some predefined number of errors? For instance, one might state that a module test of a particular module is not complete until 3 errors are discovered. Perhaps the completion criterion for a system test should be defined as the detection and repair of 70 errors or an elapsed time of 3 months, whichever comes later.

Notice that this type of criterion reinforces the definition of testing. It does have two problems, both of which are surmountable. One problem is determining how to obtain the number of errors to be detected. Obtaining this number requires the following:

  1. An estimate of the total number of errors in the program.
  2. An estimate of what percentage of these errors can be feasibly found through testing.
  3. Estimates of what fraction of the errors originated in particular design processes, and during what testing phases these errors are likely to be detected.

A rough estimate of the total number of errors can be obtained in several ways. One method is to draw on experience with previous programs. Also, a variety of predictive models exist (e.g., reference 1, chapter 18). Some of these models require one to test the program for some period of time, record the elapsed times between the detection of successive errors, and insert these times into parameters in a formula. Other models involve the seeding of known, but unpublicized, errors into the program, testing the program for a while, and then examining the ratio of detected seeded errors to detected unseeded errors. Another model employs two independent test teams who test for a while, examine the errors found by each and the errors detected in common by both teams, and use these parameters to estimate the total number of errors. Another gross method to obtain this estimate is to use industry-wide averages. For instance, the number of errors that exist in typical programs at the time that coding is completed (before a code walkthrough or inspection is employed) is approximately 4 to 8 errors per 100 program statements.
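The seeding, two-team, and industry-average estimates described above can be sketched in a few lines. The notes do not give formulas, so the ones below are assumptions: the seeding estimate assumes seeded and original errors are equally likely to be found, and the two-team estimate uses the standard capture-recapture ratio. All function and parameter names are hypothetical.

```python
def estimate_from_seeding(seeded_total, seeded_found, unseeded_found):
    """Estimate the number of original (unseeded) errors in the program.

    Assumption: testing finds seeded and original errors at the same rate,
    so unseeded_found / total_unseeded == seeded_found / seeded_total.
    """
    if seeded_found == 0:
        raise ValueError("no seeded errors found yet; the estimate is undefined")
    return unseeded_found * seeded_total / seeded_found


def estimate_from_two_teams(found_by_a, found_by_b, found_by_both):
    """Estimate the total number of errors from two independent test teams.

    Assumption: capture-recapture reasoning; the share of team A's errors
    that team B also found approximates team B's overall detection rate.
    """
    if found_by_both == 0:
        raise ValueError("no errors found in common; the estimate is undefined")
    return found_by_a * found_by_b / found_by_both


def estimate_from_density(statements, errors_per_100_statements):
    """Estimate total errors from an average error density."""
    return statements * errors_per_100_statements / 100


# Illustrative numbers only (not taken from the notes).
print(estimate_from_seeding(seeded_total=20, seeded_found=10, unseeded_found=40))
print(estimate_from_two_teams(found_by_a=30, found_by_b=25, found_by_both=15))
print(estimate_from_density(10_000, 4), estimate_from_density(10_000, 8))
```

Each estimate is only as good as its assumption; comparing two or three of them is a reasonable sanity check before committing to a target number.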

Estimate 2 (the percentage of errors that can be feasibly found through testing) involves a somewhat arbitrary guess, taking into consideration the nature of the program and the consequences of undetected errors.

Given the current paucity of information about how and when errors are made, estimate 3 is the most difficult. The data that exist indicate that, in large programs, approximately 40% of the errors are coding and logic-design mistakes, and the remainder are generated in the earlier design processes.

Although the reader, to use this criterion, must develop his or her own estimates that are pertinent to the program at hand, a simple example is presented here. Assume we are about to begin testing a 10,000-statement program, the number of errors remaining after code inspections are performed is estimated at five per 100 statements, and we establish, as an objective, the detection of 98% of the coding and logic-design errors and 95% of the design errors. The total number of errors is thus estimated as 500. Of the 500 errors, we assume that 200 are coding and logic-design errors, and 300 are design flaws. Hence the goal is to find 196 coding and logic-design errors and 285 design errors. A plausible estimate of when the errors are likely to be detected is shown in Table 1.
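The arithmetic in this example can be reproduced with a short sketch. The function name and parameter names are hypothetical; the input values are the ones stated in the example, with the 40% coding and logic-design share taken from the data cited earlier.

```python
def error_goals(statements, errors_per_100, coding_share_pct,
                coding_target_pct, design_target_pct):
    """Return estimated error counts and detection goals as whole numbers."""
    total = statements * errors_per_100 // 100
    coding = total * coding_share_pct // 100      # coding and logic-design errors
    design = total - coding                       # errors from earlier design processes
    return {
        "total": total,
        "coding": coding,
        "design": design,
        "coding_goal": coding * coding_target_pct // 100,
        "design_goal": design * design_target_pct // 100,
    }


goals = error_goals(statements=10_000, errors_per_100=5, coding_share_pct=40,
                    coding_target_pct=98, design_target_pct=95)
print(goals)
# {'total': 500, 'coding': 200, 'design': 300, 'coding_goal': 196, 'design_goal': 285}
```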

Table 1: Hypothetical Estimate of When the Errors Might Be Found

                 Coding and logic-design errors    Design errors
Module test      65%                               0%
Function test    30%                               60%
System test      3%                                35%
Total            98%                               95%

If we have scheduled four months for function testing and three months for system testing, the completion criteria might be established as

  1. Module testing is complete when 130 errors are found and corrected (65% of the estimated 200 coding and logic-design errors).
  2. Function testing is complete when 240 errors (30% of 200 plus 60% of 300) are found and corrected, or when four months of function testing have expired, whichever occurs later. (The reason for the second clause is that if we find 240 errors quickly, this is probably an indication that we have underestimated the total number of errors and thus should not stop function testing early.)
  3. System testing is complete when 111 errors (3% of 200 plus 35% of 300) are found and corrected, or when three months of system testing have expired, whichever occurs later.
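The three error quotas follow directly from the Table 1 percentages applied to the estimated 200 coding and logic-design errors and 300 design errors. A minimal sketch of that calculation, with hypothetical names:

```python
# Table 1 percentages: phase -> (coding and logic-design %, design %)
TABLE_1 = {
    "module": (65, 0),
    "function": (30, 60),
    "system": (3, 35),
}


def phase_quota(phase, coding_errors=200, design_errors=300):
    """Number of errors to find and correct in the given test phase."""
    coding_pct, design_pct = TABLE_1[phase]
    return (coding_errors * coding_pct + design_errors * design_pct) // 100


for phase in TABLE_1:
    print(phase, phase_quota(phase))
# module 130
# function 240
# system 111
```

The quotas sum to 481, which matches the overall goal of 196 coding and logic-design errors plus 285 design errors.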

The other obvious problem with this type of criterion is one of overestimation. What if, in the above example, there are fewer than 240 errors remaining when function test starts? Based on the criterion, we could never complete the function-test phase. This is a strange problem if one thinks about it. Our "problem" is that we do not have enough errors; the program is too good. One could label it a nonproblem because it is the kind of problem that a lot of people would love to have. If it does occur, a bit of common sense can solve it. If we cannot find 240 errors in four months, the project manager can employ an outsider to analyze the test cases to judge whether the problem is (1) inadequate test cases or (2) excellent test cases but a lack of errors to detect.

The third type of completion criterion is an easy one on the surface, but it involves a lot of judgment and intuition. It requires one to plot the number of errors found per unit time during the test phase. By examining the shape of the curve, one can often determine whether to continue the test phase or end it and begin the next test phase.

Suppose a program is being function tested and the number of errors found per week is being plotted. If, in the seventh week, the curve is the left one of Figure 1, it would be imprudent to stop the function test, even if we had reached our criterion for the number of errors to be found. Since, in the seventh week, we still seem to be in high gear (finding many errors), the wisest decision (remembering that our goal is to find errors) is to continue function testing, designing additional test cases if necessary.


Figure 1: Estimating completion by plotting errors detected per unit time.

On the other hand, suppose the curve is the right one in Figure 1. The error-detection efficiency has dropped significantly, implying that we have perhaps picked the function-test bone clean and that perhaps the best move is to terminate function testing and begin a new type of testing (e.g., system test). (Of course, we must also consider other factors such as whether the drop in error-detection efficiency was due to a lack of computer time or exhaustion of the available test cases.)
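Reading the curve is a matter of judgment, but a crude numeric check can flag when the question is worth asking. The sketch below is only one possible heuristic, not a method from the notes: it compares the average of the most recent weeks with the peak weekly count. The function name, the window size, the 50% threshold, and the weekly counts are all invented for illustration.

```python
def rate_has_dropped(errors_per_week, window=3, drop_ratio=0.5):
    """Return True if recent error detection has fallen well below its peak.

    Assumption: "dropped significantly" means the mean of the last `window`
    weeks is at most `drop_ratio` times the highest weekly count so far.
    """
    if len(errors_per_week) < window:
        return False  # too little data to judge
    recent_avg = sum(errors_per_week[-window:]) / window
    peak = max(errors_per_week)
    return peak > 0 and recent_avg <= drop_ratio * peak


still_climbing = [5, 9, 14, 18, 22, 25, 27]   # hypothetical "left curve"
tailing_off = [5, 14, 25, 27, 12, 6, 3]       # hypothetical "right curve"
print(rate_has_dropped(still_climbing))  # False
print(rate_has_dropped(tailing_off))     # True
```

A True result is a prompt to investigate, not a verdict: the drop may reflect a lack of computer time or exhausted test cases rather than a lack of remaining errors.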

Figure 2 is an illustration of what happens when one fails to plot the number of errors being detected. The graph represents three testing phases of an extremely large software system; it was drawn as part of a post-mortem study of the project. An obvious conclusion is that the project should not have switched to a different testing phase after period 6. During period 6, the error-detection rate was good (to a tester, the higher the rate, the better), but switching to a second phase at this point caused the error-detection rate to drop significantly.


Figure 2: Post-mortem study of the testing processes of a large project.

The best completion criterion is probably a combination of the three types discussed above. For the module test, particularly because most projects do not formally track detected errors during this phase, the best completion criterion is probably the first; one should request that a particular set of test-case-design methodologies be used. For the function- and system-test phases, the completion rule might be to stop when a predefined number of errors are detected, or when the scheduled time has elapsed, whichever comes later, but provided that an analysis of the errors-versus-time graph indicates that the test has become unproductive.
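The combined rule for the function- and system-test phases can be written as a single predicate. This is a sketch under one reading of the rule: "whichever comes later" means both the error quota and the schedule must be satisfied, and the graph analysis is supplied as a judgment made elsewhere. All names are hypothetical.

```python
def phase_complete(errors_found, error_quota, weeks_elapsed, weeks_scheduled,
                   test_unproductive):
    """Decide whether a function- or system-test phase may stop.

    test_unproductive: the tester's reading of the errors-versus-time graph
    (True when the detection rate has clearly tailed off).
    """
    quota_met = errors_found >= error_quota
    schedule_elapsed = weeks_elapsed >= weeks_scheduled
    # "Whichever comes later" requires both; the graph must also agree.
    return quota_met and schedule_elapsed and test_unproductive


# Quota reached early: keep testing until the schedule has run out.
print(phase_complete(240, 240, 9, 16, test_unproductive=True))    # False
# Quota and schedule met, but errors are still arriving quickly: continue.
print(phase_complete(260, 240, 16, 16, test_unproductive=False))  # False
# All three conditions hold: the phase can end.
print(phase_complete(260, 240, 17, 16, test_unproductive=True))   # True
```

If the quota is never reached, this predicate never returns True, which is the overestimation problem discussed earlier; that case calls for an outside review of the test cases rather than a change to the rule.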