1. Introduction

A test data adequacy criterion is a set of rules that prescribes some property for test-sets. Two broad categories of test criteria are data-flow and control-flow criteria. Data-flow criteria prescribe rules for test-sets based on the definitions of data and their uses in the program. As the objective of this paper is to explore control-flow testing criteria, we expand only on that aspect.

Acknowledgement: This research has benefited from participation in and discussions with colleagues on the UK EPSRC FORTEST Network on formal methods and testing (http://www.fortest.org.uk). Financial support for travel is gratefully acknowledged (EPSRC Grant no. GR/R43150/01).

Control-flow test criteria [15, 21, 33] are meant to challenge the logical decisions made by the program with test-cases based on the structure and logic of the design and source code. The decisions often consist of one or more atomic conditions connected by logical operators. Control-flow test criteria check these Boolean decisions of the program against the specification and ensure that program statements and decisions are fully exercised by code execution. The effectiveness of a test criterion [4, 5, 6, 7, 8, 10, 28, 34] is the probability that a test-set selected randomly from the set of all test-sets satisfying the criterion will expose an error. Consider the set of all test-sets that satisfy a given criterion, C, for a given program and specification. It may be the case that some of these test-sets expose an error, while others do not. If a large percentage of the C-adequate test-sets expose an error, then C is an effective criterion for the given program [5]. The earlier empirical studies [18, 19, 27, 30] have considered only the average effectiveness based on random selection of test-sets. However, there is often a large variation in the fault detection effectiveness of different test-sets, with highly and poorly detecting test-sets being equally likely to be selected. It is therefore not sufficient to consider only the average effectiveness; the variation in effectiveness over all the test-sets must also be considered. The example in the following section illustrates such a situation.

1.1. An Example

Consider, for example, the following decision, D:

    C ∧ A ∨ ¬E ∨ ¬C ∨ B ∧ (A ∨ D)

where A, B, C, D and E are atomic Boolean conditions. Let us assume that the implementation of the above decision includes a mistake in which one occurrence of an atomic condition is replaced by its negation, i.e. X is substituted by ¬X or vice versa. There can be seven such faulty decisions, since there are seven occurrences of conditions in the decision D. The well-known, industrially used criterion Decision Coverage (DC) requires that the test-set exercise both the true and the false branch. Figure 1 shows the effectiveness of all possible DC test-sets. The total number of DC test-sets is 87 (= 29 × 3), since 29 of the 32 (= 2^5) entries in the truth table of D are true and the remaining 3 are false.

Figure 1: Distribution of effectiveness for example decision D (no. of test-sets vs. no. of faults, 0-7)

It is evident from the above figure that the fault detection effectiveness of a randomly selected DC test-set for the decision D will usually lie between 45% (≈ 3/7) and 71% (≈ 5/7), as the test-sets detecting a number of faults in this range are equally and most likely to be selected (each with probability ≈ 0.3). It can further be noticed that the maximum effectiveness can be as high as 85%. However, assuming that all DC test-sets have equal probability of being selected (i.e., the test-set selection algorithm is unbiased), the probability of choosing a test-set with maximum effectiveness is very low (0.03).
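The distribution in Figure 1 can be reproduced by brute force. The sketch below, written in Python rather than the authors' Java tool, enumerates all 87 DC test-sets for D and counts how many of the seven variable-negation mutants each one distinguishes; the occurrence ordering and helper names are our own:

```python
from itertools import product

# Decision D from section 1.1; "Dv" names the atomic condition D.
# Precedence: ∧ binds tighter than ∨.
def spec(A, B, C, Dv, E):
    return (C and A) or (not E) or (not C) or (B and (A or Dv))

def vnf_mutant(i):
    """Negate the i-th occurrence of a condition in D (occurrences in
    left-to-right order: C, A, E, C, B, A, Dv)."""
    def faulty(A, B, C, Dv, E):
        occ = [C, A, E, C, B, A, Dv]
        occ[i] = not occ[i]
        c1, a1, e1, c2, b1, a2, d1 = occ
        return (c1 and a1) or (not e1) or (not c2) or (b1 and (a2 or d1))
    return faulty

points = list(product([False, True], repeat=5))
true_pts = [p for p in points if spec(*p)]
false_pts = [p for p in points if not spec(*p)]
print(len(true_pts), len(false_pts))   # 29 true rows, 3 false rows

# Every DC test-set pairs one true point with one false point.
dc_sets = [(t, f) for t in true_pts for f in false_pts]
print(len(dc_sets))                    # 29 * 3 = 87 DC-adequate test-sets

def faults_detected(ts):
    return sum(1 for i in range(7)
               if any(vnf_mutant(i)(*p) != spec(*p) for p in ts))

dist = [faults_detected(ts) for ts in dc_sets]
print(min(dist), max(dist))
```

Since each DC test-set here is just one true point paired with one false point, the 29 × 3 = 87 count follows directly from the truth table.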

1.2. Research Objectives

The research issue addressed by this empirical study is how good a randomly selected test-set satisfying a given test criterion is at exposing errors. What makes this question difficult to answer is that, for a given program and test adequacy criterion, there is typically a very large number of adequate test-sets [5]. If the program is incorrect, then usually some of these test-sets expose an error while others do not. The earlier empirical studies [18, 19, 27, 30] of this nature had the following limitations: (a) they used a relatively small number of Boolean decisions, and (b) test-sets were selected at random from the large sample space. We address the first problem by generating a large number of Boolean decisions with a varying number of conditions (3-12). The second problem is addressed by considering all the test-sets instead of selecting only some. Of course, this requires the use of abstraction. The abstractions made are justified for the following reasons: (a) the scope of the study is limited to control-flow testing

criteria, which are intrinsically designed to test the decisions made by the program at its control points, and (b) even if more than one fault is present, at the same or at different control points, there is still a very high likelihood of finding all the faults. Research into the 'fault coupling effect', as demonstrated in [16], supports the hypothesis that complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will detect a high percentage of the complex faults.

We considered Boolean specifications as the subject for our experiments. Faults were introduced one at a time to generate a faulty specification representing the implementation. The main objectives of the study can be summarized as follows:

• To study and compare industrially used control-flow test criteria, namely DC, Full Predicate Coverage (FPC) and Modified Condition/Decision Coverage (MC/DC). For example, the avionics industry standard makes MC/DC test-sets a mandatory requirement [23], and most testing tools use DC as a testing criterion.

• To study the variation in effectiveness by exhaustively generating all the test-sets satisfying a criterion. Earlier work concentrated on choosing some test-sets according to a given criterion, which introduces a certain degree of inconsistency into the various results [9].

• To establish a connection between fault detection effectiveness and control-flow testing criteria. Since the considered criteria are widely used in industry and there are tools [25] which automatically generate test data for a given criterion, it can then be guaranteed that certain types of faults are not present in the implementation.

The rest of the paper is organized as follows. The next section introduces the theoretical background. Section 3 discusses related work. The experimental design and the analysis of results are presented in sections 4 and 5, respectively. Section 6 presents the conclusions of the study.

2. Theoretical Underpinning

Let P be a program and S the specification that P is intended to satisfy. A program P is said to be correct on input x if P(x) = S(x), where P(x) denotes the value computed by P on input x and S(x) denotes the intended output for x. Otherwise, P is said to fail, with x as the failure-causing input. A failure is any deviation of the actual output from the specified output for a given input. A fault in a program is a defect which can result in a failure.

A test-case is a set of inputs and a starting state. A test-set is a set of test-cases. If a test-case is a failure-causing input for a program, then the program is said to fail on that test-case, and such a test-case is said to expose a fault. A test-set, T, is said to detect a fault if it includes at least one failure-causing test-case.

A test criterion can be viewed either as a generator or as an acceptor [35]; the two views are mathematically equivalent. Used as a generator, a criterion can be considered a function C : P × S → 2^T. Given a program and specification, the criterion generates the set of test-sets that satisfy it; in other words, the criterion is used as a rule for selecting the test-cases that make up a test-set.

Control-flow testing criteria are based on the statements and Boolean decision points in the program. Boolean decisions consist of conditions connected by logical operators. Conditions are atomic predicates, each either a Boolean variable or a simple relational expression. The following notation is used throughout the paper: ∧, ∨ and ¬ represent the 'and', 'or' and 'not' Boolean operations, respectively, and T and F denote 'true' and 'false'.

A Boolean formula can be represented in various formats, such as Conjunctive Normal Form (CNF, also known as product-of-sums) and Disjunctive Normal Form (DNF, sum-of-products). As these normal forms need not be unique, there are also canonical representations which are unique for a given Boolean formula. One such canonical form is prime-DNF (PDNF), which is unique up to commutativity. Consider a given Boolean decision D and an elementary conjunction K (i.e. K ≡ ⋀_i X_i, where each X_i is either a condition or its negation). K is an implicant of D if K implies D. An implicant K is said to be a prime implicant of D if there is no other implicant K′ of D such that K ∨ K′ = K′. The PDNF form of a given Boolean decision D is the sum of all prime implicants of D. A condition in a Boolean decision D is said to be redundant if it does not appear in the PDNF representation of D. For example, B is redundant in the decision D ≡ A ∨ A ∧ B, since the PDNF representation of D is A.
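The redundancy check can be performed without computing the PDNF at all: a condition is absent from every prime implicant exactly when the decision's value never depends on it. A minimal brute-force sketch (our own illustration, not the paper's tool):

```python
from itertools import product

def is_redundant(decision, n, i):
    """A condition is redundant iff the decision's value never depends
    on it, i.e. it appears in no prime implicant of the decision."""
    # Vary condition i while holding the other n-1 conditions fixed.
    for rest in product([False, True], repeat=n - 1):
        hi = rest[:i] + (True,) + rest[i:]
        lo = rest[:i] + (False,) + rest[i:]
        if decision(*hi) != decision(*lo):
            return False
    return True

# Example from the text: D = A ∨ (A ∧ B); PDNF(D) = A, so B is redundant.
D = lambda A, B: A or (A and B)
print(is_redundant(D, 2, 0))  # A: False (not redundant)
print(is_redundant(D, 2, 1))  # B: True  (redundant)
```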

2.1. Control-Flow Test Criteria

In this section, we review some existing control-flow test criteria [15, 21]. The most basic control-flow testing criterion is Statement Coverage (SC), which requires that the test data execute every statement of the program at least once. This is evidently very weak, as it ignores the branching structure of the design and implementation. The Decision Coverage (DC) criterion requires, in addition to SC, that every possible branch be executed at least once. In other words, the Boolean

decisions that appear in the program must take both T and F values. The DC criterion treats a decision as a single node in the program structure, regardless of the complexity of the Boolean expression and the dependencies between the conditions constituting the decision. For critical real-time applications, where over half of the executable statements may involve Boolean expressions, the complexity of the expressions is of concern. Other criteria have therefore been developed to address the complexity of Boolean decisions.

The Full Predicate Coverage (FPC) criterion improves on DC by taking into account the values of both the decision and its conditions. It requires that every statement in the program be executed at least once and that each condition in a decision take all possible outcomes where the value of the decision is directly correlated with the value of the condition [17]. This means that the test-set must include test-cases such that the value of the decision differs when the condition changes.

Although FPC checks the conditions that appear in a Boolean decision, it ignores the direct dependency and influence of individual conditions on the decision. The Modified Condition/Decision Coverage (MC/DC) criterion [2, 3, 11, 12, 31], which is a mandatory requirement for testing avionics software [23], addresses this drawback and is defined as follows: every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken on all possible outcomes at least once, and each condition has been shown to independently affect the decision's outcome. A condition is shown to independently affect a decision's outcome by varying just that condition while holding all other conditions fixed. The MC/DC criterion requires at least one pair of test-cases for every condition.
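The "independent effect" requirement on a pair of test-cases can be stated directly in code. The following sketch (illustrative names, not from the paper) checks whether a pair of test-cases demonstrates the independent effect of one condition:

```python
def shows_independence(decision, n, i, t1, t2):
    """Check whether the test-case pair (t1, t2) demonstrates the
    independent effect of condition i: only condition i differs
    between the two test-cases, and the decision outcome differs."""
    others_fixed = all(t1[j] == t2[j] for j in range(n) if j != i)
    return (others_fixed and t1[i] != t2[i]
            and decision(*t1) != decision(*t2))

# Illustration on D = A ∧ (B ∨ C):
D = lambda A, B, C: A and (B or C)
print(shows_independence(D, 3, 0, (True, True, False), (False, True, False)))  # True
print(shows_independence(D, 3, 1, (True, True, False), (True, False, False)))  # True
```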
One of the goals in defining the above criteria is to minimize the number of test-cases in a test-set while keeping effectiveness as high as possible. Completely verifying a decision with n conditions requires 2^n test-cases, so it is computationally expensive or infeasible to test all possible combinations as the number of conditions and decisions in the program grows. In this sense, MC/DC provides a practical solution that requires only a linear number of test-cases in terms of n. It can be observed that DC requires only two test-cases per test-set, whereas the other techniques may require more. To ensure that it is the test-set property, and not the test-set size, that influences effectiveness, we included a variant of the DC criterion, Decision Coverage/Random (DC/R), which requires the test-set to satisfy DC and the average test-set size to be approximately equal to the size of the test-sets generated using the other studied criteria. The extra test-cases added to DC/R test-sets were generated at random. This approach of augmenting test-sets has been analysed in [1].

Table 1 summarises the minimum and maximum number of test-cases in a test-set that can satisfy each criterion, and shows whether the criterion takes the decision and condition outcomes into account.

    Criterion   Min. no.      Max. no.      Decision   Condition
                test-cases    test-cases    outcome    outcome
    DC          2             2             √          ×
    DC/R        n+1           2n            √          ×
    FPC         2             2n            √          √‡
    MC/DC       n+1           2n            √          √

    ‡ Correlated (for MC/DC: correlated and independent)

Table 1. Summary of test criteria

The criteria described above form a hierarchy. For example, any test-set that satisfies MC/DC coverage will also satisfy DC coverage. Formally, a criterion C1 subsumes C2 if, for every program P and specification S, every test-set that satisfies C1 also satisfies C2. However, C1 subsuming C2 does not guarantee that C1 is more effective than C2 for every program [6].
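The subsumption of DC by MC/DC can be checked mechanically on small examples. The sketch below (our own, treating a decision as a pure Boolean function) verifies that every MC/DC-adequate test-set for a sample decision also satisfies DC:

```python
from itertools import product, combinations

def satisfies_dc(decision, ts):
    # DC: the decision must evaluate to both True and False
    return {decision(*t) for t in ts} == {True, False}

def satisfies_mcdc(decision, n, ts):
    # MC/DC: every condition has an independence pair in the test-set
    def pair_ok(i, t1, t2):
        return (all(t1[j] == t2[j] for j in range(n) if j != i)
                and t1[i] != t2[i] and decision(*t1) != decision(*t2))
    return all(any(pair_ok(i, t1, t2) for t1, t2 in combinations(ts, 2))
               for i in range(n))

D = lambda A, B: A and B
# Every MC/DC-adequate test-set over the 2^2 inputs also satisfies DC:
# an independence pair already forces both decision outcomes.
for size in (3, 4):
    for ts in combinations(list(product([False, True], repeat=2)), size):
        if satisfies_mcdc(D, 2, ts):
            assert satisfies_dc(D, ts)
print("checked: MC/DC-adequate sets for D also satisfy DC")
```

The converse fails, of course: a DC pair need not isolate any single condition's effect.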

2.2. Types of Faults

One way to study fault detection effectiveness is to define a set of faults that occur in real-life situations and use them to generate the subject implementations from the specification. Various types of faults have been defined and used in the past to study the effectiveness of test criteria; for example, see [14, 20, 24, 27, 30]. An analysis of various fault classes and the hierarchy among them is presented in [13, 26]. We considered the following four fault types in our empirical study:

• Associative Shift Fault (ASF): a Boolean expression is replaced by one in which the association between conditions is incorrect; for example, X ∧ (Y ∨ Z) is replaced by X ∧ Y ∨ Z. As the '∧' operator has higher precedence than '∨', the second expression is equivalent to (X ∧ Y) ∨ Z.

• Expression Negation Fault (ENF): a Boolean expression is replaced by its negation; for example, X ∧ (Y ∨ Z) by X ∧ ¬(Y ∨ Z).

• Operator Reference Fault (ORF): a Boolean operator is replaced by another; for example, X ∧ Y by X ∨ Y.

• Variable Negation Fault (VNF): an atomic condition X is replaced by its negation ¬X. VNF applies only to atomic conditions and thus differs from ENF, which negates a logical sub-expression rather than a single condition.

During the analysis, only one fault is introduced at a time. In a real-life scenario, more than one fault can be present at the same or at different control points. However, research into the 'fault coupling effect' by Offutt [16] supports the hypothesis that complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will detect a high percentage of the complex faults.
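The four fault operators can be illustrated on the expression X ∧ (Y ∨ Z) used in the definitions above. The sketch below (ours, not the paper's fault generator) counts, for each single-fault mutant, how many of the eight inputs expose it:

```python
from itertools import product

# Correct decision and one single-fault mutant per fault type for
# X ∧ (Y ∨ Z). Illustrative sketch only.
spec = lambda X, Y, Z: X and (Y or Z)
mutants = {
    "ASF": lambda X, Y, Z: (X and Y) or Z,       # X ∧ Y ∨ Z
    "ENF": lambda X, Y, Z: X and not (Y or Z),   # X ∧ ¬(Y ∨ Z)
    "ORF": lambda X, Y, Z: X or (Y or Z),        # ∧ replaced by ∨
    "VNF": lambda X, Y, Z: (not X) and (Y or Z), # X replaced by ¬X
}

# A test-case exposes a fault iff the mutant disagrees with the spec on it.
for name, m in mutants.items():
    exposing = [t for t in product([False, True], repeat=3)
                if m(*t) != spec(*t)]
    print(name, len(exposing))
```

Note that this ASF mutant is exposed by only two of the eight inputs, consistent with ASF being the hardest of the four fault classes to detect.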

3. Related Work

Experimental comparisons of the effectiveness of data-flow and control-flow testing criteria are reported in [5, 4, 10]; the subjects of these experiments were programs. Empirical evaluations of testing techniques for Boolean specifications were presented in [18, 19, 27, 30]. Harman et al. [9] have proposed a model for conducting empirical studies in software testing; the model suggests standardizing the process to avoid conflicts between the results of different empirical studies. In [4], the space of test-sets was sampled and appropriate statistical techniques were used to interpret the results. Test-sets were selected by first creating a universe of possible test-cases and then randomly selecting test-sets of a given size from that universe, and the distribution of effectiveness with respect to coverage was reported.

Various formal relationships between testing criteria have also been studied, as in [34]. For example, the 'subsumes' relationship is based on a property of test criteria. However, subsumption of one criterion by another does not mean that the subsuming criterion is more effective at detecting faults. Frankl and Weyuker [6] proved that the subsume relation between software test adequacy criteria does not guarantee better fault-detecting ability in the prior testing scenario. Kuhn [13] explored the relationship between various fault types, showing that ENFs are the weakest faults in the sense that any test technique that catches stronger faults is very likely to find ENFs. These results were further improved by Tsuchiya and Kikuno [26]. However, the result applies only to associated faulty decisions and not to all faulty decisions in general; therefore nothing can be concluded about overall effectiveness or the variation in effectiveness.

Effectiveness measures, as defined in [6], are stated in terms of a program and specification. As pointed out in [29], it is difficult to say what a typical or representative program and specification is. Similar difficulties arise when classifying faults into categories based on their severity. It is therefore not possible to use this measure of effectiveness in general. To overcome the difficulty of assessing test criteria based on the severity of faults, a framework is defined in [8].

Another relevant research question is whether minimisation of test-set size has any impact on fault detection effectiveness [32]. It was found that minimisation had minimal effect; on the other hand, another empirical study [22] presents a contradictory result. Our experiments automatically cover this question, as we consider all possible test-sets: if minimisation reduced effectiveness, this would be reflected in the overall result.

4. Experimental Design

The three main control-flow test criteria that we studied were DC, FPC and MC/DC. As DC test-sets require only two test-cases per test-set (cf. Table 1), we also analysed DC/R, a variant of DC. It would be misleading to compare a test criterion C1 that yields large test-sets with another criterion C2 that selects small test-sets, as it would be impossible to tell whether any reported benefit of C1 is due to its inherent properties or merely to the fact that the test-sets considered are larger, and hence more likely to detect faults.

Because the test criteria of our study are based on the decisions that appear in the program, we restricted our empirical evaluation to Boolean specifications and their implementations. At first sight, this may appear to be an oversimplification; however, note that for a given Boolean specification with n conditions, only one of the 2^(2^n) possible realisations is correct. Further complexity is added by the exponentially large number of test-sets satisfying a given criterion for a given specification: assuming a test-set size of 2n, the candidate space of test-sets has size C(2^n, 2n). On the other hand, working with Boolean specifications makes it possible to generate a large number of random decisions with varying sizes and properties. Furthermore, although the number is large, it is still feasible to generate all test-sets for the decisions that usually occur in real programs. Chilenski and Miller [2] investigated the complexity of expressions in different software domains and found that complex Boolean expressions are most common in the avionics domain, with the number of conditions in a Boolean decision as high as 30.

Our experimental analysis was done using software written specifically for this purpose. The prototype version was written in the functional programming language Haskell. As we had to deal with an exponentially large number of test-sets, the prototype Haskell version failed to handle complex decisions; the software was therefore rewritten in Java. The software allows the analysis of a given or a randomly generated Boolean decision. The experimental steps involved in the empirical analysis of the above-mentioned test criteria were as follows:

1. Select subject Boolean decisions.
2. Generate all faulty decisions.
3. Generate all test-sets for a criterion.
4. Compute the effectiveness of each test-set.
5. Compute the variation in effectiveness over all test-sets.

Steps 1 and 2 are elaborated in the following section. Section 4.2 presents step 3, and section 4.3 gives the evaluation metrics used in steps 4 and 5.
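Steps 4 and 5 amount to computing, per test-set, the fraction of mutants it distinguishes, and then the mean and variance of that score over all test-sets. This is sketched below for a toy decision with two hand-picked mutants (our illustration of the evaluation metrics, not the authors' Java tool):

```python
from itertools import product
from statistics import mean, variance

# Toy decision and two single-fault mutants (illustrative only).
spec = lambda A, B: A or B
mutants = [lambda A, B: A and B,        # ORF: ∨ replaced by ∧
           lambda A, B: (not A) or B]   # VNF: A replaced by ¬A
k = len(mutants)

points = list(product([False, True], repeat=2))
true_pts = [p for p in points if spec(*p)]
false_pts = [p for p in points if not spec(*p)]
dc_test_sets = [(t, f) for t in true_pts for f in false_pts]

def ET(ts):
    # percentage of the k mutants distinguished from the spec by ts
    m = sum(1 for mut in mutants
            if any(mut(*t) != spec(*t) for t in ts))
    return 100.0 * m / k

scores = [ET(ts) for ts in dc_test_sets]
EF = mean(scores)        # average effectiveness over all test-sets
Var = variance(scores)   # sample variance (n - 1 denominator)
print(len(dc_test_sets), EF, Var)
```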

4.1. Subject Boolean Specifications and Implementations

The subject Boolean decisions were taken from a research paper [30] and also produced using a random generator. The Boolean specifications used in [30] were taken from an aircraft collision avoidance system. We additionally analysed 500 randomly generated decisions, in which the number of conditions varied from three to twelve. The number of conditions is an input to the random Boolean decision generator. The seed of the random generator was fixed before starting the experiments so that the experiments can be repeated, restarted and re-analysed by a third party [9].

The generated Boolean decisions are guaranteed to contain only irredundant conditions, although a condition can occur more than once in a decision. Generating decisions with no redundant conditions guarantees that a faulty implementation cannot depend on a condition on which the original decision did not depend. Consider, for example, the Boolean decision D ≡ A ∧ (A ∨ B), in which only A is irredundant. An associative shift fault in D would change the decision to D_ASF ≡ A ∧ A ∨ B, which is equivalent to A ∨ B, in which both A and B are irredundant.

We studied the fault detection effectiveness of the test criteria for Boolean specifications with respect to four types of faults: ASF, ENF, ORF and VNF. The faulty Boolean decisions (implementations) were generated for each fault type by introducing one fault at a time. For example, given the decision D ≡ A ∨ ¬B ∧ C as input, the ORF fault operator generates the two Boolean decisions A ∧ ¬B ∧ C and A ∨ ¬B ∨ C. Only one fault is present in any one implementation. As mentioned before, if in a real-life scenario more than one fault is present at the same or at different control points, there is still a high probability of finding them all [16].
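The ORF example above can be checked directly. The sketch below (our illustration) confirms that each of the two faulty decisions generated from D ≡ A ∨ ¬B ∧ C differs from D on at least one input, and counts the inputs that would expose each fault:

```python
from itertools import product

# D ≡ A ∨ (¬B ∧ C); replacing each binary operator in turn yields
# the two ORF mutants named in the text.
spec = lambda A, B, C: A or ((not B) and C)
orf1 = lambda A, B, C: A and ((not B) and C)   # ∨ replaced by ∧
orf2 = lambda A, B, C: A or ((not B) or C)     # ∧ replaced by ∨

for mutant in (orf1, orf2):
    exposing = [t for t in product([False, True], repeat=3)
                if mutant(*t) != spec(*t)]
    # a test-set detects the fault iff it contains one of these inputs
    print(len(exposing))
```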

4.2. Generation of All Possible Test-Sets

For a given Boolean decision with n conditions, there are 2^n entries in the matching truth table. As the objective of the experiments was to perform an exhaustive analysis of all possible test-sets, all entries of this truth table were generated. Let ST and SF be the sets of test-cases for which the decision evaluates to T and F, respectively. A DC test-set is generated by choosing one test-case each from ST and SF; all DC test-sets are obtained by taking the cross product of ST and SF. For generating FPC test-sets, DC test-sets were appended with test-cases such that the extended test-sets satisfied the FPC property. This methodology was chosen to work around a computational infeasibility problem: the space of possible test-sets of size k is C(2^n, k), and with k equal to 2n and n ≥ 5 it is impossible to consider all combinations (for n = 5, C(32, 10) = 64,512,240).

The following algorithmic steps describe the generation of DC/R and MC/DC test-sets:

    dcrTestSets := ∅
    for each ts ∈ ST × SF
    begin
        choose random r from range [n+1, 2n]
        testSet := ts
        if (∃ condition c | ts satisfies MC/DC for c) then
            add ts to the MC/DC pair list for c
        endif
        for i = 1 to r
        begin
            select random test-case t
            testSet := testSet ∪ {t}
        end
        dcrTestSets := dcrTestSets ∪ {testSet}
    end

As shown above, the generation of DC test-sets required consideration of all possible pairs of test-cases for which the decision evaluates to T and F. A further check determined whether the test-case pair satisfied the MC/DC property for some condition; in that case, the pair was added to the list maintained for that condition. All MC/DC test-sets were then generated by considering all possible combinations of test-case pairs satisfying the MC/DC property for every condition.

4.3. Evaluation Metrics

Let D be a Boolean decision with n conditions, and let k be the number of faulty decisions generated from D using a given fault type, F. The effectiveness, ET, of a test-set, T, satisfying the criterion, C, with respect to F and D is given by

    ET(C, D, F, T) = (m / k) × 100

where m is the number of faulty decisions that T distinguishes from D.

The effectiveness of a test criterion, EF, with respect to a fault type is the average effectiveness over all test-sets:

    EF(C, D, F) = ( Σ_{T satisfies C} ET(C, D, F, T) ) / (no. of T satisfying C)

Furthermore, the variation in effectiveness is the sample variance of ET:

    Var(C, D, F) = ( Σ_{T satisfies C} (EF(C, D, F) − ET(C, D, F, T))² ) / ((no. of T satisfying C) − 1)

5. Analysis and Results

The effectiveness of each test criterion was measured by computing the average effectiveness over all test-sets. The graphs presented in this section plot the percentage of faults against the percentage of test-sets that detected them, showing the variation in fault detection effectiveness over all possible test-sets satisfying a given criterion. For example, 55% of DC/R test-sets were between 90% and 100% effective in detecting expression negation faults (Figure 2). The MC/DC criterion was found to be highly effective for Expression Negation Faults (ENF), Operator Reference Faults (ORF) and Variable Negation Faults (VNF); however, a large variation in effectiveness was observed for Associative Shift Faults (ASF). In particular, around 25% of MC/DC test-sets detected less than 10% of ASF faults.

Figure 2: Distribution of effectiveness for expression negation fault (% of test-sets vs. % of faults; DC, DC/R, FPC, MC/DC)

As described in [13], the expression negation faults are the weakest types of faults. It is expected that ENF detection will be high, as can be observed in Figure 2. It can be observed from the Figures 2, 3 and 4 that MC/DC test-sets were always found to have more than 60% effectiveness. On the other hand, DC, DC/R and FPC test-sets had fault detection effectiveness as low as 10%.

Figure 3: Distribution of effectiveness for operator reference fault 80 70

% of test-sets

60 50

DC DCR FPC MCDC

40 30 20 10 0 10

20

30

40

50

60

70

80

90

100

% of faults

Figure 4: Distribution of effectiveness for variable negation fault 70

ASF faults in a typical Boolean decision is usually between one and five leading to uneven distribution on a percentage scale. For example, if a Boolean decision has only one ASF fault, the effectiveness of a test-set would be either 0 or 100%. In the case of ENF and ASF, if for cost reasons few testsets are to be selected, the MC/DC criterion is more reliable in detecting faults. But as it is difficult to generate MC/DC test-sets compared to other criteria, we may still detect a high number of faults by choosing more test-sets satisfying the other criteria. The distribution of effectiveness for DC/R was found to be similar to that for DC and FPC, although the number of test-cases in DC/R test-sets is much higher than for DC. Therefore it can be concluded that the test-set property has more influence on the fault detection effectiveness in comparison to the size of test-set.

60

% of test-sets

50

DC DCR FPC MCDC

40 30 20 10 0 10

20

30

40

50

60

70

80

90

100

% of faults

Figure 5: Distribution of effectiveness for associative shift fault 80 70

% of test-sets

60 50

DC DCR FPC MCDC

40 30

5.1.1 Average Effectiveness As noticed earlier, with the increase in the number of conditions in a Boolean decision, there is an exponential increase in the number of possible combinations and realisations. However, the number of test-cases in the test-set remains linear with respect to the number of conditions. To observe the effect of an increase in the number of conditions on the effectiveness of test criteria, we also studied the average effectiveness of different criteria with respect to the number of conditions. Figure 6: Distribution of avg. effectiveness w.r.t. no. of conditions for expression negation fault 100 90 average effectiveness

The ORF and VNF faults are close together in the fault relation hierarchy. For this reason, their variation in effectiveness is quite similar, as is evident from Figures 3 and 4. The effectiveness of MC/DC test-sets for ORF and VNF faults remains above 55%. The number of test-sets in any fault detection range for the DC, DC/R and FPC criteria remains below 20%; because of this, use of these criteria for detecting ORF and VNF faults will be less reliable.

5.1. Variation of Effectiveness with Number of Conditions

80 70 DC DCR FPC MCDC

60 50 40 30 20 10 0 3

4

5

6

7

8

9

10

11

12

no. of conditions

20 10 0 10

20

30

40

50

60

70

80

90

100

% of faults

As shown in Figure 5, the variation in effectiveness for ASF faults is different from other faults. This is due to two reasons: (a) ASF is the strongest fault class and so any test technique is unlikely to detect them, and (b) the number of

In general, the MC/DC criterion was found to be effective independent of the number of conditions in the Boolean decision. However, the following figures show that the effectiveness of other criteria decreased linearly as the number of conditions increased. For ENF, being the weakest fault, there is less decrease in the average effectiveness as shown in Figure 6. In this case, the average effectiveness of all the criteria remains above 60%.

Figure 7: Distribution of avg. effectiveness w.r.t. no. of conditions for operator reference fault

Figure 8: Distribution of avg. effectiveness w.r.t. no. of conditions for variable negation fault

There is a sharp drop in the average effectiveness for ORF and VNF compared to ENF because they are stronger faults than ENF (see Figures 7 and 8). As the number of conditions increases, the size of the truth table grows exponentially, so a fault changes a smaller fraction of the truth table. This makes such faults hard to distinguish for simple techniques that do not take the correlation between the conditions into account.

Figure 9: Distribution of avg. effectiveness w.r.t. no. of conditions for associative shift fault

The variation in effectiveness with respect to the number of conditions for associative shift faults was slightly different from that for ENF, ORF and VNF. Figure 9 shows that the effectiveness of MC/DC increased with the number of conditions. The reason for this increase is that as the number of conditions rises, the number of associations is likely to grow, leading to a higher probability of fault detection. For example, one ASF fault leads to 30% effectiveness, while five ASF faults lead to more than 50% effectiveness. The other criteria do not show any improvement in effectiveness.

5.1.2 Standard Deviation

We also studied the stability of a test criterion with respect to the number of conditions. The following plots depict the change in the standard deviation of the effectiveness with the number of conditions. Ideally, a test criterion should have high average effectiveness along with a low standard deviation.

Figure 10: Distribution of std. deviation (eff.) w.r.t. no. of conditions for expression negation fault

For expression negation faults, the standard deviation of the effectiveness of DC, DC/R and FPC increased with the number of conditions, while the MC/DC criterion had a low and almost constant standard deviation of effectiveness.

Figure 11: Distribution of std. deviation (eff.) w.r.t. no. of conditions for operator reference fault

In the case of ORF and VNF, the standard deviation was observed to decrease with an increasing number of conditions for DC, DC/R and FPC, while it remained constant for MC/DC (see Figures 11 and 12). But as the average effectiveness of DC, DC/R and FPC decreases while that of MC/DC remains constant, MC/DC can be considered more reliable (see Figures 7 and 8).
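The kind of measurement behind these averages and standard deviations can be sketched as follows. This is a simplified illustration, not the study's actual procedure: the decision, the mutant, and the restriction to minimal decision-coverage (DC) test-sets of exactly two test-cases are all our own assumptions. Effectiveness is taken, as in Section 5.1, to be the fraction of criterion-adequate test-sets that expose the fault:

```python
from itertools import product
from statistics import mean, pstdev

# Illustrative decision under test and a variable-negation (VNF) mutant.
def original(a, b, c):
    return (a and b) or c

def mutant(a, b, c):
    return ((not a) and b) or c   # VNF: condition a replaced by (not a)

inputs = list(product([False, True], repeat=3))
true_pts = [t for t in inputs if original(*t)]
false_pts = [t for t in inputs if not original(*t)]

# All minimal DC-adequate test-sets: one input per decision outcome.
test_sets = [(t, f) for t in true_pts for f in false_pts]

# A test-set exposes the fault iff some test-case yields a different outcome.
def exposes(ts):
    return any(original(*t) != mutant(*t) for t in ts)

# Effectiveness of each test-set (100% if it exposes the fault, else 0%),
# then the distribution statistics reported in the figures.
eff = [100.0 if exposes(ts) else 0.0 for ts in test_sets]
print(f"avg effectiveness {mean(eff):.1f}%, std dev {pstdev(eff):.1f}")
```

For this toy decision, 7 of the 15 minimal DC-adequate test-sets expose the VNF mutant, giving an average effectiveness of about 47% with a standard deviation of about 50 percentage points: the average alone hides how widely individual test-sets differ, which is why the study also reports standard deviations.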

Figure 12: Distribution of std. deviation (eff.) w.r.t. no. of conditions for variable negation fault

Figure 13: Distribution of std. deviation (eff.) w.r.t. no. of conditions for associative shift fault

The standard deviation of the effectiveness for associative shift faults remained high for all the considered criteria even with an increasing number of conditions (see Figure 13). MC/DC test-sets showed only a marginal decrease in standard deviation with increasing average effectiveness. This implies that more test-sets are required to detect all ASF faults than for the other fault classes.

5.2. Threats to Validity

The controlled experiments in this study were conducted on abstract control points (Boolean decisions); in a real implementation, however, detecting a fault is not sufficient: the fault must also propagate and cause a failure. Also, in real specifications and implementations, some conditions may be coupled, i.e. it may not be possible to vary one condition without affecting other conditions. Further complexity in analysing real programs can arise from the infeasibility of certain combinations of conditions. Both condition coupling and infeasibility may reduce the number of test-sets that satisfy a given criterion, and may introduce a bias towards effective or ineffective test-sets.

6. Conclusions

An empirical evaluation of the effectiveness of three main control-flow test criteria (DC, FPC, MC/DC) has been performed. As the sizes of the test-sets generated using these criteria differ, another criterion, DC/R, was introduced to check whether the property of a test-set has more influence on fault detection effectiveness than its size. The study was based on exhaustive generation of all possible test-sets against ENF, ORF, VNF and ASF faults. We also analysed the variation in the effectiveness of the test criteria, which helps in determining their fault detection reliability. Boolean decisions in a program, with varying numbers of conditions, were the subjects of study. For each test criterion, we analysed the influence of the number of conditions on the average fault detection effectiveness and on the standard deviation of the effectiveness.

In general, MC/DC was found to be effective in detecting faults, although MC/DC test-sets showed a large variation in fault detection effectiveness for ASF faults. The average effectiveness of DC, DC/R and FPC was found to decrease as the number of conditions increased, while that of MC/DC remained almost constant. The standard deviation of the effectiveness varied for DC, DC/R and FPC but remained constant for MC/DC. The MC/DC test criterion was thus found to be more stable and promising for practical applications. If for cost reasons only a few test-sets can be selected, it might be best to choose MC/DC test-sets. However, as MC/DC test-sets are more expensive to generate, high fault detection is also possible by selecting a greater number of test-sets satisfying the other criteria.

ENF was found to be the weakest fault, with all test criteria giving good results. The distribution of effectiveness was uniform for ORF and VNF. ASF, being the strongest fault, showed different behaviour on all metrics.

The effectiveness of DC/R test-sets was found to be similar to that of DC and FPC, although the number of test-cases in the DC/R test-sets was much higher. It can therefore be concluded that the property of a test-set has more influence on fault detection effectiveness than its size.

References

[1] W. Chen, R. H. Untch, G. Rothermel, S. Elbaum, and J. V. Ronne. Can fault-exposure-potential estimates improve the fault detection abilities of test suites? Journal of Software Testing, Verification, and Reliability, 4(2), December 2002.
[2] J. Chilenski and S. Miller. Applicability of modified condition/decision coverage to software testing. Software Engineering Journal, pages 193–200, September 1994.
[3] A. Dupuy and N. Leveson. An empirical evaluation of the MC/DC coverage criterion on the HETE-2 satellite software. In DASC: Digital Aviation Systems Conference, Philadelphia, October 2000.
[4] P. G. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In ACM SIGSOFT Software Engineering Notes, Sixth International Symposium on Foundations of Software Engineering, volume 23, pages 153–162, November 1998.
[5] P. G. Frankl and S. Weiss. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Transactions on Software Engineering, 19(8):774–787, August 1993.
[6] P. G. Frankl and E. Weyuker. A formal analysis of the fault-detecting ability of testing methods. IEEE Transactions on Software Engineering, 19(3):202–213, March 1993.
[7] P. G. Frankl and E. J. Weyuker. Provable improvements on branch testing. IEEE Transactions on Software Engineering, 19(10):962–975, October 1993.
[8] P. G. Frankl and E. J. Weyuker. Testing software to detect and reduce risk. The Journal of Systems and Software, pages 275–286, 2000.
[9] M. Harman, R. Hierons, M. Holcombe, B. Jones, S. Reid, M. Roper, and M. Woodward. Towards a maturity model for empirical studies of software testing. In Fifth Workshop on Empirical Studies of Software Maintenance (WESS'99), Keble College, Oxford, UK, September 1999.
[10] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria. In 16th International Conference on Software Engineering (ICSE-16), pages 191–200, 1994.
[11] R. Jasper, M. Brennan, K. Williamson, B. Currier, and D. Zimmerman. Test data generation and feasible path analysis. In SIGSOFT: International Symposium on Software Testing and Analysis, Seattle, Washington, USA, pages 95–107. ACM, 1994.
[12] J. A. Jones and M. J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. In International Conference on Software Maintenance (ICSM'01), Florence, Italy, pages 92–101. IEEE, November 2001.
[13] D. R. Kuhn. Fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 8(4):411–424, October 1999.
[14] L. J. Morell. A theory of fault-based testing. IEEE Transactions on Software Engineering, 16(8):844–857, August 1990.
[15] G. Myers. The Art of Software Testing. Wiley-Interscience, 1979.
[16] A. J. Offutt. Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology, 1(1):5–20, 1992.
[17] A. J. Offutt, Y. Xiong, and S. Liu. Criteria for generating specification-based tests. In Fifth IEEE International Conference on Engineering of Complex Computer Systems (ICECCS'99), Las Vegas, Nevada, USA, pages 119–129, October 1999.
[18] A. Paradkar and K. Tai. Test-generation for Boolean expressions. In Sixth International Symposium on Software Reliability Engineering (ISSRE'95), Toulouse, France, pages 106–115, 1995.
[19] A. Paradkar, K. Tai, and M. A. Vouk. Automatic test-generation for predicates. IEEE Transactions on Reliability, 45(4):515–530, December 1996.
[20] D. J. Richardson and M. C. Thompson. An analysis of test data selection criteria using the RELAY model of fault detection. IEEE Transactions on Software Engineering, 19(6):533–553, June 1993.
[21] M. Roper. Software Testing. McGraw-Hill Book Company Europe, 1994.
[22] G. Rothermel, M. J. Harrold, J. V. Ronne, and C. Hong. Empirical studies of test suite reduction. Journal of Software Testing, Verification, and Reliability, 4(2), December 2002.
[23] RTCA/DO-178B. Software considerations in airborne systems and equipment certification. Washington, DC, USA, 1992.
[24] K.-C. Tai. Theory of fault-based predicate testing for computer programs. IEEE Transactions on Software Engineering, 22(8):552–562, August 1996.
[25] Software testing FAQ: Testing tools supplier. http://www.testingfaqs.org/tools.htm.
[26] T. Tsuchiya and T. Kikuno. On fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 11(1):58–62, January 2001.
[27] M. A. Vouk, K. C. Tai, and A. Paradkar. Empirical studies of predicate-based software testing. In 5th International Symposium on Software Reliability Engineering, pages 55–64, 1994.
[28] E. Weyuker. Can we measure software testing effectiveness? In First International Software Metrics Symposium, Baltimore, USA, pages 100–107. IEEE, 21–22 May 1993.
[29] E. Weyuker. Thinking formally about testing without a formal specification. In Formal Approaches to Testing of Software (FATES'02), A Satellite Workshop of CONCUR'02, Brno, Czech Republic, pages 1–10, 24 August 2002.
[30] E. Weyuker, T. Goradia, and A. Singh. Automatically generating test data from a Boolean specification. IEEE Transactions on Software Engineering, 20(5):353–363, May 1994.
[31] A. White. Comments on modified condition/decision coverage for software testing. In IEEE Aerospace Conference, Big Sky, Montana, USA, volume 6, pages 2821–2828, 10–17 March 2001.
[32] W. Wong, J. Horgan, A. Mathur, and A. Pasquini. Test set size minimization and fault detection effectiveness: a case study in a space application. In Twenty-First Annual International Computer Software and Applications Conference (COMPSAC'97), pages 522–528, 13–15 August 1997.
[33] H. Zhu. Axiomatic assessment of control flow-based software test adequacy criteria. Software Engineering Journal, pages 194–204, September 1995.
[34] H. Zhu. A formal analysis of the subsume relation between software test adequacy criteria. IEEE Transactions on Software Engineering, 22(4):248–255, April 1996.
[35] H. Zhu, P. Hall, and H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):336–427, December 1997.