Experimental Evaluation of the Tolerance for Control-Flow Test Criteria

Kalpesh Kapoor and Jonathan Bowen
Centre for Applied Formal Methods, South Bank University, London, UK
{kapoork, bowenjp}@sbu.ac.uk
http://www.cafm.sbu.ac.uk

Abstract

For a given test criterion, the number of test-sets satisfying the criterion may be very large, with varying fault detection effectiveness. In recent work [29], the measure of variation in the effectiveness of a test criterion was defined as 'tolerance'. This paper presents an experimental evaluation of tolerance for control-flow test criteria. The experimental analysis is performed by exhaustive test-set generation, wherever possible, for a given criterion; this improves on earlier empirical studies, which analysed only a sample of test-sets chosen using random selection techniques. Four industrially used control-flow testing criteria, Condition Coverage (CC), Decision Condition Coverage (DCC), Full Predicate Coverage (FPC) and Modified Condition/Decision Coverage (MC/DC), have been analysed against four types of faults. A new test criterion, Reinforced Condition/Decision Coverage (RC/DC) [28], is also analysed and compared. The Boolean specifications considered were taken from a past research paper and also generated randomly. To ensure that it is the test-set property that influences the effectiveness and not the test-set size, the average test-set size was kept the same for all the test criteria except RC/DC. A further analysis of the variation in average effectiveness with respect to the number of conditions in a decision was also undertaken. The empirical results show that the MC/DC criterion is more reliable and stable than the other considered criteria. Although the number of test-cases is larger in RC/DC test-sets, no significant improvement in effectiveness and tolerance was observed over MC/DC.

Keywords: Control-flow testing criteria, effectiveness, tolerance, empirical evaluation, CC, DCC, FPC, MC/DC, RC/DC, Boolean specification, fault-based testing.

1 Introduction

Acknowledgement: This research has benefited from participation in and discussions with colleagues on the UK EPSRC FORTEST Network (GR/R43150/01) on formal methods and testing (http://www.fortest.org.uk). Financial support for travel is gratefully acknowledged.


A test data adequacy criterion is a set of rules that prescribe some property for test-sets. More formally, a test criterion is a function C : Programs × Specifications → 2^Test-Sets used for determining whether a program P has been thoroughly tested by a test-set T relative to the specification S [38]. If a test-set T ∈ C(P, S), then we say T is adequate for testing P with respect to S according to C, or that T is C-adequate for P and S.

The two broad categories of test criteria are data-flow and control-flow test criteria. Data-flow criteria prescribe rules for test-sets that are based on data definitions and their usage in the program. As the objective of this paper is to explore control-flow testing criteria, we shall expand only on this aspect. Control-flow test criteria [17, 23, 36] are meant to challenge the decisions made by the program with test-cases based on the structure and logic of the design and source code. Control-flow testing ensures that program statements and decisions are fully exercised by code execution. The well-studied control-flow graph, which is widely used for static analysis of programs, is also used for defining and analysing control-flow test criteria. A flow graph is a directed graph in which every node represents a linear sequence of computations. Each edge in the graph represents a transfer of control and is associated with a Boolean decision that represents the condition of control transfer from one node to another. The control-flow test criteria check these Boolean decisions of the program based on the specification.

The effectiveness of a test criterion C [3–7, 12, 31, 37] is the probability that a test-set selected randomly from the set of all test-sets satisfying the criterion will expose an error. As there is an exponentially large number of test-sets satisfying a test criterion, experimental analysis of fault detection effectiveness is usually based on random selection of different test-sets satisfying the given criterion. Such an analysis is useful to find the average effectiveness but may be misleading due to an uneven distribution of the fault detection capability of the test-sets. For instance, consider the set of all test-sets, T, that satisfy a criterion, C, for a given program and specification. It may be the case that some of the test-sets in T expose an error, while others do not. If a large percentage of the C-adequate test-sets expose an error, then C is an effective criterion for this program [4]. However, it may happen that there is a large variation in the fault detection effectiveness of different test-sets, with equal probabilities for high and low fault detection. In such a scenario, empirical studies based on random selection of test-sets may give incorrect results. Therefore it is also important to study the variation in effectiveness of the test-sets of a test criterion.

1.1 An Example

Consider, for example, the following decision D:

D ≡ (¬A ∧ B ∧ C) ∨ (A ∧ ¬B ∧ C) ∨ (A ∧ B ∧ ¬C)



where A, B and C are atomic Boolean conditions. Let us assume that the implementation of the above decision includes a mistake in which one occurrence of an atomic condition is replaced by its negation; i.e. X is substituted by ¬X or vice versa. There can be nine such faulty decisions, as conditions appear nine times in the decision D. The well-known condition coverage (CC) criterion requires that the test-set exercise both the true and false evaluation of every atomic condition. Another criterion, Modified Condition/Decision Coverage (MC/DC), which is a mandatory requirement for testing avionics software under DO-178B [24], requires that the test-set exercise both the true and false branches and also check the independent effect of each atomic condition on the final outcome of the Boolean decision. Figure 1 shows the effectiveness of all possible MC/DC and CC test-sets.

[Figure 1: Distribution of effectiveness for example decision D. Percentage of test-sets plotted against percentage of faults detected, for CC and MC/DC.]

It is evident from Figure 1 that the fault detection effectiveness of a randomly selected CC test-set for the decision D could vary from 30 to 100 percent, as all the test-sets with effectiveness in this range have approximately the same probability of being selected (except those with 70–80% effectiveness). However, the variation in effectiveness of MC/DC is low, lying in the narrower range of 70–100%. Therefore, although the average effectiveness of the CC criterion is high (≈ 70%), the criterion may be unsuitable due to the high variation in effectiveness for the decision D.
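To make the effectiveness computation concrete, the following minimal sketch (our illustration in Java; the class and method names are hypothetical, and this is not the tool used for the experiments) enumerates the nine negation-fault mutants of D and scores a test-set by the percentage of mutants it distinguishes from D. Test-cases are encoded as bitmasks over (A, B, C).

```java
import java.util.List;

public class EffectivenessDemo {
    // D = (¬A ∧ B ∧ C) ∨ (A ∧ ¬B ∧ C) ∨ (A ∧ B ∧ ¬C)
    static boolean d(boolean a, boolean b, boolean c) {
        return (!a && b && c) || (a && !b && c) || (a && b && !c);
    }

    // The nine faulty decisions: negate the k-th occurrence (0..8) of a condition.
    static boolean mutant(int k, boolean a, boolean b, boolean c) {
        boolean[] v = {a, b, c, a, b, c, a, b, c};
        v[k] = !v[k];
        return (!v[0] && v[1] && v[2]) || (v[3] && !v[4] && v[5]) || (v[6] && v[7] && !v[8]);
    }

    // Effectiveness of T: percentage of the 9 mutants that some test-case distinguishes from D.
    static double effectiveness(List<Integer> testSet) {
        int killed = 0;
        for (int k = 0; k < 9; k++) {
            for (int t : testSet) {
                boolean a = (t & 1) != 0, b = (t & 2) != 0, c = (t & 4) != 0;
                if (d(a, b, c) != mutant(k, a, b, c)) { killed++; break; }
            }
        }
        return 100.0 * killed / 9;
    }

    public static void main(String[] args) {
        // {TTT, FFF} is CC-adequate: every condition takes both truth values.
        System.out.println(effectiveness(List.of(0b111, 0b000)) + "%");
    }
}
```

Scoring every CC-adequate and every MC/DC-adequate test-set with such a function yields the distributions plotted in Figure 1.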

1.2 Research Objectives

The research issue addressed by this empirical study is to find how good or bad a randomly selected test-set could be at exposing errors according to a given test criterion. This question is difficult to answer because, for a given program and adequacy criterion, there is typically a very large number of adequate test-sets [4]. If the program is incorrect, then usually some of these test-sets expose an error while others do not. Previous empirical studies [20, 21, 30, 33] of this nature had the following limitations: (a) they used a relatively small number of Boolean decisions; (b) test-sets were selected at random from the large sample space. We attempt to solve the first problem by generating a large number of Boolean decisions with a varying number of conditions (3–12). The second problem has been addressed by choosing all the test-sets instead of selecting only some (wherever possible). Of course, this requires the use of abstraction. The abstractions made are justified for the following reasons: (a) the scope of the study is limited to control-flow testing criteria, which are intrinsically designed to test the decisions made by the program at its control points; (b) if there is more than one fault, either at the same control point or at another control point, there is still a very high likelihood of finding all the faults. This is because research into the 'fault coupling effect', as demonstrated in [18], shows that complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will detect a high percentage of the complex faults.

We considered Boolean specifications as the subject for our experiments. Faults were introduced one at a time to generate a faulty specification representing the implementation. The main objectives of the study can be summarized as follows:

• To study and compare industrially used control-flow test criteria, namely Condition Coverage (CC), Decision Condition Coverage (DCC), Full Predicate Coverage (FPC) and Modified Condition/Decision Coverage (MC/DC). For example, a major aviation industry standard has a mandatory requirement to use MC/DC test-sets [24].

• To evaluate a new testing criterion, Reinforced Condition/Decision Coverage (RC/DC), proposed recently as a potential improvement on the MC/DC criterion [28].

• To study the variation in effectiveness by exhaustively generating (wherever possible) all the test-sets satisfying a criterion. Earlier work concentrated on choosing some test-sets according to a given criterion, which introduces a certain degree of inconsistency in the various results [10].

• To establish a connection between fault detection effectiveness and control-flow testing criteria. Since the considered criteria are widely used in industry and there are tools [26] which automatically generate test data for a given test criterion, it can then be guaranteed that certain types of faults are not present in the implementation.

The rest of the paper is organized as follows. The next section introduces the definitions and preliminary results used in other parts of the paper. Section 3 gives details of related work. Experimental design and analysis of results are presented in sections 4 and 5 respectively. Section 6 presents the conclusions of the study.

2 Theoretical Background

Let P be a program and S be the specification that P is intended to satisfy. A program P is said to be correct on input x if P(x) = S(x), where P(x) denotes the value computed by P on input x, and S(x) denotes the intended output for x. Otherwise, P is said to fail, with x as the failure-causing input. A failure is defined to be any deviation between the actual output and the specified output for a given input. A fault in a program is a cause which results in a failure.


A test-case is an input. A test-set is a set of test-cases. If a test-case is a failure-causing input for a program, then the program is said to fail on that test-case, and such a test-case is said to expose a fault. A test-set, T, is said to detect a fault if it includes at least one failure-causing test-case.

A test criterion can be viewed either as a generator or as an acceptor [38]. The two views are mathematically equivalent. The acceptor view can be thought of as a function C : P × S × T → {T, F}, where T and F denote logical 'true' and 'false' respectively. When a test criterion C is used as an acceptor, test-cases are added to a test-set until it satisfies C. In other words, the acceptor view uses the criterion as a stopping rule, checking whether the goal has been achieved after each test-case is added to the test-set. A criterion used as a generator can be considered as a function C : P × S → 2^T. Here, given a program and specification, the test criterion is used to generate the set of test-sets that satisfy the criterion. In other words, the criterion is used as a rule to select the test-cases that make up a test-set.

The control-flow testing criteria are based on the statements and Boolean decision points in the program. Boolean decisions consist of conditions connected by logical operators. Conditions are atomic predicates, which may be either Boolean variables or simple relational expressions. The following notation is used throughout the paper: ∧, ∨ and ¬ represent the 'and', 'or' and 'not' Boolean operations respectively.

A Boolean formula can be represented in various formats, such as Conjunctive Normal Form (CNF, also known as product-of-sums) and Disjunctive Normal Form (DNF, sum-of-products). As these normal forms may not be unique, there are also canonical representations which are unique for a given Boolean formula. One such canonical form is prime-DNF (PDNF), which is unique up to commutativity. Consider a given Boolean decision D and an elementary conjunction K (i.e., K ≡ X1 ∧ ... ∧ Xm, where each Xi is either a condition or its negation). K is an implicant of D if K implies D. An implicant K is said to be a prime implicant of D if there is no other implicant K1 of D such that K ∨ K1 = K1 (i.e., such that K implies K1). The PDNF form of a given Boolean decision D is the sum of all prime implicants of D. A condition in a Boolean decision D is said to be irredundant if the condition appears in the PDNF representation of D; otherwise it is said to be redundant. For example, B is redundant in the decision D ≡ A ∨ (A ∧ B), since the PDNF representation of D is A.
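Redundancy can be checked mechanically. A condition appears in the PDNF of D exactly when D semantically depends on it, so it suffices to test whether flipping that condition ever changes the value of D. The sketch below (our hypothetical illustration, under that characterization) encodes a decision as a predicate over a bitmask of condition values.

```java
import java.util.function.IntPredicate;

public class RedundancyCheck {
    // Condition i is redundant in decision d over n conditions iff flipping
    // bit i never changes the value of d, i.e. d does not depend on it;
    // equivalently, neither Xi nor ¬Xi appears in the PDNF of d.
    static boolean isRedundant(IntPredicate d, int n, int i) {
        for (int x = 0; x < (1 << n); x++) {
            if (d.test(x) != d.test(x ^ (1 << i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // D ≡ A ∨ (A ∧ B), with bit 0 = A and bit 1 = B; the PDNF of D is A.
        IntPredicate d = x -> ((x & 1) != 0) || (((x & 1) != 0) && ((x & 2) != 0));
        System.out.println(isRedundant(d, 2, 0)); // false: A is irredundant
        System.out.println(isRedundant(d, 2, 1)); // true:  B is redundant
    }
}
```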

2.1 Control-Flow Test Criteria

In this section, we review some existing control-flow test criteria [17, 23]. The basic control-flow testing criterion is Statement Coverage (SC), which requires that test data must execute every statement of the program at least once. This is evidently very weak, as it ignores the branching structure of the design and implementation. The Decision Coverage (DC) criterion requires that, in addition to SC, every possible branch must be executed at least once. In other words, the Boolean decisions that appear in the program must take both T and F values. The DC criterion treats a decision as a single node in the program structure, regardless of the complexity of the Boolean expression and the dependencies between the conditions constituting the decision. For critical real-time applications, where over half of the executable statements may involve Boolean expressions, the complexity of the expressions is of concern. Therefore other criteria have been developed to address the complexity of Boolean decisions.

The Condition Coverage (CC) criterion includes SC and requires that every condition in each decision must take both T and F values at least once. Although CC considers the individual conditions that make up a decision, it does not prescribe any rule that checks the outcome of the decision. The following two criteria improve on DC and CC by taking into account the values of both the decision and the conditions that occur in the decision.

• Decision/Condition Coverage (DCC): Every statement in the program has been executed at least once, and every decision and condition in the program has taken all possible outcomes at least once. DCC essentially requires that test data satisfy SC, DC and CC.

• Full Predicate Coverage (FPC): Every statement in the program has been executed at least once, and each condition in a decision has taken all possible outcomes where the value of the decision is directly correlated with the value of the condition [19]. This means that the test-set must include test-cases such that the value of the decision is different when the condition changes.

It can be observed that while both DCC and FPC combine the SC, DC and CC test criteria, FPC also checks for the influence of conditions on the outcome of the decision. However, when more than one condition is present in a decision, certain combinations of the other conditions can mask a condition if it is not checked independently. The Modified Condition/Decision Coverage (MC/DC) criterion [1, 2, 13, 14, 34], which is a mandatory requirement for testing avionics software [24], addresses this drawback and is defined as follows: Every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken on all possible outcomes at least once, and each condition has been shown to independently affect the decision's outcome. A condition is shown to independently affect a decision's outcome by varying just that condition while holding fixed all other possible conditions. The criterion requires at least one pair of test-cases for every condition.
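The independent-effect requirement can be made concrete as follows: in the unique-cause form of MC/DC sketched here (one reading of the definition above), a pair of test-cases demonstrates the independent effect of a condition if the two assignments differ only in that condition and produce different decision outcomes. A small illustrative sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class McdcPairs {
    // Unique-cause MC/DC pairs for condition i: assignments that differ only
    // in bit i and give different outcomes for the decision d.
    static List<int[]> mcdcPairs(IntPredicate d, int n, int i) {
        List<int[]> pairs = new ArrayList<>();
        for (int t = 0; t < (1 << n); t++) {
            int t2 = t ^ (1 << i);
            if (t < t2 && d.test(t) != d.test(t2)) pairs.add(new int[]{t, t2});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // D ≡ A ∧ (B ∨ C), with bit 0 = A, bit 1 = B, bit 2 = C.
        IntPredicate d = x -> ((x & 1) != 0) && (((x & 2) != 0) || ((x & 4) != 0));
        for (int i = 0; i < 3; i++)
            System.out.println("condition " + i + ": " + mcdcPairs(d, 3, i).size() + " pair(s)");
    }
}
```

An MC/DC-adequate test-set must contain at least one such pair for each condition, which is where the n + 1 lower bound in Table 1 (below) comes from.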


The MC/DC criterion takes into account only those situations where an independent change in a condition causes a change in the value of the decision. The shortcoming of this approach is that it does not check for the situations where a change in a condition should keep the value of the decision unchanged. To improve on this, a new criterion called Reinforced Condition/Decision Coverage (RC/DC) has been proposed in [28]. In addition to MC/DC, RC/DC requires that each condition be shown to independently keep the value of the decision, for decisions evaluating to both T and F.

One of the goals in defining the above criteria is to minimize the number of test-cases in a test-set while keeping effectiveness as high as possible. In order to completely verify a decision with n conditions, 2^n test-cases are required. It is therefore computationally expensive or infeasible to test all possible combinations as the number of conditions and decisions grows in a program. In this sense, MC/DC and RC/DC provide a practical solution that requires only a linear number of test-cases in terms of n. Table 1 summarizes the minimum and maximum number of test-cases in a test-set that can satisfy each criterion, and shows whether the criterion takes the decision and condition outcomes into account.

Criterion | Min. no. of test-cases | Max. no. of test-cases | Decision outcome | Condition outcome
DC        | 2                      | 2                      | √                | ×
CC        | 2                      | n + 1                  | ×                | √
DCC       | 2                      | n + 2                  | √                | √†
FPC       | 2                      | 2n                     | √                | √‡
MC/DC     | n + 1                  | 2n                     | √                | √⋆
RC/DC     | n + 1                  | 6n                     | √                | √⋆

Table 1: Summary of test criteria († uncorrelated, ‡ correlated, ⋆ correlated and independent)

As can be observed in Table 1, the average test-set size for CC, DCC and FPC is different from that for MC/DC and RC/DC. To ensure that it is the test-set property and not the test-set size that influences the effectiveness, extra test-cases were added to the CC, DCC and FPC test-sets to keep the average test-set size the same as that of MC/DC. The extra test-cases were generated at random.

The criteria described above form a hierarchy, as the criteria defined later include some of the criteria defined before them. For example, any test-set that satisfies MC/DC coverage will also satisfy DC coverage. Formally, a criterion C1 subsumes C2 if, for every program P and specification S, every test-set that satisfies C1 also satisfies C2. It might seem at first glance that if C1 subsumes C2, then C1 is guaranteed to be more effective than C2 for every program. However, this is not true, as it may happen that for some program P, specification S, and test generation strategy G, test-sets that only satisfy C2 expose more errors than those that satisfy C1. This issue has been explored formally by Frankl and Weyuker [5].

2.2 Types of Faults

One way to study fault detection effectiveness is to define a set of faults that occur in real-life situations and to use them to generate implementations from the specification. In the past, various types of faults have been defined and used to study the effectiveness of test criteria; see, for example, [16, 22, 25, 30, 33]. An analysis of various fault classes and the hierarchy among them is presented in [15, 27]. We have considered the following four faults in our empirical study:


• Associative Shift Fault (ASF): A Boolean expression is replaced by one in which the association between conditions is incorrect; for example, X ∧ (Y ∨ Z) could be replaced by X ∧ Y ∨ Z. As the '∧' operator has higher precedence than the '∨' operator, the second expression is equivalent to (X ∧ Y) ∨ Z.

• Expression Negation Fault (ENF): A Boolean expression is replaced by its negation; for example, X ∧ (Y ∨ Z) by X ∧ ¬(Y ∨ Z).

• Operator Reference Fault (ORF): A Boolean operator is replaced by another; for example, X ∧ Y by X ∨ Y.

• Variable Negation Fault (VNF): A Boolean condition X is replaced by its negation ¬X. VNF considers only atomic conditions and is thus different from ENF, which negates logical sub-expressions rather than individual conditions.

During the analysis, only one fault is introduced at a time. In a real-life scenario, more than one fault can be present at the same or other control points. However, research into the 'fault coupling effect' by Offutt [18] demonstrated that complex faults are coupled to simple faults in such a way that if a test data set detects all simple faults in a program, it will also detect a high percentage of the complex faults.
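These fault operators are straightforward to mechanize over an expression tree. The sketch below is a hypothetical illustration, not the paper's tool; it implements only the ORF operator, producing one mutant per operator occurrence, and uses Java 21 records and pattern matching. ENF and VNF follow the same recursive pattern, negating a sub-expression or a leaf respectively.

```java
import java.util.ArrayList;
import java.util.List;

public class FaultOperators {
    // Minimal expression tree: leaves are condition indices; inner nodes are AND/OR/NOT.
    sealed interface Expr permits Var, Not, Bin {}
    record Var(int i) implements Expr {}
    record Not(Expr e) implements Expr {}
    record Bin(boolean isAnd, Expr l, Expr r) implements Expr {}

    // Evaluate an expression over a bitmask of condition values.
    static boolean eval(Expr e, int x) {
        return switch (e) {
            case Var v -> (x & (1 << v.i())) != 0;
            case Not n -> !eval(n.e(), x);
            case Bin b -> b.isAnd() ? eval(b.l(), x) && eval(b.r(), x)
                                    : eval(b.l(), x) || eval(b.r(), x);
        };
    }

    // ORF: replace one ∧ by ∨ (or vice versa), one occurrence at a time.
    static List<Expr> orfMutants(Expr e) {
        List<Expr> out = new ArrayList<>();
        switch (e) {
            case Var v -> { /* a leaf has no operator to mutate */ }
            case Not n -> { for (Expr m : orfMutants(n.e())) out.add(new Not(m)); }
            case Bin b -> {
                out.add(new Bin(!b.isAnd(), b.l(), b.r()));  // swap ∧/∨ at this node
                for (Expr m : orfMutants(b.l())) out.add(new Bin(b.isAnd(), m, b.r()));
                for (Expr m : orfMutants(b.r())) out.add(new Bin(b.isAnd(), b.l(), m));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // D ≡ A ∨ (¬B ∧ C): the two ORF mutants are A ∧ (¬B ∧ C) and A ∨ (¬B ∨ C).
        Expr d = new Bin(false, new Var(0), new Bin(true, new Not(new Var(1)), new Var(2)));
        System.out.println(orfMutants(d).size()); // 2
        Expr m0 = orfMutants(d).get(0);
        System.out.println(eval(d, 0b001) != eval(m0, 0b001)); // true: (A=T,B=F,C=F) kills it
    }
}
```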

3 Related Work

In [8], Goodenough and Gerhart introduced the notions of reliability and validity of a testing method. Essentially, a method M is said to be reliable if all the test-sets generated using M have the same fault detection effectiveness. The obvious advantage of using a reliable method is that it is enough to select only one test-set using any algorithm. Unfortunately, no practical criterion satisfies this property. A similar concept of reliable test-sets was also proposed by Howden [11]. In the current testing literature, 'reliability' has a standard engineering meaning in terms of failure probability over an interval of time [9].

An approach known as fault-based testing [16, 22] attempts to alleviate the problem of test-set reliability by hypothesizing faults that can occur in the implementation. The reliability of a technique is then measured in terms of these hypothesized faults. Thus, fault-based testing provides a practical means of evaluating a test criterion. The empirical evaluation of testing techniques for Boolean specifications has been studied in [20, 21, 30, 33]. Harman et al. [10] have proposed a model for conducting empirical studies in software testing. The model suggests standardizing the process to avoid conflicts between the results of different empirical studies. Experimental comparisons of the effectiveness of data-flow and control-flow testing criteria are considered in [3, 4, 12]. The subjects of these experiments were programs.

In [3], the space of test-sets was sampled and appropriate statistical techniques were used to interpret the results. The selection of test-sets was done by first creating a universe of possible test-cases and then randomly selecting test-sets of a given size from that universe. The distribution of effectiveness with respect to coverage is included.

Various types of formal relationships between testing criteria have also been studied, as in [37]. For example, the 'subsumes' relationship is based on the properties of test criteria. However, subsumption of one criterion by another does not mean that it is more effective in detecting faults. Frankl and Weyuker [5] proved that the subsumes relation between software test adequacy criteria does not guarantee better fault-detecting ability in the prior testing scenario. Kuhn [15] has explored the relationship between various fault types. It is shown that ENFs are the weakest faults, in the sense that any test technique that catches stronger faults is very likely to find ENFs. These results have been further refined by Tsuchiya and Kikuno [27].

Effectiveness measures, as defined in [5], are stated in terms of a specification and matching program. As has been pointed out in [32], it is difficult to say what a typical or representative program and specification is. A similar difficulty arises when classifying faults into different categories based on severity. Therefore it is not possible to use this measure of effectiveness in general. To overcome the difficulty of assessing test criteria based on the severity of faults, a framework is defined in [7].

Another research question of relevance is whether minimization of the test-set size has any impact on fault detection effectiveness [35]. It was found that minimization of test-sets had minimal effect. The scope of this paper automatically covers this question, since we consider all possible test-sets; if minimization reduced effectiveness, that would be reflected in the overall result. Further, when test-sets satisfying a criterion are selected at random, choosing a test criterion with a low variation in effectiveness gives greater assurance of fault detection.

4 Experimental Design

The four main control-flow test criteria that we studied were CC, DCC, FPC and MC/DC. As the average test-set size for CC, DCC and FPC is less than that for MC/DC test-sets (see Table 1), extra test-cases were added to these test-sets to ensure that the average test-set size was equal. It is invalid to compare a test criterion, C1, that gives large test-sets with another criterion, C2, that selects small test-sets, as it is impossible to tell whether any reported benefit of C1 is due to its inherent properties or due to the fact that the test-sets considered are larger and hence more likely to detect faults. Another newly proposed test criterion, RC/DC [28], was also analysed and compared with the above-mentioned criteria. The number of test-cases in an RC/DC test-set can vary from n + 1 to 6n, where n is the number of conditions in the Boolean decision.


Because the test criteria of our study are based on decisions that appear in a program, we restricted our empirical evaluation to Boolean specifications and their implementations. At first sight, this may appear to be an over-simplification; however, note that for a given Boolean specification with n conditions, only one of the 2^(2^n) possible realizations is correct. Further complexity is added by the exponentially large number of test-sets satisfying a given criterion for a particular specification. If we assume that the size of a test-set is 2n, then the candidate space of test-sets has size C(2^n, 2n). On the other hand, working with Boolean specifications allows the generation of a large number of random decisions with varying sizes and properties. Though the number of test-sets is large, it is still possible to generate (possibly) all test-sets for typical decisions that usually occur in real programs. For instance, Chilenski and Miller [1] investigated the complexity of expressions in different software domains. It was found that complex Boolean expressions are more common in the avionics domain, with the number of conditions in a Boolean decision being as high as 30.

Our experimental analysis was done using software that was specifically written for this purpose. The prototype version of the software was written in the functional programming language Haskell. As we had to deal with an exponentially large number of test-sets, the prototype Haskell version failed to handle complex decisions. Therefore, the next version of the software was rewritten in Java. The software allows the analysis of a given or randomly generated Boolean decision. The experimental steps involved in the empirical analysis of the above-mentioned test criteria were as follows:

1. Select subject Boolean decisions.
2. Generate all faulty decisions.
3. Generate all test-sets for a criterion.
4. Compute the effectiveness of each test-set.
5. Compute the variation in effectiveness over all test-sets.

Steps 1 and 2 are elaborated in the following section. Section 4.2 presents step 3 and section 4.3 gives the evaluation metrics used in steps 4 and 5.

4.1 Subject Boolean Specification and Implementation

The subject Boolean decisions were taken from a research paper [33] and were also generated using a random generator. The Boolean specifications used in [33] were taken from an aircraft collision avoidance system. We also analysed 500 decisions that were generated randomly, with the number of conditions in the generated decisions varying from three to twelve.

The number of conditions is required as an input to the random Boolean decision generator. The seed was fixed for the random generator before starting the experiments. This was done to ensure that the experiments could be repeated, restarted and re-analysed by a third party if desired [10]. The Boolean decisions generated are guaranteed to have only irredundant conditions. A condition can appear more than once in a decision. The generation of decisions with irredundant conditions guarantees that a faulty implementation will not depend on any condition on which the original decision did not depend. Consider, for example, the Boolean decision D ≡ A ∧ (A ∨ B), in which only A is irredundant. An associative shift fault in D would change the decision to D_ASF ≡ A ∧ A ∨ B. But D_ASF is equivalent to A ∨ B, in which both A and B are irredundant.

We studied the fault detection effectiveness of the test criteria for Boolean specifications with respect to four types of faults: ASF, ENF, ORF and VNF. The faulty Boolean decisions (implementations) were generated for each type of fault by introducing one fault at a time. For example, consider the decision D ≡ A ∨ ¬B ∧ C. Given D as input, the ORF fault operator would generate two Boolean decisions: A ∧ ¬B ∧ C and A ∨ ¬B ∨ C. Only one fault is present in any one implementation. As mentioned before, in a real-life scenario, if more than one fault is present at the same control point or at another control point, there is still a high possibility of finding it [18].

4.2 Generation of All Possible Test-Sets

For a given Boolean decision with n conditions, there are 2^n entries in the matching truth table. As the objective of the experiments was an exhaustive analysis of all possible test-sets, all entries in the truth table were generated. Let the sets of test-cases for which a decision evaluates to T and F be ST and SF, respectively.

A DCC test-set can be generated by choosing one test-case each from ST and SF and adding further test-cases until condition coverage is achieved. Thus, all DCC test-sets were generated by extending the pairs taken from the cross product of ST and SF. A similar approach was used for generating FPC test-sets, by extending the pairs so that every test-set included pairs satisfying the correlation property. This methodology for generating test-sets was chosen to work around the problem of computational infeasibility. Note that the space of possible test-sets of size k is C(2^n, k); with k equal to 2n and n ≥ 5, it is infeasible to consider all combinations in practice (for n = 5, C(32, 10) = 64,512,240). CC test-sets were generated purely by adding random test-cases until the criterion was satisfied; the number of CC test-sets generated was equal to |ST| × |SF|.

In the case of the CC, DCC and FPC test-sets, it was ensured that the average size of the test-sets was the same as that of the MC/DC test-sets. The average was maintained by generating a random number between n and 2n as the size of each test-set. As all these numbers are equally likely to be chosen, the average size of the test-sets is statistically expected to be the same for all the test criteria.

A further evaluation was done to check whether a test-case pair in ST × SF satisfied the MC/DC property for any condition. A similar analysis was done for generating the RC/DC pairs, by taking all possible pairs from the ST and SF sets. In that case, the test-case pair was added to a list maintained for every condition appearing in the Boolean decision. Generation of all MC/DC and RC/DC test-sets was done by considering all possible combinations of test-case pairs satisfying the property for every condition. If it was not possible to generate all possible test-sets, the test-sets were generated by making a random selection from the pair list of every condition that satisfied the criterion.
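As an illustration of the ST × SF pairing strategy described above, the following sketch (our reconstruction with hypothetical names, not the authors' Java implementation) enumerates the true-points and false-points of a decision and grows one DCC test-set from each (t, f) pair by adding random test-cases until condition coverage holds.

```java
import java.util.*;
import java.util.function.IntPredicate;

public class DccGeneration {
    // Split the truth table of d into true-points ST and false-points SF, then
    // build one DCC test-set per (t, f) pair by adding random test-cases until
    // every condition has taken both values (condition coverage).
    static List<Set<Integer>> dccTestSets(IntPredicate d, int n, Random rnd) {
        List<Integer> st = new ArrayList<>(), sf = new ArrayList<>();
        for (int x = 0; x < (1 << n); x++) (d.test(x) ? st : sf).add(x);
        List<Set<Integer>> sets = new ArrayList<>();
        for (int t : st) for (int f : sf) {
            Set<Integer> ts = new LinkedHashSet<>(List.of(t, f));
            while (!coversConditions(ts, n)) ts.add(rnd.nextInt(1 << n));
            sets.add(ts);
        }
        return sets;
    }

    // CC holds when every condition bit is 1 in some test-case and 0 in another.
    static boolean coversConditions(Set<Integer> ts, int n) {
        int mask = (1 << n) - 1, ones = 0, zeros = 0;
        for (int t : ts) { ones |= t; zeros |= ~t; }
        return (ones & mask) == mask && (zeros & mask) == mask;
    }

    public static void main(String[] args) {
        IntPredicate d = x -> ((x & 1) != 0) && ((x & 2) != 0); // A ∧ B
        System.out.println(dccTestSets(d, 2, new Random(42)).size()); // |ST| × |SF| = 1 × 3
    }
}
```

Fixing the Random seed makes the generation repeatable, in line with the requirement that the experiments can be re-run by a third party.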

4.3 Evaluation Metrics

Let D be a Boolean decision with n conditions, and let the number of faulty decisions generated using a given fault operator F for D be k. The effectiveness as a percentage, E_T, of a test-set T satisfying a criterion C, with respect to F and D, is given by

    E_T(C, D, F, T) = (m / k) × 100,

where m is the number of faulty decisions identified by T to be different from D. The effectiveness of a test criterion, E_F, with respect to a fault is the average effectiveness over all test-sets:

    E_F(C, D, F) = ( Σ_{T satisfies C} E_T(C, D, F, T) ) / (no. of T satisfying C)

Furthermore, the tolerance of a criterion [29], defined as the variation in effectiveness, is the variance of E_T, given by the following formula:

    Var(C, D, F) = ( Σ_{T satisfies C} (E_F(C, D, F) − E_T(C, D, F, T))² ) / ((no. of T satisfying C) − 1)
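Both metrics are straightforward to compute once the per-test-set effectiveness values E_T are available. A minimal illustrative sketch (ours), using the sample-variance denominator from the formula above; it assumes at least two test-sets:

```java
import java.util.List;

public class Metrics {
    // Average effectiveness E_F over all test-sets, where eff.get(i) is the
    // effectiveness E_T (in percent) of the i-th C-adequate test-set.
    static double mean(List<Double> eff) {
        return eff.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    // Tolerance: sample variance with denominator (no. of test-sets) − 1.
    static double variance(List<Double> eff) {
        double m = mean(eff), s = 0;
        for (double e : eff) s += (m - e) * (m - e);
        return s / (eff.size() - 1);
    }

    public static void main(String[] args) {
        List<Double> eff = List.of(100.0, 70.0, 90.0, 80.0);
        System.out.printf("E_F = %.1f, Var = %.1f%n", mean(eff), variance(eff));
    }
}
```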

5 Analysis and Results

The effectiveness of each test criterion was measured by computing the average effectiveness of all test-sets. The graphs presented in this section show the percentage of faults plotted against the percentage of test-sets that detected them. For example, in Figure 2, for a given decision, on average 48% of CC test-sets were found to be between 90–100% effective. The MC/DC criterion was found to be highly effective for ENF, ORF and VNF faults (see section 2.2), but a large variation in effectiveness was observed for ASF faults. In particular, it was observed that around 30% of MC/DC test-sets detected less than 10% of ASF faults. As is evident from Figures 2, 3 and 4, MC/DC and RC/DC test-sets were always found to have more than 50% effectiveness. On the other hand, CC, DCC and FPC had fault detection effectiveness as low as 10%.

[Figure 2: Distribution of effectiveness for expression negation fault. Percentage of test-sets against percentage of faults detected, for CC, DCC, FPC, MC/DC and RC/DC.]

As described in [15], expression negation faults are the weakest types of faults. It is therefore expected that ENF detection will be high, as can be observed in Figure 2. The variation in effectiveness of test-sets for ORF and VNF faults is similar, as they are close in the fault relation hierarchy (Figures 3 and 4). If, for cost reasons, few test-sets are to be selected, then the MC/DC criterion is more reliable for detecting faults. But as it is more difficult to generate MC/DC test-sets compared to the other criteria, we may still detect a high number of faults by choosing a large number of test-sets satisfying the other criteria.

[Figure 3: Distribution of effectiveness for operator reference fault.]

[Figure 4: Distribution of effectiveness for variable negation fault.]

As shown in Figure 5, the variation in effectiveness for ASF faults is different from the other faults. This is due to two reasons: (a) since ASF is the strongest fault class, most test techniques are unlikely to detect them; and (b) the number of ASF faults in a typical Boolean decision is usually between one and five, leading to an uneven distribution on a percentage scale. For example, if a Boolean decision has only one ASF fault, the effectiveness of a test-set would be either 0 or 100%. Here, DCC and FPC have similar variation, while MC/DC and RC/DC are also similar.

[Figure 5: Distribution of effectiveness for associative shift fault.]

The distribution of effectiveness for RC/DC was found to be similar to that for MC/DC, although the number of test-cases in RC/DC test-sets is higher. It can therefore also be concluded that the test-set property has more influence on fault detection effectiveness than the size of the test-set.

5.1 Variation in Effectiveness with Number of Conditions

5.1.1 Average Effectiveness

As noticed earlier, there is an exponential increase in the number of possible combinations and realisations with the increase in the number of conditions in a Boolean decision. However, the number of test-cases in a test-set remains linear with respect to the number of conditions. To answer the question of whether the increase in the number of conditions influences the effectiveness of the test criteria, we also studied the average effectiveness of the different criteria with respect to the number of conditions. In general, the MC/DC and RC/DC criteria were found to be effective independently of the number of conditions in the Boolean decision. However, Figure 6 shows that the effectiveness of the other criteria decreased linearly as the number of conditions increased. The average effectiveness was found to be overlapping for DCC with FPC and for MC/DC with RC/DC.

[Figure 6: Distribution of average effectiveness w.r.t. number of conditions for expression negation fault.]

For ENF, being the weakest fault, there is less decrease in the average effectiveness, as shown in Figure 6. In this case, the average effectiveness of all the criteria remains above 65%. There is a sharp drop in the average effectiveness for ORF and VNF in the case of the CC, DCC and FPC criteria (see Figures 7 and 8). As the number of conditions increases, the size of the truth table increases exponentially, leading to fewer changes in the truth table due to faults. This makes the faults indistinguishable for simple techniques which do not take into consideration the independent correlation between conditions. The effectiveness remained stable and overlapping for MC/DC and RC/DC.

[Figure 7: Distribution of average effectiveness w.r.t. number of conditions for operator reference fault.]

[Figure 8: Distribution of average effectiveness w.r.t. number of conditions for variable negation fault.]

The variation in effectiveness with respect to the number of conditions for associative shift faults was slightly different in comparison to the ENF, ORF and VNF faults. Figure 9 shows that the effectiveness of MC/DC and RC/DC increased with the number of conditions. The reason is that as the number of conditions rises, the number of associations is likely to grow, leading to a higher probability of fault detection. For example, one ASF fault leads to 30% effectiveness, while five ASF faults lead to more than 50% effectiveness. RC/DC was found to have the maximum average effectiveness. The other criteria do not show any improvement in effectiveness.

[Figure 9: Distribution of average effectiveness w.r.t. number of conditions for associative shift fault.]

5.1.2 Tolerance

We also studied the stability of a test criterion with respect to the number of conditions. Figures 10–13 depict the tolerance with respect to the number of conditions. Ideally, a test criterion should have a high average effectiveness along with a low standard deviation (high tolerance). For expression negation faults, the standard deviation of effectiveness for CC, DCC and FPC increased with the increase in the number of conditions, while MC/DC and RC/DC had a low and almost constant standard deviation. CC had the highest variation in effectiveness. The variation was overlapping for DCC with FPC and for MC/DC with RC/DC.

[Figure 10: Distribution of standard deviation of effectiveness w.r.t. number of conditions for expression negation fault.]

In the case of ORF and VNF, the standard deviation was observed to decrease with the increase in the number of conditions for CC, DCC and FPC, while it remained constant for MC/DC and RC/DC (see Figures 11 and 12). But as the average effectiveness for CC, DCC and FPC decreases while that of MC/DC and RC/DC remains constant, the latter can be considered more reliable (see Figures 7 and 8).

[Figure 11: Distribution of standard deviation of effectiveness w.r.t. number of conditions for operator reference fault.]

[Figure 12: Distribution of standard deviation of effectiveness w.r.t. number of conditions for variable negation fault.]

The standard deviation of effectiveness with the increase in the number of conditions remained high for associative shift faults in comparison to the other faults (see Figure 13). The behaviour of MC/DC and RC/DC test-sets shows only a marginal decrease in standard deviation with increasing average effectiveness (cf. Figure 9).

[Figure 13: Distribution of standard deviation of effectiveness w.r.t. number of conditions for associative shift fault.]

Comparison of Figures 10–12 with Figure 13 shows that the standard deviation is high for all the test criteria for ASF faults, while MC/DC and RC/DC show a low standard deviation for ENF, ORF and VNF faults. This implies that the number of test-sets required to detect all ASF faults is higher than for the other types of faults.

5.2 Limitations of the Study

The controlled experiments conducted in this study evaluate the fault detection effectiveness of control-flow test criteria only for abstract Boolean decisions. In this sense, the study is artificial in nature. In real programs, more than one control point (Boolean decision) is present, and it is not sufficient for faults merely to be triggered at a control point; they must also propagate and cause a failure. Therefore, it is important to study fault propagation as well. The fault detection effectiveness was studied by introducing only one fault at a time. Though there is some empirical evidence, a more elaborate study is required to evaluate the effect of multiple faults on the effectiveness of a test criterion. In real-life specifications and implementations, the conditions are often coupled. Also, the presence of multiple control points may cause some combinations of conditions to be infeasible. Both coupling and infeasibility of conditions may reduce the number of possible test-sets significantly, and might therefore prune out effective test-sets. This also needs to be investigated further.

6 Conclusions

An empirical evaluation of the effectiveness of four major control-flow test criteria (CC, DCC, FPC and MC/DC) has been performed. An analysis of the tolerance of a newly proposed test criterion, RC/DC, has also been included for comparison. The average size of the test-sets generated was kept the same to ensure that any difference in effectiveness is due to the properties of the test-sets. The study was based on exhaustive generation of all possible test-sets (wherever possible) against ENF, ORF, VNF and ASF faults. In the circumstances where it was not possible to analyse all the test-sets, a large set of test-sets was selected from only those test-sets that satisfied the test criterion. Boolean decisions in programs, with varying numbers of conditions, were considered as the subjects of study. An analysis of the influence of the number of conditions on the average fault detection effectiveness and on the standard deviation of effectiveness was undertaken for each test criterion. It was also observed that the properties of test-sets had more influence on fault detection effectiveness than the size of the test-sets.

The empirical results for the DCC and FPC test criteria were found to be similar. This is due to the fact that a test-set satisfying the DCC property is likely to satisfy the FPC property, and vice versa. In general, MC/DC was found to be effective in detecting faults. However, MC/DC test-sets had a large variation in fault detection effectiveness for ASF faults. The average effectiveness for CC, DCC and FPC was found to decrease with an increase in the number of conditions, while that of MC/DC remained almost constant. The standard deviation of effectiveness varied for CC, DCC and FPC but remained constant for MC/DC.

ENF was found to be the weakest fault, with all the test criteria giving good results. The distribution of effectiveness was uniform for ORF and VNF. ASF, being the strongest fault, displayed a different behaviour on all metrics. In comparison to MC/DC, the RC/DC criterion did not show a great improvement in effectiveness and tolerance, although the number of test-cases in RC/DC test-sets is greater than for the other criteria. The MC/DC test criterion was thus found to be more stable and promising for practical applications. If, for cost reasons, only a small number of test-sets can be selected, it might be better to choose MC/DC test-sets. However, as it is more expensive to generate MC/DC test-sets, high fault detection is also possible by selecting a greater number of test-sets satisfying the other criteria.

References

[1] J. Chilenski and S. Miller. Applicability of modified condition/decision coverage to software testing. Software Engineering Journal, pages 193–200, September 1994.

[2] A. Dupuy and N. Leveson. An empirical evaluation of the MC/DC coverage criterion on the HETE-2 satellite software. In DASC: Digital Aviation Systems Conference, Philadelphia, October 2000.

[3] P. G. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In ACM SIGSOFT Software Engineering Notes, Sixth International Symposium on Foundations of Software Engineering, volume 23, pages 153–162, November 1998.

[4] P. G. Frankl and S. Weiss. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Transactions on Software Engineering, 19(8):774–787, August 1993.

[5] P. G. Frankl and E. Weyuker. A formal analysis of the fault-detecting ability of testing methods. IEEE Transactions on Software Engineering, 19(3):202–213, March 1993.

[6] P. G. Frankl and E. J. Weyuker. Provable improvements on branch testing. IEEE Transactions on Software Engineering, 19(10):962–975, October 1993.

[7] P. G. Frankl and E. J. Weyuker. Testing software to detect and reduce risk. The Journal of Systems and Software, pages 275–286, 2000.

[8] J. B. Goodenough and S. L. Gerhart. Towards a theory of test data selection. IEEE Transactions on Software Engineering, 1(2):156–173, June 1975.

[9] D. Hamlet. What can we learn by testing a program? In ACM SIGSOFT International Symposium on Software Testing and Analysis, Clearwater Beach, Florida, USA, pages 50–52. ACM, 1998.


[10] M. Harman, R. Hierons, M. Holcombe, B. Jones, S. Reid, M. Roper, and M. Woodward. Towards a maturity model for empirical studies of software testing. In Fifth Workshop on Empirical Studies of Software Maintenance (WESS'99), Keble College, Oxford, UK, September 1999.

[11] W. E. Howden. Reliability of the path analysis testing strategy. IEEE Transactions on Software Engineering, 2(3):208–215, September 1976.

[12] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria. In 16th International Conference on Software Engineering (ICSE-16), pages 191–200, 1994.

[13] R. Jasper, M. Brennan, K. Williamson, B. Currier, and D. Zimmerman. Test data generation and feasible path analysis. In SIGSOFT: International Symposium on Software Testing and Analysis, Seattle, Washington, USA, pages 95–107. ACM, 1994.

[14] J. A. Jones and M. J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. In International Conference on Software Maintenance (ICSM'01), Florence, Italy, pages 92–101. IEEE, November 2001.

[15] D. R. Kuhn. Fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 8(4):411–424, October 1999.

[16] L. J. Morell. A theory of fault-based testing. IEEE Transactions on Software Engineering, 16(8):844–857, August 1990.

[17] G. Myers. The Art of Software Testing. Wiley-Interscience, 1979.

[18] A. J. Offutt. Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology, 1(1):5–20, 1992.

[19] A. J. Offutt, Y. Xiong, and S. Liu. Criteria for generating specification-based tests. In Fifth IEEE International Conference on Engineering of Complex Computer Systems (ICECCS'99), Las Vegas, Nevada, USA, pages 119–129, October 1999.

[20] A. Paradkar and K.-C. Tai. Test-generation for Boolean expressions. In Sixth International Symposium on Software Reliability Engineering (ISSRE'95), Toulouse, France, pages 106–115, 1995.

[21] A. Paradkar, K.-C. Tai, and M. A. Vouk. Automatic test-generation for predicates. IEEE Transactions on Reliability, 45(4):515–530, December 1996.


[22] D. J. Richardson and M. C. Thompson. An analysis of test data selection criteria using the RELAY model of fault detection. IEEE Transactions on Software Engineering, 19(6):533–553, June 1993.

[23] M. Roper. Software Testing. McGraw-Hill Book Company Europe, 1994.

[24] RTCA/DO-178B. Software considerations in airborne systems and equipment certification. Washington DC, USA, 1992.

[25] K.-C. Tai. Theory of fault-based predicate testing for computer programs. IEEE Transactions on Software Engineering, 22(8):552–562, August 1996.

[26] Software testing FAQ: Testing tools suppliers. http://www.testingfaqs.org/tools.htm.

[27] T. Tsuchiya and T. Kikuno. On fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 11(1):58–62, January 2001.

[28] S. A. Vilkomir and J. P. Bowen. Reinforced condition/decision coverage (RC/DC): A new criterion for software testing. In D. Bert, J. P. Bowen, M. Henson, and K. Robinson, editors, ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 295–313. Springer-Verlag, January 2002.

[29] S. A. Vilkomir, K. Kapoor, and J. P. Bowen. Tolerance of control-flow testing criteria. Technical Report SBU-CISM-03-01, SCISM, South Bank University, London, UK, March 2003.

[30] M. A. Vouk, K. C. Tai, and A. Paradkar. Empirical studies of predicate-based software testing. In 5th International Symposium on Software Reliability Engineering, pages 55–64, 1994.

[31] E. Weyuker. Can we measure software testing effectiveness? In First International Software Metrics Symposium, Baltimore, USA, pages 100–107. IEEE, 21–22 May 1993.

[32] E. Weyuker. Thinking formally about testing without a formal specification. In Formal Approaches to Testing of Software (FATES'02), A Satellite Workshop of CONCUR'02, Brno, Czech Republic, pages 1–10, 24 August 2002.

[33] E. Weyuker, T. Goradia, and A. Singh. Automatically generating test data from a Boolean specification. IEEE Transactions on Software Engineering, 20(5):353–363, May 1994.

[34] A. White. Comments on modified condition/decision coverage for software testing. In IEEE Aerospace Conference, Big Sky, Montana, USA, volume 6, pages 2821–2828, 10–17 March 2001.

[35] W. Wong, J. Horgan, A. Mathur, and A. Pasquini. Test set size minimization and fault detection effectiveness: a case study in a space application. In Twenty-First Annual International Computer Software and Applications Conference (COMPSAC'97), pages 522–528, 13–15 August 1997.

[36] H. Zhu. Axiomatic assessment of control flow-based software test adequacy criteria. Software Engineering Journal, pages 194–204, September 1995.

[37] H. Zhu. A formal analysis of the subsume relation between software test adequacy criteria. IEEE Transactions on Software Engineering, 22(4):248–255, April 1996.

[38] H. Zhu, P. Hall, and H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):336–427, December 1997.

