2013 IEEE Sixth International Conference on Software Testing, Verification and Validation

MFL: Method-Level Fault Localization with Causal Inference

Gang Shu, Boya Sun, Andy Podgurski, Feng Cao
EECS Department, Case Western Reserve University, Cleveland, OH 44106
@case.edu

Abstract—Recent studies have shown that the use of causal inference techniques for reducing confounding bias improves the effectiveness of statistical fault localization (SFL) at the level of program statements. However, with very large programs and test suites, the overhead of statement-level causal SFL may be excessive. Moreover, cost evaluations of statement-level SFL techniques generally are based on a questionable assumption—that software developers can consistently recognize faults when examining statements in isolation. To address these issues, we propose and evaluate a novel method-level SFL technique called MFL, which is based on causal inference methodology. In addition to reframing SFL at the method level, our technique incorporates a new algorithm for selecting covariates to use in adjusting for confounding bias. This algorithm attempts to ensure that such covariates satisfy the conditional exchangeability and positivity properties required for identifying causal effects with observational data. We present empirical results indicating that our approach is more effective than four method-level versions of well-known SFL techniques and that our confounder selection algorithm is superior to two alternatives.

in non-increasing order by their suspiciousness scores, until a fault is found. That is, developers examine the program elements that are most strongly associated with observed failures, as indicated by their suspiciousness scores, on the assumption that they are most likely to have caused the failures. SFL techniques are usually evaluated empirically in terms of the number of lines of source code that developers examine, in non-increasing order of their suspiciousness scores, to find known faults. Baah et al [8, 9] pointed out that most SFL techniques (e.g., [2-7]) produce biased and inaccurate measures of the actual causal effect [10, 11] of individual program elements on the occurrence of failures, which Baah et al called an element's failure-causing effect. This is because the techniques are applied to observational data and do not adjust for confounding bias (or confounding) [10, 11], which is a well-known form of bias due to common causes of a "treatment" or "exposure", on one hand, and an outcome of interest, on the other. In the case of typical SFL techniques, the "treatment" is execution of a particular program element, and the outcome of interest is program failure. Confounding bias may seriously distort the suspiciousness score of a correct statement s′, for example, because a faulty statement s causes s′ to be executed erroneously. In experiments, randomized treatment assignment is used to prevent the effect of an experimental treatment from being confounded by other factors [11]. The units or subjects to receive the treatment and those not to receive it (the controls) are chosen randomly, so that treatment assignment is statistically independent of possible confounding factors. Often, however, randomized experiments are impractical to conduct or involve unrepresentative subjects or treatment conditions. In such cases, researchers must rely on observational data [12].
Fortunately, researchers in such diverse fields as computer science, statistics, epidemiology, and the social sciences have gradually developed a principled methodology for making causal inferences from observational data [10, 11]. (2011 ACM Turing Award recipient Judea Pearl made seminal contributions to this work.) A key assumption needed for identifying causal effects from observational data is called "conditional exchangeability" [13] of the treated units or subjects with the controls. (It is also called "conditional ignorability" of treatment assignment [11].) Consider a binary treatment variable T and an outcome variable Y, both defined over a given population. Let the pre-treatment potential outcomes under the treatment and under the control regime (no treatment or standard treatment) be denoted Y^{t=1} and Y^{t=0},

Keywords—statistical fault localization; statistical debugging; causal inference; causal graph; confounding bias; confounder selection; positivity; dynamic call graph; dynamic data dependences

I. INTRODUCTION

Fault localization is the task of locating a fault (defect) in program code, given one or more failure-causing inputs. It is often the most difficult and time-consuming part of program debugging [1]. To reduce its cost, a number of statistical fault localization (SFL) or statistical debugging techniques have been proposed (e.g., [2-7]), which are applicable when pass/fail labels and profile data are available for a set of executions. SFL techniques typically involve the following steps: (1) execution of a faulty program, by developers or end users, on a set of test inputs or operational inputs,¹ respectively, that induces both successful executions and failures; (2) recording, for the same inputs, execution profiles that characterize code coverage or related events in detail; (3) manual or automatic labeling of executions as passing or failing; (4) calculation of statistical measures, called suspiciousness scores or metrics, of the association between the execution of individual program elements, such as statements or predicates, and the occurrence of program failures; (5) manual examination of the program elements by developers,

¹ We shall use the term "test" to refer to both artificial test inputs and operational inputs (and the executions they induce).

978-0-7695-4968-2/13 $26.00 © 2013 IEEE DOI 10.1109/ICST.2013.31


respectively. (Note that for any individual, only one of the two potential outcomes Y^{t=1} and Y^{t=0} is actually observed;² the other one is counterfactual—counter to the facts.) Finally, let Z be a set of covariates. Conditional exchangeability holds with respect to T, Y, and Z if the potential outcomes Y^{t=1} and Y^{t=0} are conditionally independent of the random variable T given the values of the covariates in Z, which is denoted (Y^{t=1}, Y^{t=0}) ⊥ T | Z. Intuitively, this means that if the value of Z is known then also knowing the value of T provides no additional information about the values of Y^{t=1} and Y^{t=0}. A causal DAG is a directed acyclic graph representing nonstatistical assumptions about the possible causal relations among a set of variables. Causal inference researchers have derived precise conditions, in terms of a causal DAG, for selecting a set Z of covariates that ensures conditional exchangeability is satisfied [10] (see Section II). By conditioning on (e.g., stratifying on) the values of the covariates in Z, an unconfounded estimate of the causal effect of interest, e.g., of the risk difference Pr[Y^{t=1} = 1] − Pr[Y^{t=0} = 1], can be computed.

Causal inference methodology provides powerful tools for investigating the causes of program failures. Moreover, certain graph representations of programs, such as program dependence graphs [14] and call graphs [15], provide a natural basis for deriving graphs suitable for use in causal inference. Baah et al [8, 9] showed how causal inference methodology could be used to improve the performance of statement-level SFL by reducing confounding bias. Their approach involves deriving a causal graph that reflects program dependences between statements [14] and then using it to select confounding covariates to adjust for statistically. For example, their initial approach [8] estimates the failure-causing effect of a program statement s using a linear regression model

Y = α + τT + βC + ε

where Y is a binary outcome variable that is 1 for a program execution if it fails, T is a binary "treatment" variable that is 1 if statement s is covered by the execution, C is a binary variable that is 1 if the forward control dependence predecessor of s [16] is covered, α is an intercept, τ and β are coefficients, and ε is an error term. Here the coefficient τ represents the average treatment effect E[Y^{t=1}] − E[Y^{t=0}] = Pr[Y^{t=1} = 1] − Pr[Y^{t=0} = 1], and its fitted value τ̂ is used as a suspiciousness score for statement s. The covariate C is a confounder and is included in the model to eliminate or reduce confounding bias. This model can be formally justified based on Pearl's Backdoor Criterion [10] for control of confounding (see Section II) and based on the very strong assumptions that binary coverage variables and a program's forward control dependences suffice to represent failure-causation in the program under analysis. Naturally, the latter assumptions don't hold for all defects.
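As an illustrative sketch (not the authors' implementation), the model above can be fit by ordinary least squares, with the fitted coefficient on T serving as the suspiciousness score. The per-execution coverage data below is hypothetical:

```python
import numpy as np

# Hypothetical per-execution data: T = statement s covered, C = its forward
# control dependence predecessor covered, Y = execution failed.
T = np.array([1, 1, 0, 0, 1, 0], dtype=float)
C = np.array([1, 0, 1, 0, 0, 1], dtype=float)
Y = np.array([1, 1, 0, 0, 1, 0], dtype=float)

# Design matrix [1, T, C] for the model Y = alpha + tau*T + beta*C + eps.
X = np.column_stack([np.ones_like(T), T, C])
(alpha, tau, beta), *_ = np.linalg.lstsq(X, Y, rcond=None)
print(round(tau, 3))  # tau-hat: the estimated failure-causing effect of s
```

In this toy data every failure coincides with coverage of s and C carries no additional signal, so the fit attributes the full effect to τ.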

Baah et al subsequently proposed an approach to causal SFL [9] that addresses a program's data dependences as well as its control dependences and that uses the well-known causal inference technique of matching [11] to create comparison groups of executions that are relatively balanced with respect to local dependences. Gore and Reynolds extended Baah et al's original causal regression model by adjusting for predicate outcomes rather than only for predicate coverage [17]. Since Baah et al's techniques involve fitting a regression model or computing a matching for each statement in a program, their computational and profile storage overheads may be considerable for large programs. (For similar reasons, Liblit et al sampled runtime predicate values rather than collecting complete profiles [4].) An even greater concern, which applies to all statement-level SFL techniques, is the validity of the implicit assumption that software developers can consistently recognize faults by examining only "suspicious" program statements. Our experience suggests that in many if not most cases, recognizing a fault requires a developer to familiarize (or refamiliarize) themselves with at least its local context(s), e.g., the subprogram unit containing it. That is, a developer often must read and understand the code surrounding a fault to know that it is a fault. The cost of this additional effort is not accounted for in evaluations of statement-level SFL techniques. One way to reduce the costs of collecting, storing, and analyzing execution profiles in SFL and also to better account for the developer effort it entails is to assign suspiciousness scores to program units or regions of larger size than individual statements. It is natural to consider assigning suspiciousness scores to subprograms, such as methods, functions, or procedures. Subprograms are created partly to make a program easier to understand and partly to facilitate code reuse and maintenance. They are often documented.
A well-designed subprogram is logically cohesive and has low coupling with other subprograms [18]. These properties usually make the purpose of an individual subprogram easier to understand than the purpose of an individual statement. Also, many (though not all) multi-statement faults are contained within single subprograms. We contend that all of these factors make subprograms more realistic initial targets than statements for suspiciousness scoring and for inspection by developers. For these reasons, we propose a novel SFL technique, called MFL, that seeks to localize faults in an object-oriented program to individual methods. (The ideas of MFL also apply to non-OO subprograms.) In order to reduce the rank-distorting effects of confounding bias on suspiciousness scores, MFL is based on causal inference methodology. To the best of our knowledge, MFL is the first SFL technique designed specifically to locate faulty methods or subprograms in large and complex software using causal inference methodology. To achieve our goal of reducing profiling, storage, and analysis overhead, MFL employs a causal graph based on a program's dynamic call graph, augmented to reflect inter-method data dependences

² It is assumed that either treatment or control is applied to an individual, but not both.


outcomes and the actual treatment are independent, given the measured covariates; (3) positivity—the conditional probability Pr[T = t | Z = z] of receiving each treatment value t, given the measured covariates Z = z, is nonzero for all z such that Pr[Z = z] > 0. Conditional exchangeability ensures that the treatment and control groups are comparable. It is not an empirically checkable property, but it may be assumed in a given study based on background knowledge. Positivity holds if there are both treated and control units in every stratum of covariate values. This is necessary to permit the effect of treatment to be estimated in each stratum. Positivity may be violated due to data sparsity, but it is a checkable property [20].

that are not reflected in the call graph. (This is done instead of employing a graph derived from a statement-level program dependence graph as in [8, 9].) This change required us to develop and evaluate a new, method-level algorithm for selecting confounding variables to adjust for. Moreover, the algorithm is designed to satisfy, in addition to the conditional-exchangeability property, another important precondition for making valid causal inferences called positivity [19], which was not considered in previous work on causal SFL (see Section V.B). Satisfying both of these properties is challenging because it involves searching the causal graph for an admissible set of confounders. We also report on an empirical evaluation of MFL, whose results suggest that (a) it is more effective, in terms of the quality of the method rankings it produces, than method-level versions of four well-known SFL techniques and that (b) our confounder selection algorithm is superior to two alternative algorithms. The main contributions of the paper are as follows:
• A method-level statistical fault localization technique, MFL, that employs causal inference methodology to improve localization by reducing confounding bias.
• A new algorithm for selecting confounding covariates that is based on causal analysis of a program's dynamic call graph and inter-method data dependences and that seeks to satisfy the key properties of conditional exchangeability and positivity.
• Results of an empirical comparison of MFL with four well-known SFL techniques adapted to operate at the method level and with two alternative confounder selection algorithms. This study also extends the range of subject programs to which causal SFL has been applied.

B. Causal DAGs and the Backdoor Criterion

A causal DAG or causal diagram is a graphical representation of causal relationships among variables [10]. Vertices represent random variables. There is a directed edge or arrow X → Y if and only if X is assumed to be an actual or potential cause of Y. Causal DAGs embody nonstatistical prior knowledge or assumptions about the causal relationships that are relevant to a problem, and hence they guide the process of making causal inferences from data. A number of theoretical results characterize conditions under which causal effects can be identified, in terms of properties of a causal DAG. Perhaps the best known of these is Pearl's Backdoor Criterion [21], which implies that confounding bias can be eliminated by conditioning on a set of variables that block statistical associations made possible by "backdoor paths" in a causal DAG from the treatment variable to the outcome variable. By convention, paths in a causal DAG may contain either forward or backward arrows; a backdoor path begins with a backward arrow T ← X and hence is not a causal (directed) path.

Definition [10]: A path P in a causal DAG G is said to be d-separated or blocked by a set of variables Z if and only if either
a. P contains a chain X → M → Y or a fork X ← M → Y such that the middle variable M is in Z, or
b. P contains a collider X → M ← Y such that the middle variable M is not in Z and no descendant of M is in Z.

In the chain X → M → Y, M is an intermediate variable on a causal path from X to Y. In the fork X ← M → Y, M is a common cause of X and Y on a backdoor path between them. In both cases, fixing the value of M makes X and Y independent. In the collider X → M ← Y, M is a common effect of X and Y on a non-causal, non-backdoor path between them. Fixing its value or that of a descendant actually creates a statistical association between X and Y.³
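The collider case runs against intuition, and it can be checked by exact enumeration. The sketch below (a toy distribution invented for illustration, echoing the sprinkler example in the footnote) verifies that for a collider X → M ← Y with M = X OR Y, the variables X and Y are marginally independent but become associated once M is fixed:

```python
from itertools import product

def joint_collider():
    """Exact joint distribution over (X, Y, M) for a collider X -> M <- Y:
    X and Y are independent fair coins and M = X OR Y (the lawn is wet if
    it rained or the sprinkler was on)."""
    dist = {}
    for x, y in product([0, 1], repeat=2):
        outcome = (x, y, x | y)
        dist[outcome] = dist.get(outcome, 0.0) + 0.25
    return dist

def p(dist, cond):
    """Probability that every (index, value) pair in cond holds."""
    return sum(pr for o, pr in dist.items() if all(o[i] == v for i, v in cond))

dist = joint_collider()
# Marginally, X and Y are independent:
assert abs(p(dist, [(0, 1), (1, 1)]) - p(dist, [(0, 1)]) * p(dist, [(1, 1)])) < 1e-12
# Conditioning on the collider M = 1 creates an association:
pm = p(dist, [(2, 1)])
pxy = p(dist, [(0, 1), (1, 1), (2, 1)]) / pm   # Pr[X=1, Y=1 | M=1] = 1/3
px = p(dist, [(0, 1), (2, 1)]) / pm            # Pr[X=1 | M=1] = 2/3
py = p(dist, [(1, 1), (2, 1)]) / pm            # Pr[Y=1 | M=1] = 2/3
print(abs(pxy - px * py) > 1e-6)  # True: X and Y are dependent given M
```

Given M = 1, learning that the sprinkler was off makes rain certain, exactly as in the footnote's example.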

II. BACKGROUND

A. Causal Inference from Observational Data

We now briefly present some fundamental ideas and results of causal inference theory. Ideal randomized experiments can be used to identify causal effects because they are expected to produce unconditional exchangeability of the treated and the untreated groups [13]. This means that the units/subjects assigned to treatment could be exchanged, prior to treatment, with those assigned to control, without altering the final causal effect measure (except due to chance). With an observational study, in which treatment may not be assigned randomly, causal inference requires assuming that the study can be viewed as a conditionally randomized experiment, in which treatment assignment is randomized within subgroups defined by the values of certain covariates. This is possible under the following three conditions [13]: (1) well-defined intervention—the values of the treatment variable correspond to well-defined interventions that could be applied experimentally in principle; (2) conditional exchangeability—the potential



³ For example, suppose X indicates whether it has rained, Y indicates whether the lawn sprinkler was on, and M indicates whether the lawn is wet [Pearl 2009]. Given that the lawn is wet, if the sprinkler was not on then it must have rained.

Test #1 (t1): R' A' B' C' D' B' C' D** *****
Test #2 (t2): R' A' E' G' H' I*****
Test #3 (t3): R' A' E' F' H' I****' B' C' D****
Test #4 (t4): R' A' E' G***
Test #5 (t5): R' A' B' C***

[Figure 1: Motivating example. (a) Static call graph (image not recoverable from the text). (b) Method call sequences, as listed above, where ' indicates a method call and * a method return. (c) Method coverage under tests, reconstructed from the call sequences:]

Method  t1  t2  t3  t4  t5
R       1   1   1   1   1
A       1   1   1   1   1
B       1   0   1   0   1
C       1   0   1   0   1
D       1   0   1   0   0
E       0   1   1   1   0
F       0   0   1   0   0
G       0   1   0   1   0
H       0   1   1   0   0
I       0   1   1   0   0
Y       0   0   1   0   0

columns, a binary value indicates whether method m was executed by test t or not (1 for yes). An additional row labeled Y at the bottom indicates the outcome of each test (1 for failing and 0 for passing).

IV. DEFINITIONS

In this section, we present definitions and equations used in our confounder selection algorithm.

A. Dynamic Call Graph

Whereas a static-CG is constructed from a program's code without running the program, a dynamic call graph (dynamic-CG) is built from method call sequences observed when executing a program on a set of inputs. In a dynamic-CG, vertices represent executed methods and directed edges represent executed calls. Each method is represented using a single vertex regardless of the number of paths on which it appears. The dynamic-CG for our motivating example is shown in Figure 2(a). Dynamic-CGs are useful for SFL because they represent the method calls that have actually occurred during execution of a set of tests.

Pr[Y^t = y] = Σ_z Pr[Y = y | T = t, Z = z] Pr[Z = z]

The right-hand side of this equation is a probability-weighted or "standardized" average of conditional probabilities over strata of Z. (Not all methods for estimating causal effect measures are directly based on this equation [11].)
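Plugging hypothetical numbers into this standardization formula makes the computation concrete. The stratum probabilities below are invented for illustration:

```python
# Hypothetical inputs: Pr[Y=1 | T=t, Z=z] for each stratum, and Pr[Z=z].
p_y1_given = {(1, 0): 0.30, (1, 1): 0.60,   # treated (t = 1)
              (0, 0): 0.10, (0, 1): 0.40}   # untreated (t = 0)
p_z = {0: 0.5, 1: 0.5}

def standardized_risk(t):
    """Pr[Y^t = 1] = sum over z of Pr[Y=1 | T=t, Z=z] * Pr[Z=z]."""
    return sum(p_y1_given[(t, z)] * p_z[z] for z in p_z)

# Unconfounded risk difference, assuming Z satisfies the Backdoor Criterion:
print(round(standardized_risk(1) - standardized_risk(0), 3))  # 0.2
```

Here the standardized risks are 0.45 under treatment and 0.25 under control, so the adjusted risk difference is 0.2.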

B. Dynamic Data-Dependence Graph

Dynamic data dependences are important for SFL because they carry the values used in individual statements [9] or methods, even if these values are not transmitted via method parameters. Informally, a method-level dynamic data dependence graph (dynamic-DDG) G is a directed graph whose vertices represent methods and whose edges represent data dependences that were realized between methods during execution of a program on a set of inputs. The presence of an edge in G from m_i to m_j means that a computation carried out in method m_j directly depended on a value computed in method m_i. This means m_j used a variable (e.g. an object field) that was last defined by m_i. More precisely, let m_i and m_j be two methods in a call sequence S = m_1, m_2, …, m_n, where i < j; let D(m_i) be the set of program variables defined in method m_i; and let U(m_j) be the set of program variables used in method m_j. Then m_j is directly dynamically data dependent [22] on m_i, denoted m_j ddd m_i, iff

III. MOTIVATING EXAMPLE

We now present a motivating example, which will be used later to illustrate key ideas. Consider a simple program consisting of ten methods, with a fault located in method D. A global variable is defined in the root method R. Assume that this variable might be used and redefined in several of the methods, including D. Figure 1(a) shows the static call graph (static-CG) of the program, whose vertices correspond to methods and whose edge set represents all potential caller-callee relationships indicated in the program code. Figure 1(b) shows five method call sequences induced by five tests; ' indicates a method call and * indicates a method return. Figure 1(c) shows the associated method coverage table (MCT). The first column lists the ten methods; the remaining columns characterize the five tests. In each cell (m, t) of the latter


Definition [10]: A set of variables Z satisfies the Backdoor Criterion relative to an ordered pair of variables (T, Y) in a causal DAG G if
a. No variable in Z is a descendant of T; and
b. Z blocks every path between T and Y that contains an arrow into T.

Confounding in observational studies is a form of lack of exchangeability between the treated and the untreated [13]. In the presence of confounding, the effects of covariates are mixed with the effect of the treatment, and therefore associational effect measures are not causal effect measures. The following theorem of Pearl justifies adjusting for confounding bias by conditioning on a set of variables that satisfies the Backdoor Criterion.

Theorem [10]: If a set of variables Z satisfies the Backdoor Criterion relative to (T, Y), then the causal effect of T on Y is identifiable and is given by this formula:⁴


D(m_i) ∩ (U(m_j) − D(S(i+1, j−1))) ≠ ∅

⁴ This formula has been rewritten to use potential outcome notation instead of Pearl's do(x) notation.


where S(i+1, j−1) is the (possibly empty) subsequence m_{i+1}, …, m_{j−1} of S and where D(S(i+1, j−1)) is the set of program variables defined in S(i+1, j−1). The edge set of the dynamic-DDG G is {(m_i, m_j) | m_j ddd m_i}. Algorithms for computing dynamic data dependences can be found in [22]. Here we consider only dynamic data dependences involving object fields, because passing of values via method parameters is already characterized by the dynamic-CG. The dynamic-DDG for our motivating example is shown in Figure 2(b).
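The definition above translates directly into code. The following sketch scans one call sequence; the def/use sets are illustrative stand-ins, not taken from the paper's subjects:

```python
def ddd_edges(seq, defs, uses):
    """Dynamic data-dependence edges realized by a single call sequence.

    seq:  method names in execution order (m_1, ..., m_n)
    defs: method -> set of variables it defines, D(m)
    uses: method -> set of variables it uses, U(m)
    Adds edge (m_i, m_j) when D(m_i) intersects U(m_j) minus the
    variables redefined by the intervening subsequence S(i+1, j-1).
    """
    edges = set()
    for j in range(len(seq)):
        for i in range(j):
            killed = set()
            for k in range(i + 1, j):
                killed |= defs.get(seq[k], set())
            if defs.get(seq[i], set()) & (uses.get(seq[j], set()) - killed):
                edges.add((seq[i], seq[j]))
    return edges

# A global g defined in R and used in D, with E also redefining g:
defs = {"R": {"g"}, "E": {"g"}}
uses = {"D": {"g"}}
print(sorted(ddd_edges(["R", "A", "B", "C", "D"], defs, uses)))  # [('R', 'D')]
print(sorted(ddd_edges(["R", "E", "D"], defs, uses)))            # [('E', 'D')]
```

In the second sequence the intervening definition in E kills R's definition of g, so the dependence edge runs from E rather than R.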

of the following high-level steps:
1. Integrate the dynamic-CG and dynamic-DDG of a program.
2. Identify with each vertex m of the integrated graph a binary "treatment" variable T_m.
3. Add an outcome vertex Y to the graph, and for each method m add a directed edge (T_m, Y).
4. Delete all duplicate edges.
5. Delete all edges from a vertex to itself.
6. Remove any remaining cycles (during confounder selection—see next section).

Each vertex of the causal graph may be viewed as a binary random variable whose values are induced by program tests. For a method m and test t the variable T_m is a coverage indicator for m, which is 1 iff m was executed by t. T_m is viewed as a treatment variable when we estimate the failure-causing effect of method m, and T_m may serve as a covariate when we estimate the FCE of another method. The outcome variable Y is 1 for a test t iff the program failed on t. Figure 2(c) shows, for our motivating example, the integrated graph resulting from step (1). Solid edges represent method calls and dashed edges represent dynamic data dependences. The graph resulting from steps (1)-(5) is shown in Figure 2(d). Note that this graph is cyclic.
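Steps (1)-(5) amount to a few set operations. A minimal sketch, with edge lists invented for illustration:

```python
def build_causal_graph(cg_edges, ddg_edges, methods):
    """Steps (1)-(5) of the construction above; step (6), cycle removal,
    is deferred to confounder selection. Representing the edge set as a
    Python set removes duplicate edges (step 4) automatically."""
    edges = set(cg_edges) | set(ddg_edges)          # step 1: integrate
    edges |= {(m, "Y") for m in methods}            # steps 2-3: T_m -> Y
    return {(u, v) for (u, v) in edges if u != v}   # step 5: no self-loops

cg_edges = [("R", "A"), ("A", "B"), ("A", "B")]     # duplicate call edge
ddg_edges = [("R", "B"), ("B", "B")]                # self-loop, e.g. recursion
print(sorted(build_causal_graph(cg_edges, ddg_edges, ["R", "A", "B"])))
```

The duplicate call edge and the self-loop are dropped, and every method vertex gains an edge into the outcome vertex Y.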

V. ESTIMATION OF FAILURE-CAUSING EFFECT

In this section, we present our approach to estimating the failure-causing effect (FCE) of a method, which will serve as its suspiciousness score. We first present a procedure for constructing a method-level causal graph for a program. We then describe the criteria used for selecting confounders for adjustment and present our confounder selection algorithm. Finally, a statistical model is presented for estimating a method's FCE.

A. Causal Graph Construction

Consider a method m and a method a that is m's caller (direct or indirect) or its dynamic data-dependence ancestor. Method a may contribute to either (a) triggering a fault in method m so that it causes a program failure or (b) propagating via m the effects of a fault outside m so that a failure occurs. Ancestor a might also cause a program failure without execution of m. Without knowledge of m's or a's semantics, a must be considered a possible common cause of execution of m and of program failure. Thus, a is a possible confounder of the failure-causing effect of m. For these reasons, our procedure for constructing a method-level causal graph for a program involves combining its dynamic-CG and dynamic-DDG. The procedure consists

B. Criteria for Confounder Selection

Selecting confounders to adjust for is the central task in reducing confounding bias. Essentially, our approach to confounder selection involves application of Pearl's Backdoor Criterion, with additional steps taken to ensure the positivity property is satisfied. However, the runtime characteristics of programs complicate the confounder selection problem. In particular, program recursion and


[Figure 2: Causal graph construction for motivating example. (a) Dynamic call graph. (b) Dynamic-DDG. (c) Integrated graph. (d) Causal graph.]


loops give rise to cycles in dynamic-CGs and dynamic-DDGs, respectively, and program usage and internal semantics may lead to positivity violations. In this section, we present confounder selection criteria and heuristics that address these issues.

Acyclicity: Steps (4) and (5) above do not necessarily remove all cycles from the integrated causal graph. Causal inference theory deals mainly with causal DAGs and not cyclic graphs [10]. However, to the extent that faults are actually triggered by non-cyclic patterns of code coverage, it is reasonable to disregard cycles for the purposes of SFL. The utility of this must be demonstrated empirically, of course. Baah et al [9] argued that when selecting confounders to adjust for in estimating the failure-causing effect of a binary coverage indicator for a statement s, it suffices to consider covariates along acyclic program dependence chains terminating at s. We apply this idea to method-level causal graphs. If the treatment variable T_m belongs to a cycle, the cycle edge that leaves T_m is ignored, and confounders are chosen from among the ancestors of T_m in the "modified" causal graph. There are no cycles involving T_m in this graph. For example, in Figure 2(d), there is a causal loop involving method F. To estimate the failure-causing effect of method F, we can ignore the cycle edge leaving F and choose the coverage indicators of F's remaining ancestors as candidate confounders.

Causal Ancestors and Positivity Checking: The direct callers and data-dependence predecessors of a method m directly influence m by determining whether m is executed and by defining its inputs, respectively. Since adjusting for the direct causes of a variable permits its causal effect to be identified [10], it is natural to consider selecting as confounders the coverage-indicator variables associated with the direct callers and dynamic data-dependence predecessors of m. However, it is not always possible to do this and also satisfy the positivity property. Recall that positivity requires that there be both treated and untreated units for every combination of values of the observed confounder(s) in the population under study [19]. That is, for a discrete-valued treatment T and a covariate vector Z, positivity holds if Pr[T = t | Z = z] > 0 for all t and for all z such that Pr[Z = z] > 0. Positivity has been ignored in previous research on causal SFL. A given direct caller or data-dependence predecessor of a method m may always be executed prior to m, or it may never be executed prior to m. More generally, for a chosen set A of dynamic-CG and/or dynamic-DDG ancestors of m, the coverage indicators T_a for a ∈ A may take on a particular configuration (vector) of values for which m is always executed or m is never executed. In this case, positivity will be violated with respect to the treatment variable T_m and the covariates T_a for a ∈ A. Fortunately, positivity is an empirically checkable property, given data about the relevant variables. To circumvent violations of positivity involving the treatment variable T_m, it is natural to consider sets A of ancestors of method m that lie at progressively greater distance from m in the causal graph, until a set is found that satisfies the

positivity property. However, the number of such ancestor sets may be large, and the number of value combinations realized by the associated coverage indicators is likely to be much larger. Hence, checking them exhaustively for each method may be impractical.

Heuristic: We therefore adopt a heuristic approach for generating and checking candidate sets of confounders from among the dynamic-CG and dynamic-DDG ancestors of a method m. This is to check for positivity only with respect to individual ancestors; that is, to check whether or not Pr[T_m = t | T_a = u] > 0 for individual ancestors a of m in the causal graph (and for all t, u ∈ {0,1}). If this holds for each ancestor a of m in a set A of ancestors of m, we say that the weak positivity criterion holds for A. Weak positivity is a necessary, but not sufficient, condition for positivity to hold. During confounder selection we eliminate candidate confounders that violate weak positivity. After a set of confounders is formed from the remaining candidates, we check (strong) positivity. For a given subject program, we check weak positivity for each pair of methods, using the following simple algorithm, to produce a weak positivity table for use in confounder selection:

Algorithm 1: WeakPositivityValidation
Input: integer array mct[1..m, 1..n]; // Method Coverage Table
Output: boolean array wpt[1..m, 1..m]; // Weak Positivity Table

for each method i := 1 until m do
  for each method j := i + 1 until m do
    boolean flag := p11 := p10 := p01 := p00 := false;
    for each test k := 1 until n do
      if mct[i, k] = 1 and mct[j, k] = 1 then p11 := true;
      if mct[i, k] = 1 and mct[j, k] = 0 then p10 := true;
      if mct[i, k] = 0 and mct[j, k] = 1 then p01 := true;
      if mct[i, k] = 0 and mct[j, k] = 0 then p00 := true;
      if p11 and p10 and p01 and p00 then
        flag := true; break;
    end for
    wpt[i, j] := flag; wpt[j, i] := flag;
  end for
end for
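Algorithm 1 is straightforward to express in runnable form. This Python sketch (ours, not the paper's artifact) uses a set of observed coverage pairs in place of the four boolean flags:

```python
from itertools import combinations

def weak_positivity_table(mct):
    """mct[i][k] = 1 iff method i was executed by test k.
    wpt[i][j] is True iff all four coverage combinations (1,1), (1,0),
    (0,1), (0,0) of methods i and j occur among the tests."""
    m = len(mct)
    wpt = [[False] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        seen = {(a, b) for a, b in zip(mct[i], mct[j])}
        wpt[i][j] = wpt[j][i] = (len(seen) == 4)
    return wpt

mct = [
    [1, 1, 1, 1, 1],  # always executed, e.g. an input-validation method
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
]
wpt = weak_positivity_table(mct)
print(wpt[0][1], wpt[1][2])  # False True
```

The always-executed method fails weak positivity against every other method, which mirrors the paper's observation that such methods must be excluded.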

C. Stepwise Confounder-Selection Algorithm

Our stepwise confounder selection algorithm takes the weak positivity table and the causal graph of a program as inputs, and it outputs the confounder set chosen for each method. The initial set of candidate confounders for a method m consists of the treatment variables (coverage indicators) for m's parents (m's direct callers and data-dependence predecessors). The algorithm checks weak positivity, Pr[T_m = t | T_a = u] > 0 (t, u ∈ {0,1}), for each candidate confounder T_a in the candidate set. The covariate T_a is selected if weak positivity holds; otherwise its parents are considered instead. This procedure is repeated until the set of candidates is empty. Final verification of (strong) positivity is required to form a qualified confounder set.


Algorithm 2: ConfounderSelection
Inputs: boolean array wpt[1..m, 1..m]; // Weak Positivity Table
        CausalGraph cg(V, E); // Causal Graph
Output: List confounderSetList;


for each method i := 1 until m do
  boolean[] visited := new boolean[m];
  visited[i] := true;
  VertexSet confounderSet := new VertexSet();
  VertexSet initialSet := cg.getParentsOf(i);
  Stack stack := new Stack(initialSet);
  while !stack.isEmpty() do
    j := stack.pop();
    if !visited[j] then
      visited[j] := true;
      if wpt[i, j] then // Weak positivity holds
        confounderSet.addDistinct(j);
      else // Weak positivity is violated
        stack.add(cg.getParentsOf(j));
      end if
    end if
  end while
  confounderSetList.add(i, confounderSet);
end for
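A runnable rendering of Algorithm 2 for a single method, with invented graph data (the weak positivity table here is keyed by unordered method pairs):

```python
def select_confounders(parents, wpt, method):
    """Stepwise confounder selection for one method (cf. Algorithm 2).

    parents: vertex -> list of parents (direct callers and
             data-dependence predecessors) in the causal graph
    wpt:     frozenset({a, b}) -> bool, the weak positivity table
    """
    visited = {method}
    confounders = set()
    stack = list(parents.get(method, []))
    while stack:
        j = stack.pop()
        if j in visited:
            continue
        visited.add(j)
        if wpt.get(frozenset((method, j)), False):  # weak positivity holds
            confounders.add(j)
        else:                                       # climb to j's parents
            stack.extend(parents.get(j, []))
    return confounders

# Toy graph R -> A -> B plus R -> B; A violates weak positivity with B,
# so selection climbs past A to R.
parents = {"B": ["A", "R"], "A": ["R"]}
wpt = {frozenset(("B", "A")): False, frozenset(("B", "R")): True}
print(sorted(select_confounders(parents, wpt, "B")))  # ['R']
```

The `visited` set guards against revisiting vertices, which is what lets the search ignore cycles in the underlying graph, as the acyclicity criterion requires.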

reinforcement learning algorithms implemented in Java, along with several applications that employ the algorithms. Faults: We selected a sample of faults randomly from the bug database of each project. Six faults were selected for ROME, five for Xerces2, eight for XStream, and seven for HRL. Two of the faults were removed from HRL, leaving a total of five, because they were in an input-validation method that was executed by every test. Such a method violates the positivity property. Output Checking: Each of the subject programs produces complex output and lacks an oracle. To reduce our effort and to eliminate subjective judgments about the correctness of outputs, we instrumented the programs (manually) to detect the trigger conditions for the faults and to report which faults were triggered and affected program outputs. In practice, developers or users must check program outputs or write self-checking test cases if no oracle is available. Test Inputs: With the first two projects, we reused test inputs from Augustine et al.'s work [28]. For ROME, hundreds of Atom and RSS files were downloaded from Google Search results for use as inputs, using a custom web crawler. For Xerces2, XML files were collected from the system directories of an Ubuntu Linux 7.04 machine and from Google Search results. For XStream, we captured thousands of live objects, used in real programs, as inputs. For HRL, we created a number of test cases to ensure that every HRL component was tested. A summary of the subject programs and datasets is presented in Table 1.

D. Fault Localization Model

Appropriately specified statistical regression models may be used to estimate causal effects [11]. Recall that Baah et al [9] used a linear regression model to estimate the failure-causing effect of an individual program statement. Similarly, we employ a linear regression model to estimate the failure-causing effect of method m:

Y = α_m + τ_m T_m + β_m X_m + ε_m

where Y is the outcome variable, T_m is the treatment variable, X_m is a k × 1 vector of covariates C selected to adjust for confounding bias, and ε_m is a random error term that ideally does not depend on the values of T_m and X_m. β_m is a 1 × k vector of unknown coefficients (parameters). The coefficient τ_m is the FCE of m, that is, the average effect of executing method m on the occurrence of program failures. The model is fitted with data from a set of tests or operational executions. We use the fitted (estimated) value

τ̂_m of τ_m as a suspiciousness score for m. A program's methods are sorted in non-increasing order of these values for examination by developers. If positivity cannot hold, as when a method has no ancestors or none of its candidate confounders satisfies positivity, we currently employ the naive model Y = α_m + τ_m T_m + ε_m, unless T_m is constant, in which case m is given a negative (minimal) suspiciousness score.
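To make the model concrete, here is a pure-Python sketch that fits the regression by ordinary least squares (via the normal equations) and reads off τ̂_m as the suspiciousness score. The study itself fits its models in R; the solver and the data below are invented for illustration.

```python
# Pure-Python OLS sketch of the MFL regression Y = a + tau*T + b*X + e.
# The failure-causing effect (FCE) estimate tau_hat is the fitted
# coefficient on the treatment (coverage) indicator T.
def ols(rows, y):
    """Least-squares coefficients for design matrix `rows` (list of lists)."""
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yk for r, yk in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            A[r] = [arc - f * acc for arc, acc in zip(A[r], A[c])]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta

# Invented data for 8 tests: t = coverage of method m, x = coverage of a
# potential confounder, y = failure indicator; failures track t exactly.
t = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1, 1, 0, 0, 1, 1, 0, 0]
y = [1, 1, 1, 1, 0, 0, 0, 0]
design = [[1, ti, xi] for ti, xi in zip(t, x)]
alpha_hat, tau_hat, beta_hat = ols(design, y)
```

With these data, failures coincide exactly with coverage of m even after adjusting for x, so the fitted τ̂_m is 1 and the intercept and confounder coefficients are 0.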


Table 1. Characteristics of subject programs

Program | LOC  | #Methods | Mean LOC per Method | #Faults | #Test Cases | Failure Rate
ROME    | 24K  | 1,602    | 14.98               | 6       | 900         | 28.56%
Xerces2 | 167K | 6,034    | 27.67               | 5       | 6,193       | 38.83%
XStream | 31K  | 2,955    | 10.49               | 8       | 7,208       | 44.76%
HRL     | 11K  | 768      | 14.32               | 7       | 1,003       | 57.32%

Tools: We instrumented the subject programs using the ASM framework [29], which is a tool designed to generate and manipulate Java classes. We also implemented the following support software: a tool for analyzing method call sequences and data flows; Masri et al.’s algorithm [22] for computing dynamic data dependences; our algorithm for confounder selection; and the statistical components of MFL and other fault localization techniques (using the statistical computing environment R [30]).

VI. EMPIRICAL STUDY

We conducted an empirical study to assess our Method-level Fault Localization technique (denoted by MFL). It addressed two main research questions:

• RQ-1: How well does MFL identify faulty methods?
• RQ-2: How effective are the covariates chosen by Algorithm 2 for reduction of confounding?

A. Evaluation Methodology

For each subject program and each SFL technique that we considered, the following procedure was followed, starting with a version of the program containing all of the faults. First, a suspiciousness score was computed for each method. Methods were then examined in non-increasing order of their scores until the first faulty method was discovered. By fixing the fault(s) contained in this method, a new program version was obtained. These steps, which we automated, were repeated until no faulty methods remained. We kept track of the top-ranked faulty method and its rank in each debugging iteration.
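The iterative procedure above can be sketched as follows, under simplifying assumptions: `score` stands in for recomputing suspiciousness on the current program version, and "fixing" a method removes all of the faults it contains. The method names and scoring rule below are invented for illustration.

```python
# Sketch of the automated evaluation loop: rank methods, examine them in
# non-increasing score order until the first faulty one is found, fix it,
# and repeat until no faulty methods remain.
def evaluate(methods, faulty, score):
    """Return the list of methods examined in each debugging iteration."""
    remaining = set(faulty)
    iterations = []
    while remaining:
        ranked = sorted(methods, key=lambda m: score(m, remaining), reverse=True)
        examined = []
        for m in ranked:
            examined.append(m)
            if m in remaining:
                remaining.discard(m)  # "fix" the fault(s) in this method
                break
        iterations.append(examined)
    return iterations

# Toy scoring rule: currently-faulty methods rank highest, ties broken by name.
def score(m, remaining):
    return (m in remaining, m)

iters = evaluate(["m1", "m2", "m3"], {"m1", "m2"}, score)
```

The per-iteration examination lists produced here are exactly what the CostMax and CostMin measures defined below are computed from.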


B. Subject Programs, Test Inputs, and Tools

We applied our technique to four open-source software projects: ROME, Xerces2, XStream, and HRL. ROME [23] is a Java library for parsing, generating, and publishing RSS and Atom feeds. Xerces2 [24] is a Java XML parser. XStream [25] is a Java library for serializing objects to XML and back again. HRL [26, 27] is a collection of


Figure 3. Fault localization costs of MFL and baseline techniques for each subject program

suspiciousness metrics, which we adapted in the obvious way to apply to methods, namely Tarantula [3, 31], Ochiai [2], PFIC [8], and the F1-measure [4]. Each of these techniques requires only coverage information, so it was necessary only to substitute method-coverage profiles for statement-coverage profiles. Computation Times: The study was run on a Dell Vostro 410 with a 2.66 GHz Intel Core2 Quad CPU and 4GB RAM. The times for computing the Tarantula, Ochiai, PFIC, and F1 metrics were less than 2 minutes in each case. MFL and its variants (see below) took no more than one minute per subject program for confounder selection. Fitting regression models with R took 1-7 minutes per subject, depending on the number of test cases.
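For reference, the two most familiar of these baselines can be computed directly from method-level coverage counts. The sketch below uses the standard published formulations of Tarantula [3, 31] and Ochiai [2]; it is not code from the study, and the counts in the example are hypothetical.

```python
from math import sqrt

def tarantula(ef, ep, tf, tp):
    """Tarantula suspiciousness from counts: ef/ep = failing/passing tests
    covering the method, tf/tp = total failing/passing tests."""
    fail_ratio = ef / tf if tf else 0.0
    pass_ratio = ep / tp if tp else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(ef, ep, tf):
    """Ochiai suspiciousness: ef / sqrt(tf * (ef + ep))."""
    denom = sqrt(tf * (ef + ep))
    return ef / denom if denom else 0.0

# A method covered by 3 of 4 failing tests and 1 of 6 passing tests:
s_t = tarantula(ef=3, ep=1, tf=4, tp=6)
s_o = ochiai(ef=3, ep=1, tf=4)
```

Because both metrics depend only on these four counts, substituting method-coverage profiles for statement-coverage profiles is the only change needed to apply them at the method level, as the text notes.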

Cost Functions: We measured the cost of applying an SFL technique by the percentage of methods a developer would examine, proceeding in non-increasing order of suspiciousness scores, until all faulty methods were found. By using our prior knowledge of the actual faults to determine whether each method was faulty, we in effect assumed that all and only the faulty methods would be judged to be faulty by developers. Each fault was corrected immediately after it was found, whereupon suspiciousness scores were recomputed. No new faults were introduced in the study. In practice, developers cannot be certain when they examine a suspicious method and find no fault that the method is actually fault-free. Nor can they be certain that if they attempt to fix a fault they will succeed (and not introduce any new faults). Hence, they may or may not wish to reexamine a method that receives a high suspiciousness score more than once. Therefore, we computed two cost measures, CostMax and CostMin. Intuitively, CostMax is the cost to find all faults in a program given that a developer always examines a method when it receives a high suspiciousness score (even if it was examined previously); CostMin is the cost to find all faults given that a developer never reexamines a suspicious method after examining it and finding no faults:

CostMax = (# methods examined, summed over all iterations / # total methods) × 100%   (1)

CostMin = (# distinct methods examined / # total methods) × 100%   (2)
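A small sketch of the two cost measures, assuming we record the list of methods examined in each debugging iteration; the method names and counts below are hypothetical.

```python
# CostMax counts every examination across all debugging iterations;
# CostMin counts each distinct method at most once.
def cost_measures(examined_per_iter, total_methods):
    """`examined_per_iter`: one list of examined methods per iteration."""
    total = sum(len(it) for it in examined_per_iter)
    distinct = len(set().union(*examined_per_iter)) if examined_per_iter else 0
    cost_max = 100.0 * total / total_methods
    cost_min = 100.0 * distinct / total_methods
    return cost_max, cost_min

# Two iterations over a 10-method program; method 'b' is re-examined in
# iteration 2, so CostMax counts it twice but CostMin only once.
cmax, cmin = cost_measures([["a", "b"], ["b", "c"]], total_methods=10)
```

This also makes the ranges stated in the text visible: CostMax sums examinations over all N iterations and can therefore exceed 100%, while CostMin is bounded by 100%.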

C. RQ-1: Performance of MFL vs. Baselines

For each of the subject programs, we determined the fault localization costs of MFL and the four other suspiciousness metrics, as measured by CostMin and CostMax. The results are summarized in Figure 3.

ROME: With ROME, MFL performed better than each of the other techniques. Figure 3 shows that CostMin for MFL was 12.63%, while for PFIC, the next most efficient technique, CostMin was much larger at 41.59%. The ratio of these CostMin values is 0.303. For MFL, CostMax was just 24.34%, while for Tarantula, the next most efficient technique, CostMax was 49.89%. The ratio of these CostMax values is 0.487. Xerces2: Xerces2 was the most complex subject program, and it was the only one to contain a single fault involving two methods. With Xerces2, MFL performed better than each of the other techniques, though less

Here N is the number of debugging iterations. CostMax varies between 0% and N × 100%, and CostMin ranges from 0% to 100%. Baseline Techniques: We compared our technique MFL against four well-known non-causal statement-level



The figure can also be found at: http://selserver.case.edu:8080/mfl/.

dramatically so than with ROME. For MFL, CostMin was 45.95%, while for PFIC, the next most efficient technique, CostMin was 48.72%. The ratio of these CostMin values is 0.943. CostMax for MFL was 101.43%, while for Ochiai, the most efficient baseline technique, CostMax was 128.71%. The ratio of these CostMax values is 0.788. XStream: With XStream, MFL again performed better than the other techniques. For MFL, CostMin was 30.08%, while for Tarantula, the next most efficient technique, CostMin was 36.38%. The ratio of these CostMin values is 0.826. CostMax for MFL was just 44.34%, while CostMax for the next most efficient technique, PFIC, was 47.39%. The ratio of these CostMax values is 0.935. HRL: HRL was the smallest of the subject programs, and it was the only numerical program among them. With HRL, the F1-measure performed best overall, and MFL was second best. For the F1-measure, CostMin was 11.67% and CostMax was 12.08%. For MFL, CostMin was 13.65% and CostMax was 21.28%. The ratio of the CostMin values for MFL and the F1-measure is 1.169; the ratio of their CostMax values is 1.761. Recall that two of the seven faults initially selected for HRL were discarded because they were located in a method that was executed by every test. Any SFL technique that scores a method m based on the contrast in failure frequency between tests that cover m and tests that don't (i.e., between treated and controls) is not applicable to methods that are covered by all tests or by none. (Techniques that score statements based on contrasts are restricted similarly.) Methods that are always covered or never covered violate the positivity property. MFL by default gives them the minimum score. Statement-level causal SFL techniques can be applied within such methods if MFL does not localize the causes of observed failures elsewhere. Summary: For three of four subject programs, MFL performed better than each of the four baseline techniques.
For HRL, MFL was somewhat more costly than the F1-measure but less costly than the other baseline techniques. The superior performance of MFL was apparently due to reducing the effect of confounding on suspiciousness scores, through the use of causal inference methodology. On the other hand, the high absolute costs of MFL for XStream and especially for Xerces2, even under CostMin, suggest that to achieve adequate precision, MFL must use a more informative set of causal variables, e.g., one including variables that characterize the runtime values of method parameters and object fields.

We compared these two variants to MFL with respect to their costs when applied to ROME, Xerces2, XStream, and HRL. The results are summarized by the last two groups of bars for each program in Figure 3. For ROME, MFL-Call was a little more efficient than MFL with respect to CostMax (19.36% vs. 24.34%); in all other cases, MFL was more efficient than MFL-Call and MFL-Data. For Xerces2, MFL-Data was more efficient than MFL with respect to both CostMin (43.84% vs. 45.95%) and CostMax (53.49% vs. 101.43%). MFL performed better than MFL-Call, however. For XStream and HRL, MFL was more efficient than each of the other approaches with respect to both CostMin and CostMax. Summary: MFL generally showed better and more stable performance in our study than the two alternatives.

E. Threats to Validity

Cost Measures: Although we used two cost measures, CostMax and CostMin, to better evaluate the efficiency of MFL, these measures do not fully characterize its cost in practice, especially since developers might choose to employ suspiciousness scores in combination with their background knowledge and intuitions about a program. Statistical Model: We used an ordinary least squares (OLS) linear regression model to estimate a method's FCE. This may not be the best model for some methods, however. This could be addressed by employing other regression models (e.g., logistic regression) or other causal inference techniques (e.g., inverse probability weighting or matching). Also, our model might be improved by including predictors derived from the values of method parameters and variables, which could be obtained with additional instrumentation. Generalizability: The generalizability of our findings is limited by the fact that our empirical studies involved only four subject programs. Further empirical evaluation is clearly needed. Our study did extend the range of subject programs to which causal SFL techniques have been successfully applied.


VII. RELATED WORK

A few studies have addressed the effectiveness of method- or function-level SFL. However, those studies did not use causal inference methodology or control for confounding bias. Eichinger et al [32] proposed mining edge-weighted dynamic call graphs in order to localize faults. Their technique uses frequent subgraph mining and ranks methods by the average of an entropy-based scoring measure and a structural scoring measure. Jiang et al [33] presented an empirical study of the impact of test case prioritization on the effectiveness of SFL. In their study, the effect of profile granularity is considered, among other factors. Jiang et al used statement coverage to represent fine granularity and function coverage to represent a coarser granularity. Their results suggest that the impact of test case prioritization on fault localization is affected little by granularity level. Cheng et al [34] applied discriminative graph mining, together with graph scoring based on information gain, to method-level (and to basic-block-level) software behavior graphs. Unfortunately, to date it has not been feasible for us to compare MFL

D. RQ-2: Effectiveness of Confounder Selection

To evaluate the effectiveness of our confounder selection algorithm, we compared it empirically to two alternatives:

• MFL-Call uses only m's direct callers to form the initial covariate set, and if a covariate C violates weak positivity, it is replaced with its direct callers.
• MFL-Data uses only m's direct data dependence predecessors to form the initial covariate set, and if a covariate C violates weak positivity, it is replaced with its direct data dependence predecessors.


empirically to these techniques, which are considerably more complex to implement than the baseline techniques we did compare MFL to.

effects given observational data. In an empirical evaluation, MFL performed better overall than four alternative method-level techniques. To improve MFL's precision, we intend to investigate adding causal variables to characterize the values of method parameters and fields. We also plan to investigate a two-stage approach, in which MFL is used in the first stage and a statement-level causal SFL technique is used in the second stage.

VIII. CONCLUSION AND FUTURE WORK

We have presented a novel technique, MFL, that applies causal inference methodology to the problem of statistically localizing faulty methods in large programs based on execution data. MFL employs a new confounder selection algorithm, which is guided by information about dynamic calls and inter-method data dependences and which attempts to satisfy both the conditional exchangeability and positivity properties that are essential to identifying causal


IX. ACKNOWLEDGMENTS

This work was supported by NSF awards CCF-0820217 and CNS-1035602.

X. REFERENCES

[1] I. Vessey, "Expertise in Debugging Computer Programs: A Process Analysis," Intl. Journal of Man-Machine Studies, vol. 23, pp. 459-494, 1985.
[2] R. Abreu, et al., "On the Accuracy of Spectrum-based Fault Localization," in TAICPART-MUTATION, 2007, pp. 89-98.
[3] J. A. Jones, et al., "Visualization of Test Information to Assist Fault Localization," the 24th Intl. Conf. on Software Eng., Orlando, Florida, 2002.
[4] B. Liblit, et al., "Scalable Statistical Bug Isolation," the Conf. on Programming Language Design and Implementation, Chicago, IL, USA, 2005.
[5] C. Liu, et al., "SOBER: Statistical Model-based Bug Localization," the 10th European Software Eng. Conf. held jointly with the 13th ACM SIGSOFT Intl. Symp. on Foundations of Software Eng., Lisbon, Portugal, 2005.
[6] S. Artzi, et al., "Directed Test Generation for Effective Fault Localization," the 19th Intl. Symp. on Software Testing and Analysis, Trento, Italy, 2010.
[7] Lucia, et al., "Comprehensive Evaluation of Association Measures for Fault Localization," in Intl. Conf. on Software Maintenance (ICSM), 2010, pp. 1-10.
[8] G. K. Baah, et al., "Causal Inference for Statistical Fault Localization," the 19th Intl. Symp. on Software Testing and Analysis, Trento, Italy, 2010.
[9] G. K. Baah, et al., "Mitigating the Confounding Effects of Program Dependences for Effective Fault Localization," the 19th ACM SIGSOFT Symp. and the 13th European Conf. on Foundations of Software Eng., Szeged, Hungary, 2011.
[10] J. Pearl, "Causality: Models, Reasoning, and Inference," 2nd ed.: Cambridge University Press, 2009.
[11] S. L. Morgan and C. Winship, "Counterfactuals and Causal Inference: Methods and Principles for Social Research": Cambridge University Press, 2007.
[12] R. L. Fleurence, et al., "The Critical Role of Observational Evidence in Comparative Effectiveness Research," Health Affairs, vol. 29, pp. 1826-1833, 2010.
[13] M. A. Hernán and J. M. Robins, "Estimating Causal Effects from Epidemiological Data," Journal of Epidemiology and Community Health, vol. 60, pp. 578-586, 2006.
[14] J. Ferrante, et al., "The Program Dependence Graph and its Use in Optimization," ACM Trans. on Program. Lang. Syst. (TOPLAS), vol. 9, pp. 319-349, 1987.
[15] S. L. Graham, et al., "Gprof: A Call Graph Execution Profiler," SIGPLAN Not., vol. 17, pp. 120-126, 1982.
[16] T. Ball, "What's in a Region?: or Computing Control Dependence Regions in Near-linear Time for Reducible Control Flow," ACM Lett. Program. Lang. Syst., vol. 2, pp. 1-16, 1993.
[17] R. Gore and P. F. Reynolds, Jr., "Reducing Confounding Bias in Predicate-level Statistical Debugging Metrics," the Intl. Conf. on Software Eng., Zurich, Switzerland, 2012.
[18] E. Yourdon and L. L. Constantine, "Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design," Prentice-Hall, Inc., 1979.
[19] D. Westreich and S. R. Cole, "Invited Commentary: Positivity in Practice," American Journal of Epidemiology, vol. 171, pp. 674-677, 2010.
[20] M. L. Petersen, et al., "Diagnosing and Responding to Violations in the Positivity Assumption," Statistical Methods in Medical Research, vol. 21, pp. 31-54, 2012.
[21] J. Pearl, "Causal Diagrams for Empirical Research," Biometrika, vol. 82, pp. 669-688, 1995.
[22] W. Masri and A. Podgurski, "Algorithms and Tool Support for Dynamic Information Flow Analysis," Information and Software Technology, vol. 51, pp. 385-404, 2009.
[23] ROME Project. Available: http://rometools.org/
[24] Xerces2 Project. Available: http://xerces.apache.org/xerces2-j/
[25] XStream Project. Available: http://xstream.codehaus.org/
[26] HRL Project. Available: https://github.com/bottleneck/HRL_Profiler/
[27] F. Cao and S. Ray, "Bayesian Hierarchical Reinforcement Learning," in Advances in Neural Information Processing Systems (NIPS 2012), pp. 73-81.
[28] V. J. Augustine, "Exploiting User Feedback to Facilitate Observation-based Testing," Doctoral Dissertation, Case Western Reserve University, 2009.
[29] E. Bruneton, et al., "ASM: A Code Manipulation Tool to Implement Adaptable Systems," Fr. Telecom R&D, 2002.
[30] R Project. Available: http://www.r-project.org/
[31] J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-localization Technique," the 20th IEEE/ACM Intl. Conf. on Automated Software Eng., Long Beach, CA, USA, 2005.
[32] F. Eichinger, et al., "Mining Edge-Weighted Call Graphs to Localise Software Bugs," in Machine Learning and Knowledge Discovery in Databases, vol. 5211, W. Daelemans, Ed.: Springer, 2008, pp. 333-348.
[33] B. Jiang, et al., "How Well Do Test Case Prioritization Techniques Support Statistical Fault Localization," in Computer Software and Applications Conference (COMPSAC), 2009, pp. 99-106.
[34] H. Cheng, et al., "Identifying Bug Signatures Using Discriminative Graph Mining," the 18th Intl. Symp. on Software Testing and Analysis, Chicago, IL, USA, 2009.
