Neural Networks 43 (2013) 63–71


Causal gene identification using combinatorial V-structure search

Ruichu Cai a,b,∗, Zhenjie Zhang c, Zhifeng Hao a

a Faculty of Computer Science, Guangdong University of Technology, Guangzhou, PR China
b State Key Laboratory for Novel Software Technology, Nanjing University, PR China
c Advanced Digital Sciences Center, Illinois at Singapore Pte. Ltd., Singapore

Article history: Received 15 May 2012; Received in revised form 23 January 2013; Accepted 31 January 2013.

Keywords: Causal gene; V-Structure; Gene expression data; Causality

Abstract

With the advances of biomedical techniques in the last decade, the costs of human genomic sequencing and genomic activity monitoring are coming down rapidly. To support the huge genome-based business expected in the near future, researchers are eager to find killer applications based on human genome information. Causal gene identification is one of the most promising applications, which may help potential patients to estimate the risk of certain genetic diseases and locate the target genes for further genetic therapy. Unfortunately, existing pattern recognition techniques, such as Bayesian networks, cannot be directly applied to find the accurate causal relationships between genes and diseases, mainly because of the insufficient number of samples and the extremely high dimensionality of the gene space. In this paper, we present the first practical solution to causal gene identification, utilizing a new combinatorial formulation over the V-Structures commonly used in conventional Bayesian networks. We prove the NP-hardness of the combinatorial search problem under a general setting of the significance measure on the V-Structures, and present a greedy algorithm to find suboptimal results. Extensive experiments show that our proposal is both scalable and effective, with interesting findings on the causal genes over real human genome data.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

With the advances of biomedical techniques in the last decade, such as microarrays (Bassett, Eisen, & Boguski, 1999), the cost of gene activity monitoring has come down to several hundred dollars.1 In the near future, it is likely that microarrays will be used to test the gene activities of every person for disease diagnosis or gene therapy, forming a business market worth billions of dollars. Anticipating the arrival of this new genomic age, biomedical researchers are eager to look for killer applications in the huge genomic business. Causal gene identification is one of the most promising applications (Noble, 2008), which aims to help potential patients to accurately estimate their risk with respect to certain diseases. Unfortunately, identification of the causal genes related to genetic diseases is by no means an easy task in the biomedical domain (Cookson, Liang, Abecasis, Moffatt, & Lathrop, 2009). While traditional biological and pathological methods fail to effectively and efficiently discover the causal genes, computer scientists and

∗ Corresponding author at: Faculty of Computer Science, Guangdong University of Technology, Guangzhou, PR China. Tel.: +86 015800030523; fax: +86 20 39323163. E-mail address: [email protected] (R. Cai).
1 http://www.cincinnatichildrens.org/research/cores/gene-expression/fees/.
doi:10.1016/j.neunet.2013.01.025

statisticians are trying to apply machine learning and data mining techniques to tackle the problem, e.g. Cai, Hao, Yang, and Wen (2009), Cai, Tung, Zhang, and Hao (2011), Kim, Wuchty, and Przytycka (2011) and Schadt et al. (2005). Given gene expression data from humans with and without certain genetic diseases, algorithms are designed to automatically find significant genes causing these diseases. In the statistics and learning communities, Bayesian networks (BN) are a common tool used to analyze correlation and causality relationships between variables. By running statistical significance tests on variable combinations, it is possible to construct a probabilistic graphical model to simulate and evaluate the impact of certain variables on others (Cai, Zhang, & Hao, 2011). However, existing BN methods suffer from major drawbacks in the causal gene identification problem. Firstly, complete BN construction needs an exponential number of samples to support accurate estimation in the statistical tests. Secondly, most BN learning methods focus only on building a probabilistic model with high likelihood, instead of finding the exact causal relationships; this potentially leads to a large number of false positive causal connections between genes and diseases, even when the probabilistic model achieves a high likelihood. Therefore, although BN structure learning methods are capable of finding some causal genes, they tend to output noisy results with diminishing accuracy and low robustness, due to the low signal-to-noise ratio, the high


Fig. 1. Pathway related to Acute Myeloid Leukemia.

dimensionality in gene expression data and the limited number of samples. In Fig. 1, we present an example of a known pathway related to Acute Myeloid Leukemia (AML). All the genes' expression levels are highly correlated with the variable AML, i.e. the disease status of the patient. However, the expression levels of STAT, Grb2, Sos, Ras and PI3K do not affect the state of the disease; FLT3 and c-KIT are the only direct causal genes of AML. This means that these two genes should be the target genes of any gene therapy for AML. With traditional gene selection or Bayesian network methods, it is difficult to distinguish FLT3 and c-KIT from the other related genes. This is due to the existence of probabilistically equivalent models different from the true pathway in the figure, which also fully fit the observations, but possibly with completely different structures. In this paper, we present a new solution to the causal gene identification problem, based on a combinatorial formulation over V-Structures in Bayesian networks. A V-Structure is a special type of local probabilistic model involving only three variables X, Y, Z in the form X → Y ← Z. Different from other local structures, V-Structures are more robust and discriminating in causality identification problems, since a V-Structure is not statistically equivalent to any other structure involving the same variables. On the other hand, the computation and verification of V-Structures is relatively cheap, compared to the complete construction of a Bayesian network. The V-Structure thus plays an important role in the Inductive Causation methods commonly used in causal discovery (Pearl, 2009). However, in gene expression data, a large number of false V-Structures may be discovered, since it is hard to obtain significant statistical tests on small-sample, high-dimensional data. Fortunately, we observe that falsely discovered V-Structures can be detected, because they are usually in conflict with true ones. Thus, we transform the causal gene selection problem into an optimization problem, aiming to identify a group of most significant V-Structures with maximal coverage and without conflict. Although the optimization formulation is proven NP-hard, we present a greedy algorithm that is effective at finding sub-optimal causality results with high accuracy. Our tests on synthetic datasets verify the effectiveness of our algorithm. Experiments on real gene expression data reveal interesting results on causal genes related to prostate cancer and Leukemia. The remainder of the paper is organized as follows. Section 2 reviews existing studies on disease-related gene discovery and causality inference. Section 3 discusses the problem definition and preliminary knowledge. Section 4 presents our theoretical analysis and the details of the algorithm. Section 5 presents experimental results, and Section 6 concludes the paper.

2. Related work

Feature selection is the most commonly used tool for disease-related gene discovery (Saeys, Inza, & Larrañaga, 2007). Generally speaking, feature selection methods can be classified into three categories: filters, wrappers and embedded methods. Filters employ only intrinsic properties of the feature without

considering its interaction with the classifier. In wrapper methods, a classifier is built and employed as the evaluation criterion. If the feature selection criterion is derived from the intrinsic properties of a classifier, the corresponding method belongs to the embedded category. False discovery (Reiner, Yekutieli, & Benjamini, 2003) and feature set redundancy (Yu & Liu, 2004) are two problems that need to be considered in all feature selection settings. A causality Bayesian network forms part of the theoretical background of this work. A causality Bayesian network is a special case of a Bayesian network, whose edge directions represent the causal relations among the nodes (Pearl, 2009). The causality Bayesian network is different from the Bayesian networks used in the regulatory network reconstruction problem, such as in Friedman, Linial, Nachman, and Pe'er (2000) and Kim, Imoto, and Miyano (2004). Structure learning of a Bayesian network is closely related to the algorithmic background of this work, e.g. the well-known PC algorithm (Kalisch & Bühlmann, 2007; Spirtes, Glymour, & Scheines, 2001) and Markov Blanket discovery methods (Zhu, Ong, & Dash, 2007). These methods provide the skeleton of causal structures, i.e. parent–child pairs and Markov Blankets. However, they usually cannot distinguish causes from consequences, and mostly rely on other techniques to conduct exact causal discovery. Pearl is the founder of causality analysis theory (Pearl, 2009). Most causality inference works simply assume the availability of a sufficiently large sample set (Aliferis, Statnikov, Tsamardinos, Mani, & Koutsoukos, 2010a, 2010b), or expensive intervention experiments (He & Geng, 2008). Though there are some works aiming to solve the inference problem when only a small number of samples is available (Bromberg & Margaritis, 2009), the sample sizes used in their empirical studies remain significantly larger than the scale of gene expression data. To the best of our knowledge, there is no provable method to run robust causal inference on real gene expression data. In this paper, we present the first practical algorithm to tackle the problems of small sample size and high dimensionality in gene data. Another concept closely related to our work is Granger causality (Lozano, Abe, Liu, & Rosset, 2009; Mukhopadhyay & Chatterjee, 2007), which is used to infer gene regulatory networks from time-series gene expression data. Granger-causality-based work differs from traditional causality inference techniques in two aspects. Firstly, compared with the conventional definition of causality, Granger causality is closer to a regression method and does not reflect the true causal mechanism. Secondly, temporal information is essential for Granger causality inference, which is hard to collect in the disease–gene relationship analysis context.

3. Preliminaries

Assume that all samples from the problem domain contain information on m different genes, i.e. G = {g1, g2, . . . , gm}, and the disease state y. Let D = {x1, x2, . . . , xn} denote the complete sample set. Each sample xi is denoted by a vector xi = (xi1, xi2, . . . , xim, yi), where xij indicates the expression level of sample xi on gene gj, and yi is the disease state associated with sample xi. In particular, if P is the distribution defined on all the genes' expression levels and the state of the disease, i.e.
V = G ∪ {y}, we assume that there exists a Bayesian network BN faithful to the distribution P. A Bayesian network consists of a directed acyclic graph, which encodes conditional (in)dependence relationships among the variables, and conditional probability functions, which model the conditional probability distribution of each variable given its parent nodes. Following the common assumption of existing studies, we only consider problem domains satisfying the Faithfulness Condition (Koller & Friedman, 2009), as stated below.


Fig. 2. Four types of basic local causal structures.

Definition 1 (Faithfulness Condition). Let P denote a joint probability distribution on the variable set V and let BN be a Bayesian network also defined on V. P and BN are faithful to each other iff every conditional independence entailed by the Markov condition on BN is present in P, and vice versa.

Parents, children and spouses are three fundamental relations among the variable nodes in a Bayesian network. Given a target node, e.g. the disease state, the parents of the target node are the variable nodes with directed edges pointing to the target node. Similarly, the children of the target node are the variable nodes with directed edges from the target node. Finally, the spouses of the target node are the variable nodes which share at least one common child with the target node. Node AML in Fig. 1, for example, has parents {FLT3, c-KIT}, children {STAT, Grb2} and no spouses.

Generally speaking, there can be a large number of Bayesian network structures satisfying a single joint probability distribution. A causal Bayesian network (CBN) is a particular Bayesian network in which each arc is interpreted as a direct causal influence between a parent variable node and a child variable node, conditioned on the other nodes in the network. Given a target node in a CBN, its causal nodes are its parents. Assuming Fig. 1 is the structure of a CBN, FLT3 and c-KIT are thus considered the direct causes of AML. In gene–disease relation analysis, we are interested in identifying the genes which determine the state of the disease, i.e. the direct causal genes of a specific disease. The problem of disease-causal gene discovery is formally defined as follows.

Definition 2 (Causal Gene Identification). Given sample set D, the problem of disease causal gene identification is to select the smallest group of genes G′ ⊆ G which are direct causes of the target node y.

4. Combinatorial formulation and algorithm

4.1. V-structure

A Bayesian network (BN) consists of four types of primitive local structures, as shown in Fig. 2. Fig. 2(a)–(c) are independence-equivalent BNs, because the three local structures imply the same assertion: variable g1 is conditionally independent of variable g2 given variable y. However, Fig. 2(d) is not independence-equivalent to the other three; it is called a V-Structure or a collider. A V-Structure implies a different assertion: g1 is independent of g2 when y is not given, but g1 is conditionally dependent on g2 given y. The V-Structure is well known and well studied, due to its high importance in Pearl's Inductive Causation methods (Pearl, 2009). All four local BN structures shown in Fig. 2 can be used to model causality, but independence-equivalent BNs express the same dependency information. In other words, they are equally faithful to the joint probability distribution of the given variables, and are thus indistinguishable without domain-expert knowledge. In contrast, given the variables g1, g2 and y, the V-Structure in Fig. 2(d) can be exclusively and easily identified by testing the following conditional independence conditions: (1) g1 ⊥ g2; and (2) g1 ⊥̸ g2 | y.
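For concreteness, the following Python sketch (our own illustration, not the authors' implementation) applies this pair of tests to binary variables, using a G² log-likelihood-ratio test; the 0.5 pseudocount that keeps contingency cells non-zero is our assumption.

import numpy as np
from scipy.stats import chi2, chi2_contingency

def g2_pvalue(a, b):
    # p-value of the G^2 test of marginal independence for binary arrays
    table = np.zeros((2, 2))
    for u, v in zip(a, b):
        table[u, v] += 1
    _, p, _, _ = chi2_contingency(table + 0.5, correction=False,
                                  lambda_="log-likelihood")
    return p

def g2_pvalue_given(a, b, y):
    # p-value of the G^2 test of independence of a and b given y,
    # pooling per-stratum statistics and degrees of freedom
    a, b, y = map(np.asarray, (a, b, y))
    stat, dof = 0.0, 0
    for v in np.unique(y):
        m = (y == v)
        table = np.zeros((2, 2))
        for u, w in zip(a[m], b[m]):
            table[u, w] += 1
        g2, _, d, _ = chi2_contingency(table + 0.5, correction=False,
                                       lambda_="log-likelihood")
        stat, dof = stat + g2, dof + d
    return 1.0 - chi2.cdf(stat, dof)

def looks_like_v_structure(g1, g2, y, alpha=0.05):
    # g1 -> y <- g2 is suggested when g1 and g2 are marginally independent
    # (p > alpha) but become dependent once y is conditioned on (p <= alpha)
    return g2_pvalue(g1, g2) > alpha and g2_pvalue_given(g1, g2, y) <= alpha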


Fig. 3. Conflict is observed in the V-Structures in the figure, with reversed edges between genes g2 and g3.

Table 1
A sample dataset generating the conflicted V-Structures shown in Fig. 3.

Sample  g1  g2  g3  g4
x1      0   0   0   0
x2      0   0   0   0
x3      1   0   1   1
x4      1   0   1   1
x5      1   1   0   1
x6      1   1   0   1
x7      0   1   1   0
x8      0   1   1   0
In a more general setting, when the BN contains more than 3 nodes, g1 and g2 may only be conditionally independent given some other variables, leading to the following more general definition.

Definition 3 (V-Structure). Given three variables g1, g2 and y, g1 → y ← g2 is a V-Structure, if and only if there exists some variable set Z ⊆ G − {g1, g2} such that (1) g1 ⊥ g2 | Z; and (2) g1 ⊥̸ g2 | {y, Z}.

The definition above illustrates how to find V-Structures in general. We can identify all significant V-Structures following a two-phase strategy. In the first phase, all parent candidates (PCs) with respect to the target node y are discovered, using a standard growth algorithm, e.g. Spirtes et al. (2001). In the second phase, V-Structures are verified by checking every pair of PCs {gi, gj} according to Definition 3.

4.2. Conflicts between V-structures

Despite the stability of individual V-Structures, conflicts can occur between V-Structures, even when evaluating datasets with a large number of samples. Specifically, a conflict between two V-Structures is observed if they contain two directed edges between the same pair of variables but with reversed directions. In Fig. 3, we present an example of a conflict between two V-Structures. To get a better understanding of how noisy data raise conflicts between V-Structures, we hereby describe a simple constructive method which generates conflicting V-Structures following the structure in Fig. 3. Assume there are n (without loss of generality, n is divisible by 4) gene sequence samples on 4 genes, i.e. {g1, g2, g3, g4}. Our approach first generates the expression levels of the samples on g2 and g3, such that (g2 = 0, g3 = 0) appears n/4 times in the samples. Similarly, (g2 = 0, g3 = 1), (g2 = 1, g3 = 0) and (g2 = 1, g3 = 1) also appear n/4 times each. To generate the expression levels of the samples on g1, our approach lets g1 = g3 if g2 = 0, and g1 = 1 − g3 otherwise. In a similar manner, we set g4 = g2 if g3 = 0, and g4 = 1 − g2 otherwise. In Table 1, we list a concrete example with n = 8 samples; a generator reproducing it is sketched below. It is straightforward to verify that the V-Structures calculated from the table follow the exact structure in Fig. 3. We can further extend the capability of the constructive method above by controlling the degree of conditional independence. This is achieved by replacing the entries on g1 (resp. g3) with new expression levels independent of g3 conditioned on g2. Therefore, we can always simulate V-Structures of arbitrary strength by running our constructive approach. This is summarized in the following lemma, which can be proved in a straightforward way.
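The constructive method is easy to reproduce. Below is a minimal sketch (our own code, assuming numpy) that regenerates Table 1 for n = 8; applying the V-Structure test from Section 4.1 to its columns yields the two conflicting V-Structures of Fig. 3.

import numpy as np

def conflicted_samples(n=8):
    # n must be divisible by 4; each joint state of (g2, g3) appears n/4 times
    assert n % 4 == 0
    g2 = np.repeat([0, 0, 1, 1], n // 4)
    g3 = np.repeat([0, 1, 0, 1], n // 4)
    # g1 = g3 when g2 == 0, else 1 - g3
    g1 = np.where(g2 == 0, g3, 1 - g3)
    # g4 = g2 when g3 == 0, else 1 - g2
    g4 = np.where(g3 == 0, g2, 1 - g2)
    return np.column_stack([g1, g2, g3, g4])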


Lemma 1. Given a group of V-Structures with arbitrary conditional independence levels, there always exists at least one gene expression sample set accurately rendering all these V-Structures, when the number of samples is large enough.

The analysis above implies that conflicts between V-Structures can happen in arbitrary form. This leads to difficulties in our causal gene selection problem, since we are unable to identify the correct direction of causality when conflicts are observed between two genes. To resolve the issue of conflicts, we propose a combinatorial formulation to select V-Structures with maximal significance and zero conflict.

4.3. Combining V-structures

Both latent variables and noise in the data can be the underlying reason behind conflicts between V-Structures. In gene expression data, a large number of V-Structures with wrong causal relationships may occur, because the small sample cardinality, high dimensionality and low signal-to-noise ratio render insignificant results in almost all statistical tests. The focus of this paper is to tackle the problem of conflicting V-Structures for more robust and scalable causation analysis. Fortunately, we observe that falsely discovered V-Structures can be detected, since they usually conflict with the true ones. Moreover, when two conflicting V-Structures are discovered, the one with the lower significance measure is usually the falsely discovered one. Based on the V-Structures, we believe the true causation can be refined by removing the falsely discovered V-Structures. Formally, we transform the causal gene selection problem into an optimization problem, aiming to identify a group of most significant V-Structures with maximal coverage and zero conflict. Given a set of related V-Structures S = {v1, v2, . . . , vk}, we assume that their corresponding significance weights {s1, s2, . . . , sk} are also available, such that each si ∈ [0, 1] indicates the significance of V-Structure vi. In particular, we are interested in significance weighting schemes satisfying the following definition.

Definition 4 (Consistency Condition). The significance weighting scheme satisfies the consistency condition if, given any significance weights s = {s1, . . . , sk}, there always exists a group of feasible V-Structures with exactly the weights s under the weighting scheme.

Specifically, the condition above implies that the weighting scheme is a mapping from the conditional independence level onto the complete domain [0.5, 1]. To identify robust causal genes, given the V-Structures, we aim to find a subset R ⊆ S without any conflicting pair of V-Structures that maximizes the following objective function:

F(R) = ∏_{vi ∈ R} si · ∏_{vj ∈ S−R} (1 − sj).    (1)
Intuitively, optimization of Eq. (1) enjoys the following three good properties: (1) if vi does not conflict with any other V-Structure, vi is contained in the final result, because the significance level si falls in the domain [0.5, 1], so the contribution of picking this V-Structure for the result set, si, is larger than that of not selecting it, (1 − si); (2) if two V-Structures conflict with each other, the more significant one is selected, because V-Structures with larger significance measures are more statistically reliable; (3) when the conflict happens among more than two V-Structures, the objective tries to maximize the significance of the selected V-Structures, based on the assumption that each V-Structure is independent of the others. Optimization of Eq. (1) turns out to be a binary programming problem and is naturally NP-hard. In the following theorem, we prove this statement rigorously by constructing a polynomial reduction from the problem of Maximum Independent Set.
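For clarity, Eq. (1) transcribes directly into code (a sketch with our own naming; S is the full candidate set, R the selection, and s maps each V-Structure to its significance weight):

def objective(S, R, s):
    # F(R) of Eq. (1): product of s_i over selected V-structures,
    # times product of (1 - s_j) over the discarded ones
    val = 1.0
    for v in S:
        val *= s[v] if v in R else (1.0 - s[v])
    return val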

Fig. 4. Example of the reduction from Maximum Independent Set to V-Structure combination.

CombineVStructure (D, y, α)
Input: D: sample set, y: target variable, α: threshold of the conditional independence test.
Output: R: set of V-Structures
  Find the PCs of target node y;
  Find all significant V-Structures using threshold α and put them in S;
  /* Deal with conflicts */
  Let R = S;
  Rank the V-Structures vi ∈ R in ascending order of si;
  for each vi in R do
    for each vj (j > i) in the order do
      if vi conflicts with vj then
        remove vi from R;
  return R;
Algorithm 1: Combining V-Structures

Theorem 1. It is NP-hard to find the combination of V-Structures maximizing Eq. (1).

Proof. Given a graph G(V, E), the problem of Maximum Independent Set is to select the maximum subset V′ of vertices in V such that there is no edge (v1, v2) ∈ E for any pair v1, v2 ∈ V′. To reduce this problem to our combinatorial formulation, for each edge in E we generate a pair of genes such that conflicting causal relationships with reversed edges are constructed. After the generation of the conflicting edges, we build another group of genes for each vertex in V and connect them to the conflict pairs correspondingly. Given the V-Structures, the next step is to generate the significance weights {s1, s2, . . . , sk} for each V-Structure. We simply use the same weight s > 0.5 for every V-Structure. Using the consistency condition on the significance weights, there always exists a reverse mapping from the weights to the conditional independence levels of the V-Structures. Finally, using Lemma 1, we are able to generate a group of samples simulating the conditional independence levels of the V-Structures among the genes. Fig. 4 presents an example of the reduction, which transforms an original graph with 4 nodes into a gene network with 8 genes. It is straightforward to show that the result of the optimization formulation with Eq. (1) is always equivalent to the optimal answer of the maximum independent set problem, which completes the proof of the theorem. □

In this work, we propose a heuristic solution for this problem, which repeatedly removes the least significant V-Structure from S until there is no conflicting pair. Moreover, since the discovery of all related V-Structures in a high-dimensional data set is computationally expensive, our method only considers V-Structures whose collider is a parent candidate (PC) of the target node. The details are given in Algorithm 1, and a minimal Python sketch follows.
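Below is a sketch of the conflict-handling phase of Algorithm 1 (the data structures — a significance dict and a set of frozenset conflict pairs — are our own illustrative choices, not the authors' code):

def combine_v_structures(vstructs, s, conflicts):
    # scan candidates in ascending significance and drop any structure
    # that conflicts with a more significant one later in the order
    order = sorted(vstructs, key=lambda v: s[v])
    kept = set(order)
    for i, vi in enumerate(order):
        if any(frozenset((vi, vj)) in conflicts for vj in order[i + 1:]):
            kept.discard(vi)
    return kept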


It is obvious that the significance measure, i.e. si for each V-Structure vi, plays an important role in the performance of our causal gene identification algorithm. In the following, we discuss two such measures, ST and SI, based on conditional independence analysis and information theory, respectively.

Independence test approach: From the perspective of probabilistic graphical models, the significance of a V-Structure depends on the significance of the following conditional independence statements: (1) whether g1 and g2 are independent of each other given the conditioning set Z, i.e. g1 ⊥ g2 | Z; and (2) whether g1 and g2 are dependent on each other when the conditioning set is Z ∪ {y}, i.e. g1 ⊥̸ g2 | {y, Z}. To simplify the analysis, we assume these two events are independent of each other; the significance of a V-Structure is thus defined as:

ST(g1, g2, y) = p(g1 ⊥̸ g2 | {y, Z}) · p(g1 ⊥ g2 | Z)    (2)

in which p(·) is the p-value of the corresponding (conditional) independence test. In the following, we discuss the applicability of the consistency condition to this measure. Assume that the confidence level of the (conditional) independence test is 0.95. Given a specific significance value ST(g1, g2, y) ∈ [0.95², 1], we can construct a set of samples with a V-Structure whose significance measure is ST(g1, g2, y) by applying the following steps. Firstly, we generate a set of samples on g1, g2, Z, ensuring that p(g1 ⊥ g2 | Z) = √ST(g1, g2, y). Secondly, we create the values of variable y for each sample, based on the corresponding values of g1, g2, Z, which guarantees that p(g1 ⊥̸ g2 | {y, Z}) = √ST(g1, g2, y). The feasibility of the above construction is due to the fact that the p-value of a (conditional) independence test can achieve any value in the interval (0, 1) when one variable is free. This implies that the measure satisfies the consistency condition.

Information theory approach: From the information theory perspective, a V-Structure stands for an information flow in which variance in Z blocks the flow from g1 to g2, while the collider y reconnects the information flow between g1 and g2. Based on this observation, the significance of a V-Structure g1 → y ← g2 can be measured by two factors: (1) the degree to which Z blocks the information flow, and (2) the degree to which y reconnects the flow given the condition Z. In addition, the change of the information flow is also highly related to the information content of g1 and g2. In the extreme case, if the entropy of g1 or g2 is zero, the change of the information flow is also zero. We thus model the significance of the V-Structure as the proportion of the changed information flow, as listed below:

SI(g1, g2, y) = (MI(g1, g2 | Z, y) − MI(g1, g2 | Z)) / min(H(g1), H(g2)).    (3)
In the equation above, MI(g1, g2 | Z) and MI(g1, g2 | Z, y) are the conditional mutual information between g1 and g2 conditioned on Z and on {Z, y}, respectively, and H(g1) and H(g2) are the entropies of g1 and g2. Similar to the independence test approach, we also verify the consistency condition for the information theory approach. In particular, given any significance value SI(g1, g2, y), we can construct a set of samples yielding a V-Structure with significance value SI(g1, g2, y). Firstly, we generate the values of the samples on g1 and g2 satisfying H(g1) = H(g2) = 1. Secondly, our method generates the values of Z to ensure MI(g1, g2 | Z) = 0. Finally, the values of y are built according to the existing values of g1, g2, Z such that MI(g1, g2 | Z, y) = SI(g1, g2, y). The feasibility of the above construction is based on the fact that the entropy/mutual information function can achieve any value given one free variable. Thus, the combinatorial problem using the SI significance measure is also NP-hard, due to the satisfaction of the consistency condition.
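A minimal sketch of Eq. (3) for discrete variables follows (the helper names and the tuple encoding of the conditioning set Z are our own choices; g1 and g2 are assumed non-constant so the entropies are positive):

import numpy as np
from collections import Counter

def entropy(x):
    # Shannon entropy (bits) of a discrete sequence
    n = len(x)
    return -sum(c / n * np.log2(c / n) for c in Counter(x).values())

def cond_mi(a, b, cond):
    # MI(a, b | cond) in bits; cond is a tuple of discrete sequences
    n = len(a)
    strata = list(zip(*cond)) if cond else [()] * n
    mi = 0.0
    for k in set(strata):
        idx = [i for i in range(n) if strata[i] == k]
        m = len(idx)
        pa = Counter(a[i] for i in idx)
        pb = Counter(b[i] for i in idx)
        pab = Counter((a[i], b[i]) for i in idx)
        mi += (m / n) * sum(
            (c / m) * np.log2((c / m) / ((pa[u] / m) * (pb[v] / m)))
            for (u, v), c in pab.items())
    return mi

def si_measure(g1, g2, y, Z=()):
    # SI significance of g1 -> y <- g2, Eq. (3)
    gain = cond_mi(g1, g2, Z + (y,)) - cond_mi(g1, g2, Z)
    return gain / min(entropy(g1), entropy(g2))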


GeneIdentification (D, y, R)
Input: D: sample set, y: target variable, R: conflict-free V-Structures.
Output: CG: result causal genes
  PC = the PCs of target node y;
  Initialize the result causal genes CG = ∅;
  /* Find explicit causal nodes */
  foreach gi → y ← gj ∈ R do
    PC = PC − {gi, gj};
    CG = CG ∪ {gi, gj};
  /* Remove the explicit non-causal nodes */
  foreach gi ∈ PC do
    if ∃ y → gi ← gj ∈ R then
      PC = PC − {gi};
  /* Find implicit causal nodes */
  foreach gi ∈ PC do
    if ∃ gj → gi ← gl ∈ R then
      CG = CG ∪ {gi};
  return CG;
Algorithm 2: Gene Identification

4.4. Gene identification

Given the V-Structures without conflict, the final step of our method is to identify the causal genes related to the target node y. This step is not straightforward, due to the existence of three types of nodes in the V-Structures, as discussed below.

Explicit causal nodes: The explicit causal nodes are the nodes with a strong causal association with the target node, which can be read off directly from the V-Structures. In Fig. 5(a), for example, we have g1 → y ← g2, which tells us that g1 and g2 are explicit causal nodes of y.

Explicit non-causal nodes: In Fig. 5(b), g1 is an explicit non-causal node of y. Since the V-Structure y → g1 ← g2 exists, g1 can be removed from the causal candidate node set.

Implicit causal nodes: The implicit causal nodes can be discovered with the help of the V-Structures of target node y's PC set, based on the MDL principle in causal inference. As shown in Fig. 5(c), g2 → g1 ← g3 exists but neither g2 → g1 ← y nor g3 → g1 ← y exists. This information helps us infer that there is a direct edge g1 → y. Otherwise, there would be a directed edge y → g1, which would form new V-Structures together with the edges g2 → g1 and g3 → g1.

In Algorithm 2, we list the details of the gene identification procedure, which returns both explicit and implicit causal nodes by exploring R, the conflict-free V-Structure set generated by the function CombineVStructure; a Python sketch follows. We summarize the full map of our algorithm in Fig. 6. In the following, we provide some analysis of the completeness of causal gene identification using our algorithm. In particular, we give a sufficient condition on the Bayesian network over the genes which guarantees the identification of the causal genes.

Theorem 2. When the noise on the samples is sufficiently small, Algorithm 2 always returns the exact causal genes if (1) y has more than one parent in the Bayesian network, or (2) y has only one parent gi, and gi has more than one parent.

Proof. We prove the theorem for the two cases separately. In the first case, when y has at least two parents in the Bayesian network, for each causal gene gi there is at least one other gene gj such that gi → y ← gj is a valid V-Structure. Therefore, our algorithm is capable of identifying gi by investigating all explicit causal nodes. In the second case, when y has only one parent gi and gi has at least two parents gj and gk, there is a valid V-Structure gj → gi ← gk involving gi, gj and gk. Since gi is the only causal factor for y, it is impossible to find V-Structures of the form gj → gi ← y or gk → gi ← y. This shows that gi must be found as an implicit causal node by our algorithm. □
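A compact Python rendering of Algorithm 2 follows (our own illustrative encoding: each V-Structure is a tuple (parent, collider, parent)):

def identify_genes(pcs, R, target):
    # pcs: parent candidates of the target; R: conflict-free V-structures
    pc, cg = set(pcs), set()
    # explicit causal nodes: parents of the target in some V-structure
    for a, c, b in R:
        if c == target:
            cg |= {a, b}
            pc -= {a, b}
    # explicit non-causal nodes: PCs that collide the target with another gene
    for a, c, b in R:
        if c in pc and target in (a, b):
            pc.discard(c)
    # implicit causal nodes: remaining PCs that collide two non-target genes
    for a, c, b in R:
        if c in pc and target not in (a, b):
            cg.add(c)
    return cg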


Fig. 5. Three cases of causality inference: (a) explicit causal node; (b) explicit non-causal node; (c) implicit causal node.

Fig. 6. Sketch of the full solution to the causal gene identification problem.

While the theorem above implies that our algorithm finds all causal genes only when y has more than one parent or y's parent has more than one parent, we can further extend its applicability by applying Algorithm 2 recursively on the causal nodes until two causes are identified for some gene in the chain. It is not difficult to show that the causal genes cannot be found only when there is a single chain directing to the disease status variable y. However, the recursive method usually does not work well in practice, mainly because of the random noise on the samples. In our experiments, therefore, we only test Algorithm 2 with respect to the direct causes of the disease status y.

5. Experiments

In this section, we evaluate our proposal, called SVS in this section, on both a low-dimensional simulated dataset and high-dimensional real gene expression data. In all implementations, the same G² conditional independence test (Spirtes et al., 2001) is employed, with the conditional independence threshold at 95%.

5.1. Results on simulated data

The main purpose of the experiments on simulated datasets is to evaluate the accuracy and scalability of our proposal against existing causal inference techniques applicable only at moderate dimensionality. Specifically, we simulate a network of 10 genes with binary values only, following the causal structure presented in Fig. 7. The same causal structure was used in Mukhopadhyay and Chatterjee (2007) to test Granger's causality discovery method. Although the causal network consists of only 10 variables, it contains all types of causal structures discussed in the previous section. For example, g2 → g3 ← g10 yields explicit causal nodes, g4 can be removed from the causal candidate set with respect to g3 as an explicit non-causal node, and g1 → g2 is found as an implicit causal node. The independent nodes in the network, namely g1, g7, g8, g9, are randomly generated following a Bernoulli distribution outputting 0 and 1 with probability 0.5. The states of the other nodes follow random conditional probability tables, which assign probabilities to the variables conditioned on every combination of the causal variables. The details of the data generation process are given in the Appendix. In the experiments, the results are averaged over 100 runs, i.e. using 100 different sample tables built on different conditional probability tables. Since the causal structure is known, we can evaluate the performance using the standard recall and precision of the result genes given by the algorithms. Specifically, precision is the fraction of result causal genes which are true causes of the target node, i.e.

Precision = |{DiscoveredCausalNode} ∩ {CausalNode}| / |{DiscoveredCausalNode}|.

Similarly, recall is the fraction of true causal genes found by the algorithm, i.e.

Recall = |{DiscoveredCausalNode} ∩ {CausalNode}| / |{CausalNode}|.

Fig. 7. Causal structure of the simulated data set.

In our experiments, each gi (1 ≤ i ≤ 10) is used as the target node y = gi.
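For concreteness, the two measures transcribe directly (our own helper, not the authors' code):

def precision_recall(discovered, true_causes):
    # fraction of discoveries that are true causes, and of true causes found
    tp = len(set(discovered) & set(true_causes))
    precision = tp / len(discovered) if discovered else 0.0
    recall = tp / len(true_causes)
    return precision, recall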

Comparison of significance measures: In Fig. 8, we report the correct percent, recall and precision of two versions of our algorithm with the two different significance measures. Here, correct percent refers to the number of conflicting V-Structure pairs in which the correct V-Structure's significance is higher, divided by the number of all conflicting V-Structure pairs. As shown in the figure, the correct percent of both significance measures is higher than 0.5, which demonstrates the usefulness of both. In terms of the significance measures, the information theory based approach, SI, outperforms the independence test approach, ST, on correct percent, recall and precision when the sample size is small. SI can capture important information even when very few samples are available. When the sample size grows, although ST shows slightly better recall, SI achieves an excellent balance between recall and precision. We conclude that SI is better at identifying significant V-Structures. When increasing the number of samples used in the testing, SVS achieves the most significant improvements as the sample size grows from 40 to 80. This implies that 80 samples are enough for SVS to capture the important causal relations. Moreover, the number of samples needed depends heavily on the local connectivity of the structure. We believe that when SVS is used in problems with connectivity structures similar to the simulated data set, 80 samples are enough to capture the main causal relations. As the results show the superiority of SI over ST, we only test our algorithm with SI in the rest of this section.

Comparison of causation search methods: To show the effectiveness of the objective function (Eq. (1)) and our heuristic algorithm, we report in Fig. 9 the recall and precision of three different methods for finding causation based on the conflicting V-Structures. SVS denotes the proposed heuristic method; Optimal applies an exhaustive


Fig. 10. Comparison with MMHC on the simulated data set.

Fig. 8. Comparison of significance on the simulated data set.


Fig. 9. Comparison of search method on the simulated data set.

search strategy using the same objective function as SVS; Rand randomly removes one of the conflicting V-Structures without considering significance. As shown in the figure, SVS performs much better than Rand, and obtains acceptable recall and precision compared with Optimal. It is interesting to observe that SVS achieves the same recall and precision as Optimal when the sample size is larger than 320. This is because, when the sample size is larger than 320, conflicts only occur between two V-Structures, so SVS reaches the optimal solution of the objective function. Moreover, the large improvement of SVS and Optimal over Rand provides an empirical justification of the objective function employed in our combinatorial formulation.

Comparison with MMHC: Next, our proposed method SVS is compared with MMHC (Tsamardinos, Brown, & Aliferis, 2006), the mainstream causal inference method. Causal Explorer's2 implementation of MMHC is used in our experiments. Compared with MMHC, our SVS algorithm with either significance measure is better on recall and precision. Since MMHC constructs the complete DAG, the parent nodes of the target node are all returned as causal genes. Fig. 10 shows that SVS outperforms MMHC by a huge margin, especially when the sample size

2 http://discover1.mc.vanderbilt.edu/discover/public/causal_explorer/index.html.

Table 2
Statistics on wrong results.

Edge       Error type       SVS  MMHC
g1 → g2    Wrong direction  18   32
           False Negative    4    2
g6 → g4    Wrong direction  12   30
           False Negative    8    3
g4 → g5    Wrong direction  14   37
           False Negative    3   11
g9 → g6    Wrong direction   3   22
           False Negative    9   12
g7 → g6    Wrong direction   2   25
           False Negative   10    9
g8 → g6    Wrong direction   2   23
           False Negative    9   10
g10 → g4   False Positive   10    8
is not large. This shows that our SVS algorithm is robust even when the samples are insufficient.

False causality result analysis: In the following, we provide a detailed case-by-case analysis of the wrong results reported by SVS. There are three different types of wrong results: False Negative, False Positive and Wrong Direction. Assuming g2 in Fig. 7 is the target node, if g1 is not discovered, the causal relation g1 → g2 is a False Negative; if g3 is discovered as a causal node of g2, then g3 → g2 is a Wrong Direction; and if g10 is detected as a causal node of g2, then the edge g10 → g2 is a False Positive. All wrong results with appearance frequency larger than 10 in our experiments are listed in Table 2. Table 2 shows that SVS reports Wrong Direction results much less frequently than MMHC, while the frequencies of False Positive and False Negative results are almost the same. We thus conclude that SVS possesses a strong ability to correctly infer the directions of the edges, i.e. the causal relations among the variables. Moreover, the directions of g1 → g2 and g4 → g5 are the most frequent wrong results for the SVS method, mainly because the implicit inference step is not sufficiently reliable.

5.2. Results on real data

To test the effectiveness of our proposal in real causal gene studies, we run our method on real datasets and verify the results against a biological pathway database. In particular, we run our analysis on the prostate cancer dataset (Singh et al., 2002) and the


Table 3
General information of the data sets.

Data set   Classes and sample #            Study goal
Prostate   Prostate (52) : ordinary (50)   Genes causing prostate cancer
Leukemia   ALL (47) : AML (25)             Genes differentiating AML and ALL
Table 4
Causal genes discovered on the prostate data set.

Probe ID     Gene name   Description
37327_at     EGFR        Epidermal growth factor receptor
32748_at     RPS27       Ribosomal protein S27
33244_at     CHN2        Chimerin 2
33139_s_at   MLF1        Myeloid leukemia factor 1
Leukemia dataset. The prostate dataset contains 52 prostate cancer patient samples and 50 ordinary samples, each of which contains the expression levels of 12,600 genes. The Leukemia dataset consists of 47 samples of the ALL subtype of Leukemia and 25 samples of the AML subtype, each of which contains the expression levels of 7129 gene segments. Note that the aim of the study on the prostate dataset is to find the causal genes deciding the target node, i.e. prostate cancer status, with binary values ''normal'' or ''cancer'', while the goal of the Leukemia study is to identify the causes responsible for the state being ''ALL'' or ''AML''. The general information of the datasets is summarized in Table 3. We emphasize again that no existing algorithm, such as MMHC, is capable of running causality analysis on such high-dimensional datasets. Discretization is an important preprocessing step in causal gene discovery. The entropy discretization procedure (Dougherty, Kohavi, & Sahami, 1995) is well accepted in gene expression data analysis. However, entropy discretization may cause over-fitting problems in our high-dimensional setting. Instead, we employ a two-state Gaussian Mixture Model to discretize the data (Xing, Jordan, & Karp, 2001); this strategy exploits the expressed/suppressed two-state assumption on each gene, as sketched below.
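A minimal sketch of this discretization step (our own code, assuming scikit-learn; mapping the higher-mean component to the ''expressed'' state 1 is our convention):

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_discretize(expr):
    # fit a two-component 1-D Gaussian mixture to one gene's expression
    # values and binarize each sample by its most likely component
    x = np.asarray(expr, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    labels = gmm.predict(x)
    # relabel so that 1 corresponds to the higher-mean (expressed) component
    if gmm.means_[0, 0] > gmm.means_[1, 0]:
        labels = 1 - labels
    return labels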

3 http://en.wikipedia.org/wiki/RPS27.
4 http://en.wikipedia.org/wiki/Chimerin_2.
5 http://www.genome.jp/kegg/pathway/hsa/hsa05215.html.

Prostate cancer: All the causal genes listed in Table 4 have been reported in previous prostate cancer research. EGFR, the epidermal growth factor receptor, has been reported to be highly correlated with prostate cancer (Mimeault, Pommery, & Hénichart, 2003); EGFR's role in prostate gene therapy also indicates that it is a direct causal gene of prostate cancer. RPS27, ribosomal protein S27, plays an important role in genetic information processing,3 and it is believed that the misregulation of such genetic processing is generally a cause of cancer. CHN2 encodes a protein of the chimerin family, which plays a role in the proliferation and migration of smooth muscle cells.4 MLF1, myeloid leukemia factor 1, was first discovered as a factor in Leukemia; recent research has found that MLF1 is also involved in other cancers, because of its role in transcriptional misregulation (Badano, Teslovich, & Katsanis, 2005). To provide further biological interpretation of the discovered causal genes, we try to locate the results in the known pathway of prostate cancer. We verify our results with the KEGG pathways,5 which provide an easy-to-use visualization interface. One of the discovered causal genes, EGFR, is found in the prostate pathway. Fig. 11 shows the location of EGFR on the skeleton of the prostate pathway.

Fig. 11. Discovered Gene's role in prostate pathway.

Table 5
Causal genes discovered on the leukemia data set.

Probe ID            Gene name       Description
U57342_at           MLF2            Myelodysplasia/myeloid leukemia factor 2 mRNA
HG3991-HT4261_at    MEF2A           Human myocyte-specific enhancer factor 2A
X59350_at           CD22            CD22 antigen
X14850_at           HISTONE H2A.X   One of several genes coding for histone H2A
EGFR appears at the very origin of the prostate pathway, which controls almost all the functions related to prostate cancer, such as cell cycle progression, cell survival, cell proliferation and tumor growth. Thus, we can believe that EGFR is a causal gene of prostate cancer, which verifies our results.

Leukemia: In this group of studies, we try to find genes distinguishing between two subclasses of Leukemia, i.e. acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The causal genes accounting for the difference between AML and ALL are listed in Table 5. Among the result genes, MLF2 is a myeloid leukemia factor, which has been verified in existing research as one of the causes deciding whether AML or ALL occurs (http://www.wikigenes.org/e/gene/e/8079.html). MEF2A is a member of the MEF2 family. Another member of this family, MEF2C, has also been found to be an important factor with respect to AML; as Schüler et al. put it, ''a causal role of up-regulated MEF2C expression in myelomonocytic acute myeloid leukemia (AML) has recently been demonstrated'' (Schüler et al., 2008). CD22 plays an important role in the B cell receptor signaling pathway, which is an important pathway in the lymphoblastic system (Hasler & Zouali, 2001). Thus it is reasonable that CD22 is one of the causes of the difference between AML and ALL. Histone H2A.X is reported to be responsible for rendering different classes of Leukemia, because of its impact on the promotion of B-cell tumorigenesis (Walsh & Rosenquist, 2005).

6. Conclusions and discussion

In this work, we presented the first practical solution to the causal gene identification problem, using a new formulation of combinatorial V-Structure search. To fully utilize the framework, we discussed how to choose significance measures and how to pick causal genes based on the V-Structure combination. Our proposal enables analysis of extremely high-dimensional gene expression data. Our results on synthetic and real data show that our proposal is much more effective than existing solutions in the literature at tackling the problems of high dimensionality and small sample size at the same time. While our method turns out to be effective, there remains room for improvement. Our current solution to the combinatorial


formulation only applies a simple greedy selection heuristic. It would be interesting to attempt other combinatorial algorithms with performance guarantees. We are also keen to test on more human genomic data for new findings of causal genes.

Acknowledgments

This work is financially supported by the Natural Science Foundation of China (61070033, 61100148), the Natural Science Foundation of Guangdong province (S2011040004804), Key Technology Research and Development Programs of Guangdong Province (2010B050400011), the Opening Project of the State Key Laboratory for Novel Software Technology (KFKT2011B19), the Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (LYM11060), the Science and Technology Plan Project of Guangzhou City (12C42111607), and the Science and Technology Plan Project of Panyu District, Guangzhou (2012-Z-03-67).

Appendix. Pseudocode of synthetic data generation

Given the parent set of a variable g, the generation of its Conditional Probability Table is presented in Algorithm 3.

CPTGenerator (gparents, g)
Input: gparents: the parent variables, g: target variable.
Output: CPT: Conditional Probability Table
  Set CPT to an empty table with 2^|gparents| rows and |gparents| + 1 columns;
  foreach row of CPT do
    Set the first |gparents| columns to a configuration of the parent variables;
    Set the (|gparents| + 1)-th column to 0 or 1 with equal probability;
  return CPT;
Algorithm 3: Conditional Probability Table Generator

Given the states of the parent variables xparents, the conditional probability table CPT and the noise ratio pnoise, Algorithm 4 generates the state x. In this work, pnoise = 0.05, which means that 5% of the samples are noise.

DataGenerator (CPT, xparents, pnoise)
Input: xparents: the states of the parent variables, CPT: Conditional Probability Table, pnoise: noise ratio.
Output: x: the state of the target variable
  Uniformly generate a number p ∈ [0, 1];
  if p > pnoise then
    Look up the row in CPT whose first |gparents| columns equal xparents;
    Set x to the (|gparents| + 1)-th column of that row;
  else
    Set x to 0 or 1 with equal probability;
  return x;
Algorithm 4: Random Data Generator
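The two procedures transcribe directly into Python (our own sketch):

import itertools
import random

def cpt_generator(n_parents):
    # Algorithm 3: for every parent configuration, fix the child's value
    # to 0 or 1 with equal probability (a random Boolean function)
    return {cfg: random.randint(0, 1)
            for cfg in itertools.product((0, 1), repeat=n_parents)}

def data_generator(cpt, x_parents, p_noise=0.05):
    # Algorithm 4: CPT lookup with probability 1 - p_noise, else random
    if random.random() > p_noise:
        return cpt[tuple(x_parents)]
    return random.randint(0, 1)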

References

http://www.wikigenes.org/e/gene/e/8079.html.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010a). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation. Journal of Machine Learning Research, 11, 171–234.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010b). Local causal and Markov blanket induction for causal discovery and feature selection for classification part II: analysis and extensions. Journal of Machine Learning Research, 11, 235–284.
Badano, J., Teslovich, T., & Katsanis, N. (2005). The centrosome in human genetic disease. Nature Reviews Genetics, 6(3), 194–205.
Bassett, D., Eisen, M., & Boguski, M. (1999). Gene expression informatics—it's all in your mine. Nature Genetics, 21, 51–55.
Bromberg, F., & Margaritis, D. (2009). Improving the reliability of causal discovery from small data sets using argumentation. Journal of Machine Learning Research, 10, 301–340.
Cai, R., Hao, Z., Yang, X., & Wen, W. (2009). An efficient gene selection algorithm based on mutual information. Neurocomputing, 72(4–6), 991–999.
Cai, R., Tung, A. K. H., Zhang, Z., & Hao, Z. (2011). What is unequal among the equals? Ranking equivalent rules from gene expression data. IEEE Transactions on Knowledge and Data Engineering, 23(11), 1735–1747.
Cai, R., Zhang, Z., & Hao, Z. (2011). BASSUM: a Bayesian semi-supervised method for classification feature selection. Pattern Recognition, 44(4), 811–820.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M., & Lathrop, M. (2009). Mapping complex disease traits with global gene expression. Nature Reviews Genetics, 10(3), 184–194.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the international conference on machine learning (pp. 194–202).
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. In RECOMB (pp. 127–135).
Hasler, P., & Zouali, M. (2001). B cell receptor signaling and autoimmunity. The FASEB Journal, 15(12), 2085–2098.
He, Y., & Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9, 2523–2547.
Kalisch, M., & Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8, 613–636.
Kim, S., Imoto, S., & Miyano, S. (2004). Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems, 75(1–3), 57–65.
Kim, Y., Wuchty, S., & Przytycka, T. (2011). Identifying causal genes and dysregulated pathways in complex diseases. PLoS Computational Biology, 7(3), e1001095.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. The MIT Press.
Lozano, A., Abe, N., Liu, Y., & Rosset, S. (2009). Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics, 25(12), i110.
Mimeault, M., Pommery, N., & Hénichart, J. (2003). New advances on prostate carcinogenesis and therapies: involvement of EGF–EGFR transduction system. Growth Factors, 21(1), 1–14.
Mukhopadhyay, N., & Chatterjee, S. (2007). Causality and pathway search in microarray time series experiment. Bioinformatics, 23(4), 442.
Noble, D. (2008). Genes and causation. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 366(1878), 3001–3015.
Pearl, J. (2009). Causality: models, reasoning and inference (2nd ed.). Cambridge University Press.
Reiner, A., Yekutieli, D., & Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19(3), 368–375.
Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507.
Schadt, E., Lamb, J., Yang, X., Zhu, J., Edwards, S., GuhaThakurta, D., et al. (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics, 37(7), 710–717.
Schüler, A., Schwieger, M., Engelmann, A., Weber, K., Horn, S., Müller, U., et al. (2008). The MADS transcription factor Mef2c is a pivotal modulator of myeloid cell fate. Blood, 111(9), 4532–4541.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.
Spirtes, P., Glymour, C., & Scheines, R. (2001). Causation, prediction, and search (2nd ed.). The MIT Press.
Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max–min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.
Walsh, S. H., & Rosenquist, R. (2005). Absence of H2AX gene mutations in B-cell leukemias and lymphomas. Leukemia, 19(3), 464.
Xing, E. P., Jordan, M. I., & Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data. In ICML (pp. 601–608).
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205–1224.
Zhu, Z., Ong, Y., & Dash, M. (2007). Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition, 40(11), 3236–3248.
