Encyclopedia of Data Warehousing and Mining Second Edition John Wang Montclair State University, USA

Volume II Data Pro-I

Information Science reference Hershey • New York

Director of Editorial Content: Kristin Klinger
Director of Production: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Amanda Appicello, Jeff Ash, Mike Brehem, Carole Coulson, Elizabeth Duke, Jen Henderson, Chris Hrobak, Jennifer Neidig, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue, Suite 200 Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail: [email protected] Web site: http://www.igi-global.com/reference and in the United Kingdom by Information Science Reference (an imprint of IGI Global) 3 Henrietta Street Covent Garden London WC2E 8LU Tel: 44 20 7240 0856 Fax: 44 20 7379 0609 Web site: http://www.eurospanbookstore.com Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark. Library of Congress Cataloging-in-Publication Data Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed. p. cm. Includes bibliographical references and index. Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided by publisher. ISBN 978-1-60566-010-3 (hardcover) -- ISBN 978-1-60566-011-0 (ebook) 1. Data mining. 2. Data warehousing. I. Wang, John, QA76.9.D37E52 2008 005.74--dc22 2008030801

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this encyclopedia set is new, previously-unpublished material. The views expressed in this encyclopedia set are those of the authors, but not necessarily of the publisher.



Section: Bioinformatics

Integrative Data Analysis for Biological Discovery
Sai Moturu, Arizona State University, USA
Lance Parsons, Arizona State University, USA
Zheng Zhao, Arizona State University, USA
Huan Liu, Arizona State University, USA

INTRODUCTION As John Muir noted, “When we try to pick out anything by itself, we find it hitched to everything else in the Universe” (Muir, 1911). In tune with Muir’s elegantly stated notion, research in molecular biology is progressing toward a systems level approach, with a goal of modeling biological systems at the molecular level. To achieve such a lofty goal, the analysis of multiple datasets is required to form a clearer picture of entire biological systems (Figure 1). Traditional molecular biology studies focus on a specific process in a complex biological system. The availability of high-throughput technologies allows us to sample tens of thousands of features of biological samples at the molecular level. Even so, these are limited to one particular view of a biological system governed by complex relationships and feedback mechanisms on a variety of levels. Integrated analysis of varied biological datasets from the genetic, translational, and protein levels promises more accurate and comprehensive results, which help discover concepts that cannot be found through separate, independent analyses. With this article, we attempt to provide a comprehensive review of the existing body of research in this domain.

Figure 1. Complexity increases from the molecular and genetic level to the systems level view of the organism (Poste, 2005).



BACKGROUND The rapid development of high-throughput technologies has allowed biologists to obtain increasingly comprehensive views of biological samples at the genetic level. For example, microarrays can measure gene expression for the complete human genome in a single pass. The output from such analyses is generally a list of genes (features) that are differentially expressed (upregulated or downregulated) between two groups of samples or ones that are coexpressed across a group of samples. Though every gene is measured, many are irrelevant to the phenomenon being studied. Such irrelevant features tend to mask interesting patterns, making gene selection difficult. To overcome this, external information is required to draw meaningful inferences (guided feature selection). Currently, numerous high-throughput techniques exist along with diverse annotation datasets presenting considerable challenges for data mining (Allison, Cui, Page & Sabripour, 2006). Sources of background knowledge available include metabolic and regulatory pathways, gene ontologies, protein localization, transcription factor binding, molecular interactions, protein family and phylogenetic information, and information mined from biomedical literature. Sources of high-throughput data include gene expression microarrays, comparative genomic hybridization (CGH) arrays, single nucleotide polymorphism (SNP) arrays, genetic and physical interactions (affinity precipitation, two-hybrid techniques, synthetic lethality, synthetic rescue) and protein arrays (Troyanskaya, 2005). Each type of data can be richly annotated using clinical data from patients and background knowledge. This article focuses on studies using microarray data for the core analysis combined with other data or background knowledge. This is the most commonly available data at the moment, but the concepts can be applied to new types of data and knowledge that will emerge in the future. 
Gene expression data have been widely used to study subjects ranging from biological processes and diseases to tumor classification and drug discovery (Carmona-Saez, Chagoyen, Rodriguez, Trelles, Carazo & Pascual-Montano, 2006). These datasets contain measurements for thousands of genes. However, because the experiments are expensive, there are very few samples relative to the number of genes. This leads to the curse of dimensionality (Yu & Liu, 2004). Let M be the number of samples and N be the number of genes. Computational learning theory suggests that the search space is exponentially large in terms of N and that the number of samples required for reliable learning about given phenotypes is on the scale of O(2^N) (Mitchell, 1997; Russell & Norvig, 2002). Even the minimum requirement given by a statistical “rule of thumb” (M = 10*N) is patently impractical for such a dataset (Hastie, Tibshirani & Friedman, 2001). With limited samples, analyzing a dataset using a single criterion leads to the selection of many statistically relevant genes that are equally valid in interpreting the data. However, statistical significance does not always correspond to biological relevance. Traditionally, additional information is used to guide the selection of biologically relevant genes from the list of statistically significant genes. Using such information during the analysis phase to guide the mining process is more effective, especially when dealing with such complex processes (Liu, Dougherty, Dy, Torkkola, Tuv, Peng, Ding, Long, Berens, Parsons, Zhao, Yu & Forman, 2005; Anastassiou, 2007).
Data integration has been studied for a long time, ranging from early applications in distributed databases (Deen, Amin & Taylor, 1987) to more recent ones in sensor networks (Qi, Iyengar & Chakrabarty, 2001) and even biological data (Lacroix, 2002; Searls, 2003). However, the techniques we discuss are those that use integrative analyses, as opposed to those that integrate data or analyze such integrated data. The difference lies in the use of data from multiple sources in an integrated analysis framework. The range of biological data available and the variety of applications make such analyses particularly necessary to gain biological insight from a whole organism perspective.
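As a rough illustration of the sample-size gap described above, the following Python sketch computes the sample count implied by the M = 10*N rule of thumb and the order of magnitude of 2^N. The values chosen for N and M (20,000 genes, 100 samples) and the function names are illustrative assumptions, not figures from any cited study.

```python
import math


def rule_of_thumb_samples(n_genes: int, factor: int = 10) -> int:
    """Samples suggested by the M = factor * N rule of thumb."""
    return factor * n_genes


def decimal_digits_of_two_to_n(n_genes: int) -> int:
    """Number of decimal digits in 2**N, a proxy for the O(2^N) scale."""
    return math.floor(n_genes * math.log10(2)) + 1


# Hypothetical microarray study: many genes, few samples.
n_genes = 20_000
m_available = 100

needed = rule_of_thumb_samples(n_genes)
shortfall_factor = needed / m_available

print(f"Rule of thumb asks for {needed} samples; {m_available} are available.")
print(f"That is a {shortfall_factor:.0f}-fold shortfall.")
print(f"2^N has about {decimal_digits_of_two_to_n(n_genes)} decimal digits.")
```

Even under the lenient rule of thumb, the hypothetical study falls short by three orders of magnitude, which is why additional sources of information are brought in to constrain the analysis.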

MAIN FOCUS With the increasing availability of several types of data, more researchers are attempting integrated analyses. One reason for using numerous datasets is that high-throughput data often sacrifice specificity for scale (Troyanskaya, 2005), resulting in noisy data that might generate inaccurate hypotheses. Replicating experiments can help remove noise, but replicates are costly. Combining data from different experimental sources and knowledge bases is an effective way to reduce the effects of noise and generate more accurate hypotheses. Multiple sources provide additional information that, when analyzed together, can explain phenomena that one source cannot. This is commonly observed in data mining and statistical pattern recognition problems (Troyanskaya, 2005; Rhodes & Chinnaiyan, 2005; Quackenbush, 2007). In addition, the systems being studied are inherently complex (Figure 1). One dataset provides only one view of the system, even if it looks at it from a whole organism perspective. As a result, multiple datasets providing different views of the same system are intuitively more useful in generating biological hypotheses.
There are many possible implementation strategies to realize the goals of integrative analysis. One can visualize the use of one dataset after another in a serial model or all the datasets together in a parallel model. A serial model represents classical analysis approaches (Figure 2a). Using all the datasets at once to perform an integrative analysis would be a fully parallel model (Figure 2b). By combining those approaches, we arrive at a semi-parallel model that combines some of the data early on, but adds some in later stages of the analysis. Such an analysis could integrate different groups of data separately and then integratively process the results (Figure 2c).
The fully parallel model combines all the datasets together and performs the analysis without much user intervention. Any application of expert knowledge is done at the preprocessing stage or after the analysis. An algorithm that follows this model allows the use of heterogeneous datasets and can therefore be applied in later studies without need for modification.
The semi-parallel model is used when expert knowledge is needed to guide the analysis towards more specific goals. A model that can be tailored to a study and uses expert knowledge can be more useful than a generic model. An algorithm that follows this model requires tuning specific to each experiment and cannot be applied blindly to diverse datasets.
These models represent the basic ways of performing an integrative analysis.
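The three models can be contrasted with a small Python sketch. It is only a toy under stated assumptions: each dataset is reduced to a per-gene score in [0, 1], "integration" is a plain average of those scores, and the gene names, scores, threshold and function names are all hypothetical. Real integrative methods use far richer representations.

```python
from typing import Dict, List, Set

Scores = Dict[str, float]  # gene name -> evidence score in [0, 1]


def select(scores: Scores, threshold: float) -> Set[str]:
    """Keep genes whose score reaches the threshold."""
    return {gene for gene, score in scores.items() if score >= threshold}


def combine(datasets: List[Scores]) -> Scores:
    """Toy integration step: average each gene's score over the datasets that measure it."""
    genes = set().union(*datasets)
    combined = {}
    for gene in genes:
        values = [d[gene] for d in datasets if gene in d]
        combined[gene] = sum(values) / len(values)
    return combined


def serial_model(expression: Scores, knowledge: Set[str], threshold: float) -> Set[str]:
    """Serial: analyze one dataset, then filter the gene list with background knowledge."""
    return select(expression, threshold) & knowledge


def fully_parallel_model(datasets: List[Scores], threshold: float) -> Set[str]:
    """Fully parallel: combine every source at once, then analyze."""
    return select(combine(datasets), threshold)


def semi_parallel_model(groups: List[List[Scores]], threshold: float) -> Set[str]:
    """Semi-parallel: integrate each group separately, then integrate the results."""
    intermediate = [combine(group) for group in groups]
    return select(combine(intermediate), threshold)


# Hypothetical inputs.
expression = {"g1": 0.9, "g2": 0.8, "g3": 0.2}
binding = {"g1": 0.7, "g2": 0.1, "g3": 0.9}
annotation = {"g1": 1.0, "g2": 0.0, "g3": 1.0}  # knowledge base encoded as 0/1 scores
known_pathway_genes = {"g1", "g3"}

serial = serial_model(expression, known_pathway_genes, 0.5)
parallel = fully_parallel_model([expression, binding, annotation], 0.5)
semi = semi_parallel_model([[expression, binding], [annotation]], 0.5)
print(serial, parallel, semi)
```

In this toy setting the serial model discards g3 because its expression score alone is low, while both integrative models retain it on the strength of the other sources.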

REPRESENTATIVE SELECTION OF ALGORITHMS Integrative analyses using data from diverse sources have shown promise of uncovering knowledge that is not found when analyzing a single dataset. These studies analyze disparate datasets to study subjects ranging from gene function to protein-protein interaction. Common to all these studies is the use of expert knowledge and an increase in prediction accuracy when executing integrative analysis of varied datasets compared to results obtained when executing classical analysis with a single data source. More specifically, these studies use data sources that include both high-throughput and conventional data to perform an analysis and generate inferences. This is in contrast to conventional studies where a high-throughput data source is used for analysis and additional information is later sought to help in drawing inferences. A variety of data mining techniques, including decision trees, biclustering, Bayesian networks, and kernel methods, have been applied to the integrated analysis of biological data.
Troyanskaya, Dolinski, Owen, Altman & Botstein (2003) developed a framework called Multisource Association of Genes by Integration of Clusters (MAGIC) that allows the combination of different biological data with expression information to make predictions about gene function using Bayesian reasoning. The structure of the Bayesian network is created using expert input. The inputs to this network are matrices representing gene-gene relationships, and the output is groups of functionally related genes. This approach falls under the fully parallel model with the use of expert knowledge at the preprocessing stage. Each data source is processed individually to create a matrix, and all these matrices are then analyzed together integratively.
Jansen, Yu, Greenbaum, Kluger, Krogan, Chung, Emili, Snyder, Greenblatt & Gerstein (2003) predicted protein-protein interactions using experimental and annotation data separately. Bayesian networks were used to obtain an experimental probabilistic interactome and a predicted probabilistic interactome, respectively. These two were then combined to give a final probabilistic interactome. This experiment falls under the semi-parallel model.
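One simple way to picture how separately derived evidence can be merged into a final probability is naive Bayes combination of likelihood ratios, sketched below in Python. This is an illustrative simplification that assumes the evidence sources are conditionally independent; it is not the implementation used in the studies above, and the prior odds, likelihood ratios and function names are made-up values for the example.

```python
from math import prod
from typing import Iterable


def combined_likelihood_ratio(ratios: Iterable[float]) -> float:
    """Multiply likelihood ratios, assuming conditionally independent evidence."""
    return prod(ratios)


def odds_to_probability(odds: float) -> float:
    """Convert odds in favor of an interaction into a probability."""
    return odds / (1.0 + odds)


# Hypothetical prior odds that a random protein pair interacts.
prior_odds = 1.0 / 600.0

# Stage 1 (semi-parallel): summarize each group of evidence separately.
experimental_lr = combined_likelihood_ratio([10.0, 6.0])  # e.g. two interaction screens
predicted_lr = combined_likelihood_ratio([5.0, 2.0])      # e.g. two annotation features

# Stage 2: combine the two intermediate results into final posterior odds.
final_odds = prior_odds * experimental_lr * predicted_lr
final_probability = odds_to_probability(final_odds)
print(f"posterior odds = {final_odds:.3f}, probability = {final_probability:.3f}")
```

Neither group of evidence alone overcomes the low prior in this example, but together they bring the posterior odds to roughly even.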
Jiang & Keating (2005) developed a multistage learning framework called Annotation Via Integration of Data (AVID) for prediction of functional relationships among proteins. High-confidence networks are built in which proteins are connected if they are likely to share a common annotation. Functional annotation is treated as a classification problem, and annotations are assigned to unannotated proteins based on their neighbors in the network. This method also falls under the fully parallel model, but in a multistage implementation.

Figure 2. Integrative analysis models: (a) serial, (b) fully parallel, (c) semi-parallel.
a. Model for analysis of a single high-throughput dataset with the aid of knowledge base(s) (serial analysis): experimental data are analyzed to give a statistically significant gene list, which is then analyzed with the knowledge base(s) to give a biologically relevant gene list, followed by inference and hypotheses.
b. Model for integrative analysis of multiple high-throughput datasets with the aid of knowledge base(s) (fully parallel analysis): the experimental datasets and knowledge base(s) enter a single integrative analysis that gives a biologically relevant gene list, followed by inference and hypotheses.
c. Model for integrative analysis of multiple high-throughput datasets with the aid of knowledge base(s) (semi-parallel analysis): groups of experimental datasets and knowledge base(s) are first analyzed integratively, and the results enter a further integrative analysis that gives a biologically relevant gene list, followed by inference and hypotheses.

Tanay, Sharan, Kupiec & Shamir (2004) proposed the Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) to analyze mixed datasets and identify genes that show statistically significant correlation across these sources. The data are viewed as properties of the genes, and the underlying biclustering algorithm looks for statistically significant subgraphs of the genes-properties graph to predict functional groupings for various genes. This method again falls under the fully parallel model with some preprocessing.
Lanckriet, Bie, Cristianini, Jordan & Noble (2004) proposed a computational framework where each dataset is represented using a kernel function and these kernel functions are later combined to develop a learning algorithm to recognize and classify yeast ribosomal and membrane proteins. This method could also be categorized as a fully parallel model.
Zhang, Wong, King & Roth (2004) used a probabilistic decision tree to analyze protein-protein interactions from high-throughput methods like yeast two-hybrid and affinity purification coupled with mass spectrometry, along with several other protein-pair characteristics, to predict co-complexed pairs of proteins. This method again follows a fully parallel model with all the datasets contributing to the analysis.
Carmona-Saez, Chagoyen, Rodriguez, Trelles, Carazo & Pascual-Montano (2006) developed a method for integrated analysis of gene expression data using association rules discovery. The itemsets used include the genes along with the set of experiments in which the gene is differentially expressed and the gene annotation. The generated rules are constrained to have gene annotation as an antecedent. Since the datasets used were well studied, the generated rules could be compared with known information, and this comparison showed that the rules were insightful.
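The idea of rules constrained to have gene annotation as the antecedent can be sketched in Python as follows. This is a minimal brute-force illustration, assuming single-item antecedents and consequents; it is not the cited authors' algorithm, and the item prefixes ("ann:", "expr:"), gene names, data and thresholds are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, str, float, float]  # (annotation, expression item, support, confidence)


def annotation_rules(
    transactions: Dict[str, Set[str]],
    min_support: float,
    min_confidence: float,
) -> List[Rule]:
    """Enumerate rules 'annotation -> expression pattern' that pass both thresholds."""
    n = len(transactions)
    items = set().union(*transactions.values())
    annotations = sorted(i for i in items if i.startswith("ann:"))
    patterns = sorted(i for i in items if i.startswith("expr:"))

    rules = []
    for annotation in annotations:
        with_annotation = [t for t in transactions.values() if annotation in t]
        for pattern in patterns:
            both = sum(1 for t in with_annotation if pattern in t)
            support = both / n                        # fraction of all genes
            confidence = both / len(with_annotation)  # fraction of annotated genes
            if support >= min_support and confidence >= min_confidence:
                rules.append((annotation, pattern, support, confidence))
    return rules


# Hypothetical per-gene itemsets: annotation plus observed expression change.
genes = {
    "g1": {"ann:ribosome", "expr:heat_shock_down"},
    "g2": {"ann:ribosome", "expr:heat_shock_down"},
    "g3": {"ann:ribosome", "expr:heat_shock_up"},
    "g4": {"ann:transport", "expr:heat_shock_up"},
}

rules = annotation_rules(genes, min_support=0.5, min_confidence=0.6)
for annotation, pattern, support, confidence in rules:
    print(f"{annotation} -> {pattern} (support={support:.2f}, confidence={confidence:.2f})")
```

Restricting the antecedent to annotation items keeps the rule set small and makes each rule read as a statement about a functional category, which is what allows comparison against known biology.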

APPLICATION TO A SPECIFIC DOMAIN: CANCER Cancer studies provide a particularly interesting application of integrated analysis techniques. Different datasets from studies interrogating common hypotheses can be used to perform meta-analyses, enabling the identification of robust gene expression patterns. These patterns could be identified in a single type of cancer or across different cancers. Annotation databases and pathway resources can be used for functional enrichment of cancer signatures by reducing large signatures to a smaller number of genes. Such additional knowledge is useful to view the existing information in a new light. Understanding a complex biological process requires us to view such knowledge in terms of complex molecular networks. To understand such networks, protein-protein interactions need to be mapped, but information on such protein interaction networks is limited. In addition to protein interaction networks, global transcriptional networks can be used to enhance our interpretation of cancer signatures. Knowledge of transcription factor binding sites allows us to identify the transcription factors that might be involved in particular signatures. Studying model oncogene systems can further enhance our understanding of these signatures (Rhodes & Chinnaiyan, 2005; Hu, Bader, Wigle & Emili, 2007). Examples of studies employing these approaches show the usefulness of integrative analyses in general as well as for their application to cancer.
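A very simple form of meta-analysis across signatures is vote counting: keep the genes that recur in at least a chosen number of independent studies. The Python sketch below illustrates this under stated assumptions; the study contents, gene names, cutoff and function name are hypothetical, and published meta-analyses typically use more careful statistics than raw counts.

```python
from collections import Counter
from typing import Iterable, Set


def robust_signature(signatures: Iterable[Set[str]], min_studies: int) -> Set[str]:
    """Return genes that appear in at least min_studies of the given signatures."""
    counts = Counter(gene for signature in signatures for gene in signature)
    return {gene for gene, count in counts.items() if count >= min_studies}


# Hypothetical differentially expressed gene lists from three studies.
study_a = {"gene_a", "gene_b", "gene_c"}
study_b = {"gene_a", "gene_b", "gene_d"}
study_c = {"gene_a", "gene_e"}

in_two_or_more = robust_signature([study_a, study_b, study_c], min_studies=2)
in_all_three = robust_signature([study_a, study_b, study_c], min_studies=3)
print(sorted(in_two_or_more), sorted(in_all_three))
```

Raising the cutoff trades sensitivity for robustness, which mirrors the reduction of large signatures to a smaller number of genes described above.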

FUTURE TRENDS The future trends in this area can be divided into three broad categories: experimental techniques, analysis methods and application areas. Though we are more interested in the analysis methods, these categories are interlinked and drive each other in terms of change and growth. In addition to existing high-throughput techniques, advancement in newer approaches like SNP arrays, array comparative genome hybridization, promoter arrays, proteomics and metabolomics would provide more high quality information to help us understand complex biological processes (Rhodes & Chinnaiyan, 2005). With such an aggregation of data, integrative analysis techniques would be a key to gaining maximum understanding of these processes.
Apart from experimental techniques, there is tremendous unrealized potential in integrative analysis techniques. A majority of the current methods are fully parallel and generic. They can be applied to different studies using dissimilar datasets but asking the same biological question. Though domain knowledge guides the development of these algorithms to an extent, optimal use of such knowledge is not seen. With the huge amount of diversity in the datasets, the use of expert knowledge is needed to develop a method specific to a domain based on the types of data being used and the questions being asked. This means alternate semi-parallel integrative analysis techniques might be seen in the future that are better suited to a specific study. Such methods can then be adapted to other studies using knowledge about that particular domain and those datasets. Apart from the semi-parallel analysis model seen earlier (Figure 2c), there can be other variations of the model. One such model (Figure 3a) uses a few datasets together to perform the analysis and uses background knowledge (left out of the analysis phase intentionally based on the nature of the study or unintentionally based on the nature of the data) to further refine hypotheses.

Yet another variation of the semi-parallel model (Figure 3b) analyzes some of these datasets individually and uses the generated results to guide the integrative analysis of the other datasets. An example would be to use the results for each gene from an analysis as annotation while analyzing the other datasets. Though many techniques for integrative analysis are seen, a comprehensive evaluation methodology still needs to be adopted to allow for comparisons between techniques and evaluation across varying datasets. In addition, there is room to expand the application of these techniques in the current domains as well as to apply them to newer domains like personalized medicine. In most of the existing studies, the prominent datasets have been at the gene and protein expression level. In the future, we can foresee other datasets like SNP data and clinical data contributing equally to biological discovery.

Figure 3. Alternate semi-parallel analysis models.
a. Alternate model 1 for semi-parallel integrative analysis of multiple high-throughput datasets with the aid of knowledge base(s).
b. Alternate model 2 for semi-parallel integrative analysis of multiple high-throughput datasets with the aid of knowledge base(s).

CONCLUSION The field of biological and biomedical informatics has been developing at a searing pace in recent years. This has resulted in the generation of a voluminous amount of data waiting to be analyzed. The field of data mining has been up to the challenge so far, with researchers taking a keen interest and providing impressive solutions to research questions. However, with the generation of heterogeneous datasets, the scope for analysis from a systems biology perspective has increased. This has stimulated the move from conventional independent analyses to novel integrated analyses. Integrative analysis of biological data has been employed for a short while with some success. Different techniques for such integrative analyses have been developed. We have tried to summarize these techniques in the framework of the models described in Figure 2 and Figure 3. However, varied research questions and applications still exist and new data is being created continuously. Hence, there is a lot of scope for improvement and development of new techniques in this fledgling area, which could ultimately lead to important biological discoveries.

REFERENCES
Allison, D. B., Cui, X., Page, G. P., & Sabripour, M. (2006). Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet, 7(1), 55-65.
Anastassiou, D. (2007). Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology, 3, 83.
Carmona-Saez, P., Chagoyen, M., Rodriguez, A., Trelles, O., Carazo, J. M., & Pascual-Montano, A. (2006). Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics, 7, 54.
Deen, S. M., Amin, R. R., & Taylor, C. C. (1987). Data integration in distributed databases. IEEE Transactions on Software Engineering, 13(7), 860-864.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. New York, NY: Springer.
Hu, P., Bader, G., Wigle, D. A., & Emili, A. (2007). Computational prediction of cancer-gene function. Nature Reviews Cancer, 7, 23-34.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N. J., Chung, S., Emili, A., Snyder, M., Greenblatt, J. F., & Gerstein, M. (2003). A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science, 302(5644), 449-453.
Jiang, T., & Keating, A. E. (2005). AVID: An integrative framework for discovering functional relationships among proteins. BMC Bioinformatics, 6(1), 136.
Lacroix, Z. (2002). Biological Data Integration: Wrapping data & tools. IEEE Transactions on Information Technology in Biomedicine, 6(2), 123-128.
Lanckriet, G. R. G., Bie, T. D., Cristianini, N., Jordan, M. I., & Noble, W. S. (2004). A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626-2635.
Liu, H., Dougherty, E. R., Dy, J. G., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L., Zhao, Z., Yu, L., & Forman, G. (2005). Evolving feature selection. IEEE Intelligent Systems, 20(6), 64-76.
Mitchell, T. M. (1997). Machine Learning. New York, NY: McGraw Hill.
Muir, J. (1911). My First Summer in the Sierra. Boston, MA: Houghton Mifflin.
Poste, G. (2005, November). Integrated Biosystems Research. Biodesign Institute Fall Workshop. Retrieved March 30, 2006, from http://www.biodesign.asu.edu/news/99/
Qi, H., Iyengar, S. S., & Chakrabarty, K. (2001). Multiresolution data integration using mobile agents in distributed sensor networks. IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews, 31(3), 383-391.
Quackenbush, J. (2007). Extracting biology from high-dimensional biological data. The Journal of Experimental Biology, 210, 1507-1517.
Rhodes, D. R., & Chinnaiyan, A. M. (2005). Integrative analysis of the cancer transcriptome. Nature Genetics, 37, S31-S37.
Russell, S. J., & Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
Searls, D. B. (2003). Data integration-connecting the dots. Nature Biotechnology, 21(8), 844-845.
Tanay, A., Sharan, R., Kupiec, M., & Shamir, R. (2004). Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A, 101(9), 2981-2986.
Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B., & Botstein, D. (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A, 100(14), 8348-8353.
Troyanskaya, O. G. (2005). Putting microarrays in a context: Integrated analysis of diverse biological data. Briefings in Bioinformatics, 6(1), 34-43.
Yu, L., & Liu, H. (2004). Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research, 5, 1205-1224.
Zhang, L. V., Wong, S. L., King, O. D., & Roth, F. P. (2004). Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5, 38.

KEY TERMS
Annotation: Additional information that helps understand and interpret data better.
Biological Relevance: Relevance from a biological perspective as opposed to a statistical perspective.
Data Integration: Integration of multiple datasets from disparate sources.
DNA Microarrays: A technology used to measure the gene expression levels of thousands of genes at once.
Genomics/Proteomics: The study of the complete collection of knowledge encoded by DNA and proteins, respectively, in an organism.
High-throughput techniques: Biological techniques capable of performing highly parallel analyses that generate a large amount of data through a single experiment.
Integrated/Integrative Data Analysis: The analysis of multiple heterogeneous datasets under an integrated framework suitable to a particular application.
Meta-analyses: The combination and analysis of results from multiple independent analyses focusing on similar research questions.
