D2PM: a framework for mining generic patterns


Cláudia Antunes
Instituto Superior Técnico

1 Introduction

The growing interest in data mining and its deployment in several distinct business contexts have enlarged its application areas, bringing a set of new challenges into the arena, like dealing with complex data, but also old ones, like the need to incorporate background knowledge into the mining process [1]. The importance of introducing existing knowledge into the core of the process is even stronger in the case of pattern mining, where the balance between the quantity and the quality of the results is far from satisfactory.

The goal of pattern mining is to find the sets of recorded facts that co-occur in some dataset a significant number of times. Naturally, this number is a user-defined parameter, and depending on it the mining process returns a few patterns or thousands of them. Indeed, this potential and usual explosion of discovered patterns has impaired the generalized use of pattern mining in business analysis.

The generally accepted solution for reaching balanced results, in particular a reasonable number of discovered patterns, is the use of domain knowledge to guide the mining process. This role has been played by constraints, applied either to exclude some elements from the discovery in a pre-processing step, or to eliminate part of the discovered patterns in a post-processing operation, for example through the use of interestingness measures. A third way is to apply constraints inside the mining algorithms during the discovery, reducing the patterns generated and discovered and, consequently, minimizing time and memory consumption. Some good results have been reached, for example in sequential pattern mining, where structural constraints are defined by formal languages [2]. However, even though algorithms that use constraints are more efficient than non-constrained ones [3], constraints have seldom been used.

This is mainly due to the lack of a standard way of defining constraints, which implies that most of the time they are very difficult to define, either in defining their structure (for example as automata) or in choosing the right thresholds with precision. The Onto4AR framework [4] proposed a first frame for using domain knowledge in the pattern mining process, but does not explain clearly how constraints are defined based on the available knowledge.

In this paper, we go further and present a reformulation of that framework, proposing the D2PM framework. Here we formulate the problem of pattern mining in the presence of domain knowledge, represented by an ontology, and define the mining process guided by constraints defined over the represented domain knowledge. In particular, we describe what constraints are and how they can be applied.

The rest of the paper is organized as follows. In section 2, we overview the problem of pattern mining and the use of constraints (sub-section 2.1). Along the way, we discuss the difficulties that the use of constraints introduces in pattern mining without a standard way to represent them, and address the key concepts of ontologies and knowledge bases (sub-section 2.2). In sub-section 2.3, we review the Onto4AR framework, explaining its principles and the main difficulties identified so far, and discussing its major weak points. Section 3 introduces the D2PM framework, a new formulation of the previous framework where those issues are solved, and describes how to define constraints based on the available knowledge. The paper ends with some conclusions and points out some directions for future research.

2 Pattern Mining and D3M

Pattern mining is a subtask of mining association rules, a problem that was formulated in 1993 in the context of basket analysis [5] and further developed since then. Formally, let L = {i1, i2, …, im} be a set of m distinct literals, called items, and X ⊆ L a subset of items, known as an itemset. Let D be a set of transactions, i.e., itemsets transacted in the same conditions, under a unique identifier. The goal of pattern mining is to find all frequent itemsets in D, where X is said to be frequent in D if it is contained in at least σ% of the transactions in D, with σ a minimum support threshold defined by the user. From these frequent itemsets it is then possible to generate all the association rules with support and confidence above user-specified thresholds. Association rules are just rules of the form A ⇒ B, where A and B are itemsets [6]. In a more generic formulation, items correspond to propositions, that is, attribute / value pairs, most of the time representing only one attribute of some entity, instead of an entire entity as in basket analysis.

At this point, some remarks have to be made about the pattern mining phase. First, the minimum support threshold, chosen by the user, is the only factor that controls the mining process, which means that all the responsibility for the mining results remains in the user's hands. Certainly, users are the ones who best know the problem domain, and are the best actors to establish the threshold below which there is nothing interesting to see, but a wrong choice can lead to the abandonment of the technique.

The second aspect is related to the choice of the minimum support threshold itself. High levels of minimum support lead to small sets of discovered patterns, but most of the time only trivial information is found. On the other hand, low levels of minimum support usually lead to very large sets of discovered patterns, whose sheer number makes their analysis impossible.

Last but not least, the user has also been responsible for describing the transactions at the proper abstraction level. Indeed, in the basket analysis context, items do not correspond to real instances but to some abstraction: when a customer buys a Heineken Lager Beer, the same item can be bought by another customer. The user has to choose whether to deal with the specific beer from Heineken, with Heineken beers, or just with beers in general. Again, the decision on the right level of abstraction conditions the number and relevance of the discovered patterns.
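The definition of frequent itemsets above can be made concrete with a small sketch. The code below is an illustration only, not the algorithm of any cited work: it is a simplified level-wise search without the usual candidate pruning step, and the toy dataset `D` and the function names `support` and `frequent_itemsets` are assumptions introduced here. The threshold is expressed as a fraction rather than a percentage.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: (k+1)-candidates are unions of frequent k-itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        # Keep only the candidates that reach the minimum support.
        kept = {}
        for x in level:
            s = support(x, transactions)
            if s >= min_support:
                kept[x] = s
        result.update(kept)
        # Join step: merge two frequent k-itemsets that differ in one item.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return result

# Hypothetical toy dataset of four baskets.
D = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
]

for itemset, s in frequent_itemsets(D, min_support=0.5).items():
    print(sorted(itemset), s)
```

Raising `min_support` to 0.75 on this toy data leaves only the two single items `beer` and `chips`, which illustrates how strongly the result depends on the threshold.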
A wrong choice of abstraction level forces the user to re-describe the data and re-run the entire mining process. If it is undeniable that the user should be at the center of the pattern mining process, defining its parameters and context, it is also certain that users want an integrated environment to control the process, one that provides simple mechanisms both for choosing parameters and for evaluating the results, in an iterative way.

2.1 Constraints Usage

In order to provide such an environment, and to make pattern mining easier for the final user, constraints have been proposed. A constraint has been defined as a predicate on the powerset of the set of items L, which means that it is a function c: 2^L → {true, false}. An itemset S is said to satisfy c if and only if c(S) is true. In fact, constraints are the most effective technique to reduce the number of discovered patterns. As pointed out by Bayardo, constraints play a critical role in solving the trade-offs of the generality of data mining algorithms, by focusing "the algorithm on regions of the trade-off curves (or space) known (or believed) to be most promising" [7].

The greatest advantage of constraints is that they keep the control of the mining process in the hands of the user, who continues to assume the responsibility of choosing which aspects are most important for the current analysis. In addition to this responsibility, the user gains a tool to help in choosing those aspects. The greatest risk of constraints is to reduce the discovery to a hypothesis testing task [8], where the constraint is too restrictive and filters out all the unknown information. In this situation, the process only confirms the constraint introduced, giving very little contribution to the knowledge needed to address the problem under analysis.

The advances in the area of knowledge management in recent years made possible the incorporation of richer constraints in non-structured problems. In particular, ontologies have often been used. However, they have mostly been used for post-processing purposes, which means that instead of reducing the number of discovered patterns and processing times, the goal is to present just a subset of the discovered patterns to the user, in accordance with his specification.
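A constraint in the sense above, a predicate c: 2^L → {true, false}, can be sketched as an ordinary boolean function over itemsets. The helpers below (`max_size`, `must_contain_any`, `all_of`, `satisfying`) are hypothetical names chosen for this illustration and do not come from any of the cited frameworks.

```python
from typing import Callable, FrozenSet, Iterable, List

Itemset = FrozenSet[str]
Constraint = Callable[[Itemset], bool]  # c: 2^L -> {True, False}

def max_size(n: int) -> Constraint:
    """Constraint satisfied by itemsets with at most n items."""
    return lambda s: len(s) <= n

def must_contain_any(required: Itemset) -> Constraint:
    """Constraint satisfied by itemsets sharing at least one item with `required`."""
    return lambda s: bool(s & required)

def all_of(*constraints: Constraint) -> Constraint:
    """Conjunction of several constraints."""
    return lambda s: all(c(s) for c in constraints)

def satisfying(itemsets: Iterable[Itemset], c: Constraint) -> List[Itemset]:
    """Keep only the itemsets S for which c(S) is true."""
    return [s for s in itemsets if c(s)]

# Hypothetical candidate itemsets.
candidates = [
    frozenset({"beer"}),
    frozenset({"beer", "chips"}),
    frozenset({"chips", "salsa"}),
    frozenset({"beer", "chips", "salsa"}),
]

c = all_of(max_size(2), must_contain_any(frozenset({"beer"})))
print(satisfying(candidates, c))
```

Used this way the constraint acts as a post-processing filter. Pushing the same predicate inside the mining loop, so that non-satisfying candidates are never counted, is what reduces processing time, and whether that is sound depends on properties of the constraint that this sketch does not check.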
The Onto4AR framework [4] is an exception: it aims to provide the means to define constraints that reduce the number of discovered patterns and, simultaneously, improve processing times. Before reviewing the framework, let us overview the relevant notions in ontologies and knowledge bases.

2.2 Ontologies and Knowledge Bases

The development of the Semantic Web contributed considerably to advancing the area of ontologies, which are now commonly accepted as a means to represent and share existing knowledge.

An ontology is an explicit specification of a conceptualization [9], aiming to describe a set of representational primitives with which to model a domain of knowledge or discourse [10]. In other words, an ontology is a specification of an abstract, simplified view of a domain. Formally, it corresponds to a triple O := {C, R, AO}, where C is a set of concepts, which represent the entities in the domain; R is a set of attributes that characterize the concepts; and AO is a set of axioms that describe constraints on the ontology, making implicit facts explicit. Among the elements of R there are relations, a particular case of attributes whose range is a concept. The is-a relation is a generic relation that establishes that one concept (say c1) is a sub-concept of another one (say c2), or in other words, that c2 is a parent of c1.

The counterpart of ontologies are knowledge bases, which specify the known members of each concept in a particular ontology. A knowledge base is a tuple KB := {O, I, inst⁻¹}, where O is an ontology as defined above, I is a set of instances, and inst⁻¹ is a function from I to C in ontology O that identifies the relation among instances and concepts. Actually, inst⁻¹ is just the inverse of the concept instantiation function. Note that while an ontology describes generic elements, by defining their properties and describing the relations among them, a knowledge base defines a set of instances, which can be understood in that context.

2.3 The Onto4AR framework

The Onto4AR framework [4] was proposed to address the problem of transactional pattern mining in the presence of domain knowledge. It was centered on the use of an ontology and assumed a new formulation of the problem, where the meaning of an item was clearly defined in the context of the ontology and the mining process was guided by a constraint. In its origin, the Onto4AR framework proposed several types of constraints, from interestingness measures, such as support and confidence, to content constraints based on the relations defined in the ontology.

Applications developed under the Onto4AR framework showed that axiomatic constraints could also play an important role in the mining process, delegating the meaning of equality to the ontology, in particular to its axioms [11]. In this context, an item x occurs in a transaction T if it is equal to some element of T, where the equality among concept members is defined by some of the axioms in the ontology. In [12] the meaning of pattern was clarified by making use of abstract instances, that is, instances that may have some of their key attributes missing. In this manner, patterns correspond to abstractions of instances, making it possible to discover non-complete information.

However, despite those efforts, the relation among the elements of the mining problem and the knowledge representation remained diffuse and needed to be clarified. Moreover, the framework did not contemplate patterns other than transactional ones. Finally, the available algorithm for mining transactional patterns in this context, D2Apriori [12], suffers from a large consumption of memory resources, since it is not able to deal with patterns at the right level of abstraction.

3 The D2PM framework

The D2PM framework was born from that need, and pursues four goals: to clarify the correspondence among items / itemsets and concepts / instances; to precisely define what a pattern is; to deal with structured pattern mining in its simplest form, sequences; and to provide efficient algorithms for working in the framework. Next we propose the formulation of transactional pattern mining in the presence of a domain ontology.

3.1 Problem formulation

As in the Onto4AR framework, the problem of discovering patterns in a transactional dataset involves the existence of an ontology and some constraint. Let KB := {O, I, inst⁻¹} be a knowledge base and O := {C, R, AO} a domain ontology as defined above. Consider A the subset of R that excludes the relations, that is, the set of attributes whose range is not a concept. Let F be the set of features that includes all possible values for the attributes in A.
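The elements just introduced (O, KB, A and F) can be pictured with a few plain data structures. The sketch below is only one possible encoding and rests on several assumptions not stated in the text: features are represented as (attribute name, value) pairs, "all possible values" is approximated by the values observed in the knowledge base, and the class and field names (`Attribute`, `Ontology`, `KnowledgeBase`, `inst_inv`) as well as the example data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Attribute:
    name: str
    domain: str  # concept that the attribute characterizes
    range: str   # a concept name (then it is a relation) or a datatype label

@dataclass
class Ontology:
    concepts: set                               # C
    attributes: list                            # R
    axioms: list = field(default_factory=list)  # AO

    def non_relational(self):
        """A: attributes in R whose range is not a concept."""
        return [a for a in self.attributes if a.range not in self.concepts]

@dataclass
class KnowledgeBase:
    ontology: Ontology
    instances: dict  # I: instance id -> {attribute name: value}
    inst_inv: dict   # inst^-1: instance id -> concept

    def features(self):
        """F, approximated by the observed values of the attributes in A."""
        names = {a.name for a in self.ontology.non_relational()}
        return {(n, v)
                for attrs in self.instances.values()
                for n, v in attrs.items() if n in names}

# Hypothetical example: products and brands.
onto = Ontology(
    concepts={"Product", "Brand"},
    attributes=[
        Attribute("name", "Product", "str"),
        Attribute("category", "Product", "str"),
        Attribute("madeBy", "Product", "Brand"),  # relation: range is a concept
        Attribute("country", "Brand", "str"),
    ],
)
kb = KnowledgeBase(
    ontology=onto,
    instances={
        "p1": {"name": "Heineken Lager", "category": "beer", "madeBy": "b1"},
        "b1": {"country": "NL"},
    },
    inst_inv={"p1": "Product", "b1": "Brand"},
)

print([a.name for a in onto.non_relational()])
print(sorted(kb.features()))
```

In this encoding the relation `madeBy` is excluded from A because its range is the concept `Brand`, so its value never appears among the features.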

As in the original formulation of transactional pattern mining, let LD = {i1, i2, …, im} be the set of items that appear in D, the dataset containing a set of transactions, with each transaction T being a set of items such that T ⊆ LD. In order to address items in the context of the available domain knowledge, through ontology O and knowledge base KB, consider a d2item to be a link to a feature from F, and L to be the set of all d2items. In this new context, an item is just a d2item that occurs in a dataset D, and LD is just a subset of L (LD ⊆ L). Similarly, consider a d2itemset to be a set of d2items, and an itemset to be a d2itemset whose elements are all items, which means that all of its elements occur in the dataset.

Now let c be a constraint, which implements a predicate over d2itemsets, or in other words, a predicate on the powerset of the set of d2items L. A d2itemset X is said to occur in D under c if and only if some transaction T in D contains X under c; in turn, T is said to contain X under c if for every d2item x belonging to X there is an item t in T such that t is equal to x in the context of c. Given an ontology O, a dataset D and a constraint c, the problem of mining all patterns in D under c corresponds to the discovery of the set of all d2itemsets that occur in D under c and satisfy c, known as d2patterns.
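The definitions of "contains under c" and "occurs under c" can be read almost directly as code. The sketch below is a naive enumeration, not one of the framework's algorithms, and it assumes one particular kind of constraint: the hypothetical `IsAConstraint` treats an item as equal to a d2item when the d2item is the item itself or one of its ancestors in an acyclic is-a hierarchy, and it is satisfied by d2itemsets up to a maximum size. All names and the toy data are assumptions made for illustration.

```python
from itertools import combinations

class IsAConstraint:
    """Hypothetical constraint: equality follows is-a links, satisfaction bounds size."""

    def __init__(self, parent, max_size):
        self.parent = parent      # child d2item -> parent d2item (assumed acyclic)
        self.max_size = max_size

    def ancestors(self, item):
        """The item itself followed by its chain of parents."""
        chain = [item]
        while chain[-1] in self.parent:
            chain.append(self.parent[chain[-1]])
        return chain

    def equal(self, t, x):
        """Item t is equal to d2item x if x is t or one of t's ancestors."""
        return x in self.ancestors(t)

    def satisfies(self, d2itemset):
        return len(d2itemset) <= self.max_size

def contains_under(T, X, c):
    """T contains X under c: every x in X is matched by some item t in T."""
    return all(any(c.equal(t, x) for t in T) for x in X)

def occurs_under(D, X, c):
    """X occurs in D under c: some transaction of D contains X under c."""
    return any(contains_under(T, X, c) for T in D)

def d2patterns(D, L, c):
    """All non-empty d2itemsets over L that occur in D under c and satisfy c."""
    found = []
    for k in range(1, len(L) + 1):
        for combo in combinations(sorted(L), k):
            X = frozenset(combo)
            if c.satisfies(X) and occurs_under(D, X, c):
                found.append(X)
    return found

# Hypothetical d2items, hierarchy and dataset.
L = {"heineken_lager", "heineken", "beer", "chips", "wine"}
parent = {"heineken_lager": "heineken", "heineken": "beer"}
D = [{"heineken_lager", "chips"}]
c = IsAConstraint(parent, max_size=2)

for pattern in d2patterns(D, L, c):
    print(sorted(pattern))
```

With this constraint the d2itemset {beer, chips} occurs in the dataset even though no transaction mentions `beer` literally, while no d2itemset containing `wine` is reported. The exhaustive enumeration is exponential in the size of L and only serves to mirror the definition.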

[Figure 1: UML class diagram. Classes: Ontology, Knowledge Base, Constraint, Dataset, Axiom, Attribute, Relation, Concept, Instance, Feature, D2Item, D2Itemset, D2Pattern, Item, Itemset. Named associations: domain, range, instantiates, set of. Diagram stamp: Wednesday, November 16, 2011, by Cláudia Antunes.]

Figure 1 Elements in the pattern mining process in the D2PM framework

Figure 1 illustrates in UML the relations among elements in the ontology and datasets, where general relations are represented as red named lines, inheritance is represented by an empty large arrow and composition by diamond arrows. Green shapes represent elements of transactional datasets, blue shapes denote elements in the ontology and knowledge base, and brown shapes express elements that generalize dataset elements, to be present in patterns in the new context.

3.2 Constraints in the D2PM framework

Constraints are the key to the process of mining in the presence of domain knowledge, since they establish how this knowledge will be used in the process. Moreover, constraints play many roles: they filter out uninteresting itemsets, define what makes a d2itemset a d2pattern, and map items in the dataset to features in the knowledge base. For this reason, we will concentrate our efforts on constraints in the next research steps.

4 Conclusions

D2PM extends the Onto4AR framework, clarifying the relations among items in the dataset and features in the knowledge base. This clarification allows for a more efficient implementation of the available algorithms, reducing both their time and memory consumption, since they deal with fewer d2itemsets to discover the same patterns. The new formulation brings another advantage: its extension to events and structured pattern mining becomes easier. This is due to its simplicity, and to the fact that both events and structures can be defined using itemsets and d2itemsets as bricks in a more complex structure.

References

1. Yang, Q., Wu, X.: 10 Challenging Problems in Data Mining Research. Int'l Journal of Information Technology & Decision Making 5(4), 597–604 (2006)
2. Garofalakis, M., Rastogi, Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraint. In Atkinson, M., Orlowska, M., Valduriez, P., Zdonik, S., Brodie, M., eds.: Int'l Conf. Very Large Databases, Edinburgh, Scotland, pp.223-234 (1999)
3. Antunes, C., Oliveira, A.: Constraint Relaxations for Discovering Unknown Sequential Patterns. In Goethals, B., Siebes, A., eds.: Knowledge Discovery in Inductive Databases: Third International Workshop, KDID 2004 (Revised Selected and Invited Papers) LNCS 3377. Springer, New York (September 2005) 11-32
4. Antunes, C.: An Ontology-based Framework for Mining Patterns in the Presence of Background Knowledge. In: Int'l Conf. on Advanced Intelligence, Beijing, China, pp.163-168 (October 2008)
5. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In Buneman, P., Jajodia, S., eds.: SIGMOD Conference, Washington, D.C., pp.207-216 (1993)
6. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Int'l Conf on Very Large Data Bases, Santiago de Chile, Chile, pp.487-499 (1994)
7. Bayardo, R.: The Many Roles of Constraints in Data Mining. SIGKDD Explorations 4(1), i-ii (2002)
8. Hipp, J., Güntzer, U.: Is pushing constraints deeply into the mining algorithms really what we want?: an alternative approach for association rule mining. SIGKDD Explorations 4(1), 50-55 (2002)
9. Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2), 21–66 (1998)
10. Gruber, T.: Ontology. Springer-Verlag (2009)
11. Antunes, C.: Mining Patterns in the Presence of Domain Knowledge. In: Int'l Conf on Enterprise Information Systems, Milan, Italy, pp.188-193 (May 2009)
12. Antunes, C.: Pattern Mining over Star Schemas in the Onto4AR Framework. In Saygin, Y., Yu, J., Kargupta, H., Wang, W., Ranka, S., Yu, P., Wu, X., eds.: 2nd Int'l Workshop on Semantic Aspects in Data Mining in the Int'l Conf on Data Mining, Miami, pp.453-458 (December 2009)
13. Češpivová, H., Rauch, J., Svátek, V., Kejkula, M., Tomečková, M.: Roles of Medical Ontology in Association Mining CRISP-DM Cycle. In: Workshop on Knowledge Discovery and Ontologies in European Conf on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy (September 2004)
14. Antunes, C., Oliveira, A.: Inference of Sequential Association Rules Guided by Context-Free Grammars. In: Grammatical Inference: Algorithms and Applications, 6th Int'l Colloquium, Amsterdam, The Netherlands, vol. LNCS 2484, pp.1-13 (September 2002)
15. Srikant, R., Agrawal, R.: Mining association rules with item constraints. In: Int'l Conf Knowledge Discovery and Data Mining, Newport Beach, pp.67-73 (August 1997)
