A Web Service Mining Framework


George Zheng† and Athman Bouguettaya†‡
† Virginia Tech, Blacksburg, Virginia, USA
‡ CSIRO ICT Centre, Canberra, ACT, Australia
{gzheng,athman}@vt.edu

2007 IEEE International Conference on Web Services (ICWS 2007) 0-7695-2924-0/07 $25.00 © 2007

Abstract

We propose a service mining framework for exploring interesting compositions of existing Web services. The framework first screens Web services for composition leads using a “coarse-grained” filtering approach. It then verifies these leads based on runtime conditions. Top candidates are selected from the verified leads and evaluated for their interestingness. We present algorithms to automate the screening phase of the framework. Finally, we study the effects of key variables on the interestingness of lead compositions. As a motivating example, we apply these algorithms to the field of biological pathway discovery and rely on knowledge obtained from reverse engineering online resources to assess their effectiveness.

1 Introduction

The Web is poised to transition from a data Web to a service Web in which Web applications, also known as Web services, are first-class objects. As Web service technology continues to mature, a large number of Web services are expected to be deployed on the Web. Their growing availability is also expected to lead to the natural next step in the evolution of Web services, creating both the need and the opportunity to break new ground in Web service mining, much as easy access to a glut of data provided fertile ground for data mining research. We define Web service mining as a search process aimed at discovering unanticipated and interesting compositions of Web services. We believe that Web service mining will be key to leveraging the large investments in applications that have so far operated as non-interoperable silos. In anticipation of the need and opportunities to mine Web services, our research focuses on developing a framework that facilitates the related mining activities. Our objectives are to: 1. identify the types of activities involved in the mining process; 2. develop effective strategies to streamline, and efficient algorithms to automate, many of these activities; 3. develop measures that can be used to objectively evaluate the interestingness of the mining results; and 4. demonstrate the usefulness of the framework through a motivating use case.

A key characteristic distinguishing Web service mining from traditional Web service composition approaches, as governed by standards such as WSFL, XLANG, BPEL4WS, DAML-S and OWL-S, is that Web service mining is driven by the desire to find any unanticipated and interesting compositions of existing Web services. Traditional composition approaches are usually driven by a top-down strategy, which first requires a user to provide a goal containing a fixed set of specific criteria and then uses these criteria to search for matching component Web services. Since the goal provided by the user already implies what type of compositions the user anticipates, the evaluation of composition interestingness is not a major concern in these approaches. In the absence of such a top-down query, Web service mining techniques need to address how the interestingness of service compositions can be determined. The lack of specific goals in Web service mining also lends itself naturally to a bottom-up strategy. The simplest approach following this strategy would be an exhaustive search for composability between all Web services. This approach does not scale well, since it inevitably results in a “combinatorial explosion” when faced with a large number of Web services. As we look for efficient techniques, similarities between Web services and molecules offer some interesting insights. Like a molecule, a Web service has both attributes and dynamic behaviors. Like a molecule formed from constituent atoms and/or simpler molecules, a composite service is composed of component services. Under the right conditions, certain molecules can recognize each other and form bonds. The concept of recognition can be extended to the Web service world to help devise mining techniques for Web services.

The similarity between molecules and Web services also motivated us to apply our mining framework back to the field of bioinformatics. As a result, the discovery of biological pathways emerged as a natural application of our mining framework. The idea is to model biological entities as individual Web services [4] and use the mining framework to help discover linkages across isolated lab findings as captured in these models. The mining of these linkages (i.e., pathways) through the vehicle of Web service models is expected to complement and, when enough details are captured in the models, present an inexpensive and accessible alternative to existing in vitro and/or in vivo exploratory mechanisms. The discovery of these pathways is expected to deepen our understanding of how diseases come about and help expedite drug discovery. Although our experiments are limited in scope, we have confirmed the effectiveness of our screening algorithms in identifying potential pathway leads. The verification and evaluation of these leads are ongoing work.

We organize the remainder of the paper as follows. In Section 2, we present our mining framework and introduce several concepts used as its basis. In Section 3, we describe in detail the generation of the focused library and our screening algorithms. In Section 4, we use simulation to study the effects of key variables on the interestingness of lead compositions. In addition, we show how the framework can be used to mine pathways linking biological Web services provided by entities such as Aspirin. We conclude the paper with a discussion of future work in Section 5.

2 Web Service Mining Framework

Our Web service mining framework, shown in Figure 1, can be figuratively described using a “sow, grow, weed and harvest” analogy. The framework starts with scope specification (left of figure), in which a domain expert defines the context of mining. We expect the domain expert to have a general idea of the “seeds”, i.e., the Web service functional areas (e.g., cell enzyme and drug functions) that he or she is interested in mining. Such seeds are expected to grow into fruitful compositions (e.g., Aspirin pathways) as the mining progresses. Weights may be assigned to these seeds to differentiate the user’s interest in them and to stimulate the growth of compositions encompassing the corresponding functional areas. To curb combinatorial explosion, the mining context is used in the search space determination phase to define a focused library of existing Web services as the initial pool for further mining. Web services in the focused library are then filtered through the screening phase (or the growing phase), which identifies potentially interesting composition leads. This is achieved using a “coarse-grained” ontology-based filtering mechanism (see Section 3.2), which inspects only a subset of matching Web service characteristics (i.e., operation signature, message semantics) that can be quickly processed. The composition leads are then verified in the verification phase (or the weeding phase) using invocation plans and additional matching characteristics such as runtime conditions, which require more intensive computations. The evaluation phase (or the harvest phase) evaluates the interestingness of the initial invocation plans proposed by the verification module, devises modifications to the plans, and directs the verification module to verify the validity of these modified plans.

Figure 1. Web Service Mining Framework (diagram labels: Scope Specification, Assign Weights, Domain Ontologies, Locale, Mining Context, Search Space Determination, Focused Library, Screening, Lead Compositions, Verification, Conditions, Invocation Plans, Evaluation, Bottom-up Mining, Interesting Compositions/Pathways of Web Services; Service Registry with Find and Publish, Service Consumers, Service Providers)

2.1 Mining Concepts

In this section, we describe the service ontology and several other concepts that serve as the basis of our mining framework. These include operation/service recognition, used in our screening algorithms, and operation similarity and interestingness, two related concepts used in the evaluation phase to objectively measure how interesting a composition really is. We use Figure 2, which contains Web service models of biological entities, to illustrate some of these concepts.

Figure 2. Web Service Models (diagram: the Web service registry; domain ontology indices such as NSAIDs, Enzymes and Fatty Acids; and Web service models of biological entities, e.g., Aspirin with operation block COX1, COX1 with operation produce PG, Stomach Cell with operation produce mucus, and Mucus with operation cover stomach wall. A NodeAgent keeps track of parameter pub/sub in Alg-2 and Alg-3 as described in Section 3.2.)

2.1.1 Web Service Ontology

We rely on OWL-S to define our Web services with a WSDL grounding. We refer to the applicability contained in the OWL-S service profile as locale in this paper. To recognize the fact that certain services (e.g., payment) may be involved in multiple OWL-S categories of services (e.g., healthcare, travel, legal), we use the concept of domain to group relevant operations, or more appropriately, operation interfaces, into the same category. An operation interface specifies a shared functionality implemented by operations from different Web services. A Web service’s involvement with a domain is reflected by whether it supplies or consumes an implementation of an operation interface in such a domain. We assemble a hierarchy of indices (middle of Figure 2) to existing domain ontologies (e.g., Enzymes) to unambiguously categorize the types of operation inputs and outputs.
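The notions of operation interface, domain and locale introduced above can be pictured with a small data model. The following is only a minimal sketch under our own assumptions: the class names, the `domains_of` helper and the clinic example are hypothetical and are not taken from the paper or from OWL-S.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OperationInterface:
    """A shared functionality, grouped into exactly one domain (assumption)."""
    name: str
    domain: str


@dataclass
class Operation:
    name: str
    implements: frozenset = frozenset()  # interfaces this operation supplies
    consumes: frozenset = frozenset()    # interfaces this operation invokes


@dataclass
class Service:
    name: str
    locale: str                          # applicability from the service profile
    operations: list = field(default_factory=list)


def domains_of(service):
    """Domains a service is involved in: it supplies or consumes an
    implementation of an operation interface in that domain."""
    return {intf.domain
            for op in service.operations
            for intf in op.implements | op.consumes}


# Hypothetical example: a clinic service supplies a healthcare interface
# and consumes a payment interface, so it is involved in both domains.
book = OperationInterface("book appointment", domain="healthcare")
pay = OperationInterface("pay", domain="payment")
clinic = Service("Clinic", locale="Blacksburg",
                 operations=[Operation("book", implements=frozenset({book}),
                                       consumes=frozenset({pay}))])
print(sorted(domains_of(clinic)))
```

Keeping the domain on the interface rather than on the service mirrors the point made above: one service may take part in several domains through the interfaces it supplies or consumes.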

2.1.2 Recognition and Composition

Much like molecules in the natural world, which can recognize each other and form bonds [2], Web services and operations can recognize each other through both syntax and semantics. Consequently, they can compose and bring about potentially interesting behaviors. We identify two types of operation recognition, direct and indirect recognition, and two types of service recognition, promotion and inhibition.

Direct Recognition. A direct recognition is established between operations op_a and op_b if op_a consumes an operation interface op_intf that is implemented by op_b. In addition, op_a and op_b must be mode, binding and message composable [3].

Indirect Recognition. A target operation op_t indirectly recognizes a source operation op_s if op_s generates some or all input parameters of op_t. An example of indirect recognition is shown in Figure 2 between operation produce mucus from the Stomach Cell service and polymorphic operation produce PG from the service of a type of enzyme called COX1. We use the term indirect to indicate that there is a potential need to relay parts of the output message from op_s to parts of the input message of op_t at the composition level. A bond is established between op_s and op_t for each input parameter op_t can receive from op_s. We denote the set of bonds between op_s and op_t as B(op_s → op_t). If we refer to the set of all operations that op_t recognizes as OP_s(→ op_t), then

$$OP_s(\to op_t) = \{\, op \mid B(op \to op_t) \neq \emptyset \,\} \tag{1}$$

Promotion. When operation op_1 of service s_a produces an entity (i.e., output parameter) that in turn provides service s_b, we say that s_a:op_1 promotes s_b. In a bioinformatic setting, an increase in the quantity of a service-providing entity increases the availability of the service. For example, the quantity of Mucus is increased by (Stomach Cell service:produce mucus) in Figure 2. Consequently, the Mucus service becomes more available on the wall of the stomach, which becomes better protected from erosion and ulcers caused by the gastric juice present there. Thus by definition, (Stomach Cell service:produce mucus) promotes the Mucus service.

Inhibition. Similarly, when operation op_1 of service s_a consumes an entity (i.e., input parameter) that in turn provides service s_b, we say that s_a:op_1 inhibits s_b. Figure 2 shows an example of inhibition between Aspirin service:block COX1 and the COX1 service.

For promotion, inhibition and indirect recognition, we identify three types of matching between parameters p_1 and p_2, whose data types refer to domain ontology index nodes n_a and n_b, respectively:

• Exact match: n_a = n_b
• Is-a: n_a is a child of n_b
• Has-a: n_a has a component n_b

We assume that the above relationships among parameter types are already declared in domain ontologies and thus can be automatically detected.

Composition Validity. Various measures [3] have been proposed to determine whether two operations are composable at both syntactic and semantic levels. These measures can be used to determine whether a direct recognition-based composition is actually valid. Promotion and inhibition-based compositions are valid because the entities of interest provide the corresponding services by declaration. In this section, we focus on how the validity of an indirect recognition-based composition can be determined in the verification phase. We denote by comp(OP_s, op_t) an operation composition involving a set of source operations OP_s providing input parameters to target operation op_t, where OP_s ⊂ OP_s(→ op_t). In order for comp(OP_s, op_t) to be valid, the following must be true:

$$\forall op_s \in OP_s,\ \Gamma[op_s.L,\ op_t.L] \neq 0 \tag{2}$$

In Eq. (2), Γ is a domain expert-determined correlation function that measures the relevancy (i.e., 1 for the same and non-zero for related) of two locales. Eq. (2) states that in order for the composition to be valid, each of the source operations must have a locale that correlates with that of the target operation. A relevant bioinformatic example would be to make drug molecules effective (or compose with disease-causing molecules) by sending them to where the disease cells are located.

If we use B^t_selected(op_s → op_t) to denote the set of op_t’s input parameters (i.e., target parameters) covered by a selected subset of B(op_s → op_t) for comp(OP_s, op_t), B^s_selected(op_s → op_t) the corresponding output parameters from op_s, and EA_comp the external attributes involved in comp(OP_s, op_t), then in order for the composition to be valid, the following must also be true:

$$\exists F(OP_s) : \forall B_{selected}(op_s \to op_t) \text{ where } op_s \in OP_s,$$
$$\bigcup_{op_s \in OP_s,\ f \in F(OP_s)} \{\, C_{post}(f(op_s)) \text{ on } B^{s}_{selected}(op_s \to op_t) \text{ and } EA_{comp} \,\} \supset \{\, C_{pre}(op_t) \text{ on } B^{t}_{selected}(op_s \to op_t) \text{ and } EA_{comp} \,\} \tag{3}$$

F(OP_s) refers to composition-specific mediations that need to be applied to OP_s so that the combined post-conditions C_post of OP_s cover the space carved out by the pre-conditions C_pre of op_t when every B^s_selected(op_s → op_t) is replaced by the corresponding B^t_selected(op_s → op_t). Consequently, invocation of op_t is activated.

2.1.3 Operation Similarity

The concept of operation similarity is relevant when we study the interestingness of an indirect recognition-based composition. The similarity of two operations can be measured by comparing their input parameter sets, output parameter sets, pre-conditions and post-conditions. We use the following function to measure the similarity between op_i and op_j:

$$Sim(op_i, op_j) = c_p \left( \frac{|P_{in}(op_i) \cap P_{in}(op_j)|}{|P_{in}(op_i) \cup P_{in}(op_j)|} \times \frac{|P_{out}(op_i) \cap P_{out}(op_j)|}{|P_{out}(op_i) \cup P_{out}(op_j)|} \right) + c_c \left( \frac{|C_{pre}(op_i) \cap C_{pre}(op_j)|}{|C_{pre}(op_i) \cup C_{pre}(op_j)|} \times \frac{|C_{post}(op_i) \cap C_{post}(op_j)|}{|C_{post}(op_i) \cup C_{post}(op_j)|} \right) \tag{4}$$

where c_p and c_c are weights such that 0 ≤ c_p, c_c ≤ 1 and c_p + c_c = 1. |P| and |C| give the size of parameter set P and condition set C, respectively. According to Eq. (4), Sim(op_i, op_j) ranges from 0 to 1, with 1 indicating that the two operations have the same parameters and conditions.

2.1.4 Interestingness

In the context of Web service mining, interestingness indicates how interesting a Web service composition is. A direct recognition-based composition is interesting if it exhibits better qualities than all previously discovered similar operations. For indirect recognition, promotion and inhibition-based compositions, interestingness may be less certain, and subjective knowledge from a domain expert may be needed to help with the determination. Because the number of such compositions can be large, we devise the following objective measure of interestingness, I, to help reduce the candidate pool for final consideration:

$$I = A\,(c_n N + c_s S) \prod_{i=1}^{m} w_i \tag{5}$$

where

• A is the actionability of the composition,
• N is the novelty of the composition,
• S is the surprisingness of the composition,
• c_n and c_s are weights such that 0 ≤ c_n, c_s ≤ 1 and c_n + c_s = 1,
• m is the number of expert-assigned weights w_i (w_i > 1) to operation interfaces and domain ontology index nodes that are involved in a composition. We choose to multiply all such weights involved in a composition to reflect their subjective interestingness-enhancing effect.

Actionability. We define actionability as a binary value (i.e., 1 for actionable, 0 for non-actionable) representing whether the composability of a composition can be verified through simulation or live execution. A non-actionable composition is considered uninteresting. Thus actionability contributes multiplicatively towards the overall interestingness.

Novelty. Novelty, N, measures how unique and new a composition is. We use the following function to calculate novelty:

$$N = \begin{cases} 1, & \text{promot. or inhibit.} \\ 1 - \max_{op \in D} Sim(comp(OP_s, op_t), op), & \text{indirect recog.} \end{cases} \tag{6}$$

For both promotion and inhibition, the novelty is set to 1 due to the validity of the composition. For indirect recognition, D is a reference set of domains. The more similar the composed operation is to an existing operation, the less novel it is regarded. In this case, N can vary between 0 and 1. Since D needs to be large enough to ensure the uniqueness of the composition, checking novelty in the case of indirect recognition for all compositions found in D could be a very expensive task. For this reason, we carry it out in the final phase of the mining process, where the number of leads is presumably small.

Surprisingness. Surprisingness, S, indicates how unexpectedly a Web service composition is achieved. We use the following objective function to measure it:

$$S = \begin{cases} \dfrac{\varepsilon}{\min_{d_s \in D(s)} \Gamma[d(op), d_s]}, & \text{promot. or inhibit.} \\[2ex] \dfrac{\varepsilon \times n_d}{N_t \times \min_{op \in OP_s} \Gamma[d(op), d(op_t)]}, & \text{indirect recog., } n_d \le N_t \\[2ex] \dfrac{\varepsilon}{\min_{op \in OP_s} \Gamma[d(op), d(op_t)]}, & \text{indirect recog., } n_d > N_t \end{cases} \tag{7}$$

For promotion or inhibition involving op and s, D(s) is the set of domains s is involved in. For indirect recognition, n_d is the number of different domains involved in the composition, and N_t is the expected maximum number of domains a composition may have, used for normalization purposes. In all cases, d() gives the domain of a given service operation. ε is a small number (0 < ε < 1) used to normalize the value of S. Γ is a correlation function that measures the relevancy of two domains or the cohesion of the same domain (when i = j). It is defined for domains d_i and d_j as:

$$\Gamma[d_i, d_j] = e^{-\frac{1}{\varepsilon_0 (n+1)}} \tag{8}$$

where n is the number of unique pairs of operations, {(op_i, op_j) | op_i ∈ d_i, op_j ∈ d_j}, that are previously known to have been involved in a composition. When n = 0, the correlation between two domains in Eq. (8) takes the initial value ε (ε = e^{-1/ε_0}). This helps bound the surprisingness of a service composition in Eq. (7). Eq. (8) also shows that Γ for two domains quickly approaches 1 as n increases. A composition achieved by combining component services from relatively few domains, or from domains previously known to be very relevant, is less surprising than one from a diverse set of domains or from domains previously known to be less relevant.

Eq. (7) aims at objectively measuring surprisingness. However, surprisingness is sometimes subjective, i.e., the user evaluating it may choose to use subjective measures. The reference base of such measures may be personal knowledge, belief, bias and needs. Unfortunately, approaches based solely on subjective measures tend to keep us from finding interesting compositions that were not thought of. An extreme case of relying on subjective measures to carry out Web service mining is the traditional composition approach, where the user issues a query specifying the composition in pursuit to start the search process. A reasonable compromise between a purely objective approach and a purely subjective approach would be to use a Bayesian approach to refine the subjective reference base and converge it to the reality of composition opportunities. We envision that this involves an iterative process of the following steps:

• detection of the potential presence of a composition,
• conception of an invocation plan,
• execution of the invocation plan,
• evaluation of execution results,
• modification and re-execution of invocation plans if necessary.

We focus on objective measures in our research but take advantage of subjective measures at the beginning of our mining process, when they might be considered necessary to bootstrap the process.

3 Framework Details/Algorithms

In this section, we describe in detail the generation of the focused library and our screening algorithms.

3.1 Focused Library Generation

The mining process starts with a domain expert specifying a scope for the mining activity. The scope specification phase takes advantage of necessary subjective interestingness measures when defining the locale of interest and a list of domains to be considered for mining. For example, the locale may be the brain and the domains may include cell enzyme functions and drug functions. Based on such mining interest, the scope of the mining, or mining context, can be determined using:

$$C = \{\, d(L) \mid d \in D \,\} \tag{9}$$

where

• D is a set of Web service domains,
• L is a set of locale attributes of mining interest,
• d(L) is a domain carved out by L.

Consequently, the set of all operation interfaces included in C is denoted as OP_intf(C):

$$OP_{intf}(C) = \{\, op_{intf} \mid \exists d \in D \wedge op_{intf} \in d(L) \,\} \tag{10}$$

Since different domains may rely on different ontologies to describe relevant concepts or constructs within them, the specification of the mining context essentially determines a set of domain ontologies (e.g., NSAIDs, Enzymes in Figure 2) to use for the mining process. Assume R is the Web service registry. The focused library of Web services, F, can be calculated using:

$$F = \{\, s \mid s \in R \wedge (s.Operations \cap OP_{intf}(C) \neq \emptyset \ \vee\ \exists op \in s.Operations : op_{consume}(OP_{intf}) \cap OP_{intf}(C) \neq \emptyset) \,\} \tag{11}$$

3.2 Screening

To address the combinatorial explosion problem mentioned earlier, our screening phase uses a publish/subscribe mechanism to convert the traditional combinatorial search problem into a spontaneous operation recognition problem. This is achieved in two steps: operation level filtering and parameter level filtering. We list algorithms for both in Figure 3 Alg-1.

Operation Level Filtering. At the operation level, operation interfaces within the mining context serve as the medium for Web service operations to plug into each other. Figure 3 Alg-1 (a) shows the operation level filtering algorithm. Service operations that implement a particular interface publish their implementation through that interface (lines 06 - 09). Service operations that need to invoke the implementation of an interface subscribe to that interface (lines 10 - 15). An operation agent is created (lines 01 - 03) for each operation interface to keep track of references to it from various operations. When publishing an operation that implements an interface, function publish(op) of the corresponding agent checks whether there is any subscriber to the interface. If so, it tries to establish a service composition lead using direct recognition between the publisher and the subscriber. Similarly, when an operation subscribes to an interface that it consumes, function subscribe(op) checks whether there is any publisher that implements the interface. If so, it tries to establish a service composition lead between the subscriber and the publisher.

Parameter Level Filtering. We distinguish two types of mining: fixed scope mining and incremental mining. In fixed scope mining, the parameter level filtering is triggered after all the Web services in the focused library are introduced (lines 18 - 37 in Figure 3 Alg-1). Fixed scope mining can be used when the mining context is clearly defined and the search space can be easily determined. In incremental mining, instead of identifying OP_intf(C) before introducing Web services into the mining process, OP_intf(C) grows as Web service operations are identified and introduced. Incremental mining is more flexible than fixed scope mining, since it does not require a predefined mining context. While it may involve a more diverse range of Web services and thus take longer during the screening phase, incremental mining offers a greater potential of discovering more interesting compositions than fixed scope mining.

Function generateLeads() generates a lead tree rooted at operation op_intf, listing as its child nodes operations whose output parameters match its input parameters. Figure 3 Alg-2 and Alg-3 show the algorithms used by an ontology index node agent to register the publication of an output parameter (publish(p_out)) or the subscription of an input parameter (subscribe(p_in)). Within the ontology index hierarchy as shown on the right of Figure 2, publication and subscription on a node can sometimes propagate to other nodes. This happens when the node is involved in an inheritance or compositional relationship with other nodes. In general, publication propagates down a composition tree (lines 02 to 09 in publish(p_out)) and up an inheritance tree (lines 10 to 17 in publish(p_out)), while subscription propagates up a composition tree (lines 02 to 09 in subscribe(p_in)) and down an inheritance tree (lines 10 to 17 in subscribe(p_in)) in the ontology hierarchy. Note that to help reduce overhead, lines 04-06 and 12-14 in both publish(p_out) and subscribe(p_in) instantiate a node agent only when the node is referenced by at least one parameter.

3.2.1 Performance Analysis

We compare the computational complexity of the screening algorithms against a naive exhaustive search algorithm using operation recognition at both the operation and parameter levels. Table 1 lists relevant variables used in our complexity analysis.

Table 1. Symbols and Parameters

Variables
  N_op     Number of operation interfaces in the mining context
  N_pin    Average # of input parameters to an operation
  N_pout   Average # of output parameters from an operation
  N_ws     Number of Web services in the focused library
  N_si     Average # of operation interfaces each Web service implements
  N_oc     Average # of operation interfaces each operation consumes
  |Ont|    Size of domain ontologies
Performance measurement parameters
  T_op     Time for operation filtering
  T_mp     Time for message/parameter filtering
  T        Total screening time (T = T_op + T_mp)

Alg-1: Operation and Parameter Level Filtering
Input: Context operation interfaces OP_intf(C), focused library F, ontology O.
Output: Leads of composed Web services L.
Variables: Leads from publication and subscription L_ps; operation interfaces consumed by op, op_consume(OP_intf).
(a) Operation Level Filtering
(01) For each op_intf ∈ OP_intf(C)
(02)   create Agent(op_intf);
(03) EndFor
(04) For each s ∈ F
(05)   For each op ∈ s.Operations
(06)     If (∃op_intf ∈ OP_intf(C): op impl op_intf)
(07)       L_ps ← Agent(op_intf).publish(op);
(08)       L.add(L_ps);
(09)     EndIf
(10)     For each op_intf ∈ op_consume(OP_intf)
(11)       If (op_intf ∈ OP_intf(C))
(12)         L_ps ← Agent(op_intf).subscribe(op);
(13)         L.add(L_ps);
(14)       EndIf
(15)     EndFor
(16)   EndFor
(17) EndFor
(b) Parameter Level Filtering
(18) For each op_intf ∈ OP_intf(C)
(19)   For each p_out ∈ op_intf.message_out
(20)     k ← type(p_out);
(21)     If (k ∈ O)
(22)       If (¬∃Agent(k))
(23)         create Agent(k);
(24)       EndIf
(25)       Agent(k).publish(p_out);
(26)     EndIf
(27)   EndFor
(28)   For each p_in ∈ op_intf.message_in
(29)     k ← type(p_in);
(30)     If (k ∈ O)
(31)       If (¬∃Agent(k))
(32)         create Agent(k);
(33)       EndIf
(34)       Agent(k).subscribe(p_in);
(35)     EndIf
(36)   EndFor
(37) EndFor
(c) Lead Generation
(38) For each op_intf ∈ OP_intf(C)
(39)   If (op_intf.isBound())
(40)     L_ps ← op_intf.generateLeads();
(41)     L.add(L_ps);
(42)   EndIf
(43) EndFor
(44) return L;

Alg-2: Node Agent registering publication of p_out
publish(p_out)
Input: Output parameter p_out of the data type that this agent represents.
(01) publishers.add(p_out);
(02) If (CompositionalChildren ≠ ∅)
(03)   For each n ∈ CompositionalChildren
(04)     If (n ∈ O ∧ ¬∃Agent(n))
(05)       create Agent(n);
(06)     EndIf
(07)     Agent(n).publish(p_out);
(08)   EndFor
(09) EndIf
(10) If (inheritanceParent ≠ ∅)
(11)   If (inheritanceParent ∈ O)
(12)     If (¬∃Agent(inheritanceParent))
(13)       create Agent(inheritanceParent);
(14)     EndIf
(15)     Agent(inheritanceParent).publish(p_out);
(16)   EndIf
(17) EndIf
(18) If (subscribers ≠ ∅)
(19)   For each p_in ∈ subscribers
(20)     p_in.bonds.add(p_out);
(21)   EndFor
(22) EndIf

Alg-3: Node Agent registering subscription of p_in
subscribe(p_in)
Input: Input parameter p_in of the data type that this agent represents.
(01) subscribers.add(p_in);
(02) If (CompositionalParents ≠ ∅)
(03)   For each n ∈ CompositionalParents
(04)     If (n ∈ O ∧ ¬∃Agent(n))
(05)       create Agent(n);
(06)     EndIf
(07)     Agent(n).subscribe(p_in);
(08)   EndFor
(09) EndIf
(10) If (inheritanceChildren ≠ ∅)
(11)   For each n ∈ inheritanceChildren
(12)     If (n ∈ O ∧ ¬∃Agent(n))
(13)       create Agent(n);
(14)     EndIf
(15)     Agent(n).subscribe(p_in);
(16)   EndFor
(17) EndIf
(18) If (publishers ≠ ∅)
(19)   For each p_out ∈ publishers
(20)     p_in.bonds.add(p_out);
(21)   EndFor
(22) EndIf

Figure 3. Screening Algorithms
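The propagation rules of Alg-2 and Alg-3 can be illustrated with a small runnable sketch. It is not the authors' implementation: the `Ontology` and `NodeAgents` classes, the dictionary-based bookkeeping, and the toy fact that PGI2 is a kind of PG are all assumptions made for illustration. Bonds are kept in sets, so a pair reached through more than one node is recorded once.

```python
from collections import defaultdict


class Ontology:
    """Toy ontology index: one inheritance parent per node, has-a parts."""

    def __init__(self):
        self.inheritance_parent = {}                     # child -> parent
        self.compositional_children = defaultdict(set)   # whole -> parts

    def inheritance_children(self, node):
        return {c for c, p in self.inheritance_parent.items() if p == node}

    def compositional_parents(self, node):
        return {w for w, parts in self.compositional_children.items()
                if node in parts}


class NodeAgents:
    """Keeps publishers/subscribers per ontology node and the resulting bonds."""

    def __init__(self, ontology):
        self.onto = ontology
        self.publishers = defaultdict(set)    # node -> output parameters
        self.subscribers = defaultdict(set)   # node -> input parameters
        self.bonds = defaultdict(set)         # input parameter -> outputs

    def publish(self, node, p_out):
        if p_out in self.publishers[node]:    # already registered here
            return
        self.publishers[node].add(p_out)
        # Publication goes down the composition tree and up the inheritance tree.
        for part in self.onto.compositional_children.get(node, ()):
            self.publish(part, p_out)
        parent = self.onto.inheritance_parent.get(node)
        if parent is not None:
            self.publish(parent, p_out)
        for p_in in self.subscribers[node]:
            self.bonds[p_in].add(p_out)

    def subscribe(self, node, p_in):
        if p_in in self.subscribers[node]:
            return
        self.subscribers[node].add(p_in)
        # Subscription goes up the composition tree and down the inheritance tree.
        for whole in self.onto.compositional_parents(node):
            self.subscribe(whole, p_in)
        for child in self.onto.inheritance_children(node):
            self.subscribe(child, p_in)
        for p_out in self.publishers[node]:
            self.bonds[p_in].add(p_out)


# Hypothetical is-a relation: an output of type PGI2 can feed an input of type PG.
onto = Ontology()
onto.inheritance_parent["PGI2"] = "PG"
agents = NodeAgents(onto)
pg_in = ("StomachCell.produce_mucus", "pg_in")
pg_out = ("COX1.produce_PG", "pg_out")
agents.subscribe("PG", pg_in)
agents.publish("PGI2", pg_out)
print(agents.bonds[pg_in])
```

Because each call only touches the nodes reachable through inheritance or composition links, no pairwise comparison between all parameters is needed, which is the point of the publish/subscribe design.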

Table 2. Performance Comparison Our Screening Algorithm Top = O[Nop + Nws (min(Nsi logNop (1 + Noc ), Nop logNsi (1 + Noc ))] Tmp = O[Nop (Npin + Npout )log(|Ont|)]

Exhaustive Search

2 min(N N logN , N 2 logN )] Top = O[Nws oc si oc si si 2 N 2 min(N Tmp = O[Nws pin logNpout , Npout logNpin )] si

Figure 3 Alg-1 (a) assumes that $N_{op} > N_{si}$. If $N_{op}$ falls well below $N_{si}$, we can improve performance by changing Alg-1 (a) to iterate through the operation interfaces in $OP_{intf}(C)$ and check whether they are implemented by s.Operations. If we refer to the size of the collection s.Operations as |C|, then the time to carry out a hashtable-based check of the ∈ operation is O[log(|C|)]. Table 2 shows the performance difference between the algorithms used in our screening phase for fixed-scope mining and a traditional exhaustive search algorithm. We choose fixed-scope mining for the comparison because its cost is an upper bound on that of incremental mining, given the same number of Web services involved in the mining. Table 2 shows that when $N_{op}$ is relatively small and stable compared to $N_{ws}$, T in our filtering algorithm is linear in $N_{ws}$, while T in a traditional exhaustive search grows quadratically with $N_{ws}$ (the $N_{ws}^2$ factor).

2007 IEEE International Conference on Web Services (ICWS 2007) 0-7695-2924-0/07 $25.00 © 2007

[Figure 4: four panels, each plotted against Number of Operations (0 to 5000) with one curve per |Ont| = 5000, 10000, 20000, 50000: (a) Number of Bound Operations, (b) Average Interestingness, (c) Number of Interesting Compositions, (d) Percent of Interesting Compositions.]

Figure 4. Effects of Key Variables
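The idea of iterating over whichever collection is smaller and probing the other one can be illustrated with a minimal sketch. The function name and the use of Python sets are assumptions made for illustration; they are not taken from Alg-1. Python sets are hash-based, so each membership check costs expected constant time rather than the O[log(|C|)] assumed in the analysis above.

```python
from typing import Set


def implemented_interfaces(service_ops: Set[str], context_interfaces: Set[str]) -> Set[str]:
    """Return the context operation interfaces that a service implements.

    Hypothetical helper: iterates over the smaller collection and probes the
    larger one, so the number of membership checks is
    min(len(service_ops), len(context_interfaces)).
    """
    if len(service_ops) <= len(context_interfaces):
        smaller, larger = service_ops, context_interfaces
    else:
        smaller, larger = context_interfaces, service_ops
    # Each "in" test is a hash lookup on the larger collection.
    return {op for op in smaller if op in larger}


if __name__ == "__main__":
    service = {"produce PG", "block COX1"}
    context = {"produce PG", "produce TxA2", "produce mucus", "liberate AA"}
    print(implemented_interfaces(service, context))
```

Swapping the roles of the two collections changes only the number of probes, not the result, which is why the choice can be made per service at run time.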

4 Simulation Results

We study the effects of the variables listed in Table 3 on discovery output variables, including the number of completely bound operations, the number of interesting compositions, and the average values of their interestingness. We focus on the interestingness of compositions obtained through indirect recognition, since they require more computation according to Eqs. (5), (6) and (7).

For each domain operation, we generate its input/output parameters such that the number of these parameters falls uniformly in the range of 0 to 5. Each of these parameters is associated with a Domain Ontology Index Node (DOIN), which is identified by a sequence number. For simplicity, we flatten all the DOINs (i.e., no inheritance or composition relationships among ontology nodes) so that only exact matches are considered. We place these DOINs in a circular buffer so that the last sequence number is next to the first one. To simulate the cohesive nature of DOINs in a domain, we pick them for the domain using a Gaussian distribution around a mean sequence number that is chosen for the domain at random from a uniform distribution. We assume that each parameter has an equal chance of being associated with a DOIN.

To simulate the pre- and postconditions, each parameter is symbolically given a range chosen at random between 0 and 1 using a uniform distribution. We use the overlap of two such ranges (see Eq. (4)) to calculate the contribution of these conditions towards the similarity of two operations. When calculating interestingness, we assign a bigger weight to surprisingness ($c_s$) than to novelty ($c_n$) because $c_s$ is more sensitive to increases in the total number of operations. Finally, we use an interestingness threshold of 0.5 to determine whether a composition is interesting. This threshold can be changed as needed to reflect the level of interestingness considered acceptable.

Table 3. Experiment Settings

Variable                               Value or Range
  $\varepsilon_0$                      0.1
  Number of domains                    50
  Expected max # domains in comp.      5
  Operation interfaces per domain      5 − 100
  Input parameters per operation       0 − 5
  Output parameters per operation      0 − 5
  Pre/Post-condition range (float)     0 − 1
  Domain ontology index nodes |Ont|    5000, 10000, 20000, 50000
  $c_c$ / $c_p$ / $c_n$ / $c_s$        0.4 / 0.6 / 0.4 / 0.6
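The DOIN assignment used in the simulation (a uniformly chosen mean per domain, Gaussian spread around it, and wrap-around on a circular buffer) can be sketched as follows. This is an illustrative sketch, not the authors' simulator: the function names and the standard deviation `sigma` are assumptions, since the paper does not state the spread that was used.

```python
import random
from typing import List, Tuple


def sample_domain_doins(num_doins: int, count: int, sigma: float,
                        rng: random.Random) -> Tuple[int, List[int]]:
    """Pick `count` DOIN sequence numbers for one domain.

    The domain mean is drawn uniformly; each DOIN is the mean plus a Gaussian
    offset, wrapped modulo `num_doins` so the last sequence number is adjacent
    to the first. `sigma` is an assumed parameter.
    """
    mean = rng.randrange(num_doins)
    picks = []
    for _ in range(count):
        offset = round(rng.gauss(0.0, sigma))
        picks.append((mean + offset) % num_doins)  # wrap around the circular buffer
    return mean, picks


def circular_distance(a: int, b: int, num_doins: int) -> int:
    """Shortest distance between two sequence numbers on the circular buffer."""
    d = abs(a - b) % num_doins
    return min(d, num_doins - d)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    mean, picks = sample_domain_doins(num_doins=5000, count=10, sigma=50.0, rng=rng)
    print(mean, picks)
```

A smaller `sigma` makes the parameters of one domain cluster on fewer DOINs, which is one way to model stronger cohesion within a domain.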

[Figure 5 diagram. Legend entries: entity providing the Web service (P), Web service providing the operation (S), operation (O), input to operation meeting precondition, output substance, operation postcondition. Services shown: Aspirin Service, COX1 Service, PLA2 Service, TBXAS1 Service, TxA2 Service, Mucus Service, Gastric Juice Service, Stomach Cell Service. Operations shown: liberate AA, produce PG, produce TxA2, produce mucus, block COX1, cover stomach wall, deplete mucus, erode stomach cell, vasoconstriction (contributes to heart attack and stroke). Other labels: locale = platelet, locale = stomach cell, probability covered f(q_m), probability not covered 1 - f(q_m).]

Figure 5. Discovered Pathways

Figure 4 (a) shows that the average number of bound operations tends to increase as the number of operations increases. This is because as more operations become available, their parameters have a higher chance of meeting at a DOIN. However, this chance tends to decrease as the number of DOINs increases.

Figure 4 (b) shows that the average interestingness decreases significantly at the beginning as the number of operations increases. It eventually approaches a steady value as the number of operations continues to increase. The number of DOINs has a dampening effect on the initial rate of decrease in average interestingness. In fact, as more DOINs become available, operations are less likely to bind with one another (see Figure 4 (a)). However, compositions involving those that do become bound tend to be more interesting since they involve domains that are less correlated.

Figure 4 (c) illustrates the relationship between the number of interesting compositions and the number of DOINs. Because of the same dampening effect of the number of DOINs illustrated in (b), we see that as the number of DOINs increases, not only does the peak move to the right, but the reduction in the number of interesting compositions also decreases. While the number of interesting compositions declines with some randomness as the number of operations increases, this randomness disappears if we look at how the percentage of interesting compositions over the number of bound operations changes (Figure 4 (d)). The smoothing effect observed in (d) is due to the fact that the expected increase in the number of bound operations follows random changes similar to those manifested in (c).

In a separate setting, we have conducted preliminary experiments to assess the effectiveness of our screening algorithms. We applied our algorithms to a list of simplified Web services put together by reverse engineering online resources such as [1] on Aspirin, COX, prostaglandin (PG), etc., and were able to identify the potential pathway leads shown in Figure 5. Figure 5 shows a pathway from COX1 to the damaging vasoconstriction operation of TxA2 and another pathway from COX1 to the beneficial service provided by Mucus. It is interesting to see that Aspirin not only benefits us by blocking the pathway leading to problems such as heart attack and stroke, but also brings the unwanted side effect of stomach ulcers by blocking the pathway leading to mucus production. Due to page limitations, we leave out details on how an invocation plan can be constructed for verification purposes from the root operation in a pathway lead tree.
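To make the notion of a pathway lead tree concrete, the following minimal sketch builds a tree in the spirit of generateLeads(): the children of an operation are the operations whose output parameters match its input parameters. It assumes flattened ontology nodes (exact matches only), and the class name, function name, and toy parameter assignments are illustrative assumptions loosely inspired by labels in Figure 5, not the authors' service descriptions or algorithms.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional


class OntologyIndex:
    """Flat index (hypothetical): ontology node -> operations publishing it as output."""

    def __init__(self) -> None:
        self.publishers = defaultdict(set)

    def publish(self, node: str, operation: str) -> None:
        self.publishers[node].add(operation)


def generate_leads(root: str, inputs_of: Dict[str, List[str]], index: OntologyIndex,
                   seen: Optional[FrozenSet[str]] = None) -> dict:
    """Return a nested dict: each child produces an input consumed by its parent."""
    seen = (seen or frozenset()) | {root}
    children = {}
    for node in inputs_of.get(root, ()):
        for producer in sorted(index.publishers.get(node, ())):
            if producer not in seen:  # skip operations already on this path (no cycles)
                children[producer] = generate_leads(producer, inputs_of, index, seen)
    return children


# Toy data (assumed for illustration only).
outputs_of = {
    "liberate AA": ["Liberated Arachidonic Acid"],
    "produce PG": ["PGG2"],
    "produce TxA2": ["TxA2"],
}
inputs_of = {
    "produce PG": ["Liberated Arachidonic Acid"],
    "produce TxA2": ["PGG2"],
    "vasoconstriction": ["TxA2"],
}

index = OntologyIndex()
for operation, nodes in outputs_of.items():
    for node in nodes:
        index.publish(node, operation)

tree = generate_leads("vasoconstriction", inputs_of, index)
print(tree)
```

Rooting the tree at the operation of interest and walking backwards through matching outputs keeps the search bounded by what is actually published at each ontology node, rather than by all pairs of operations.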

5 Conclusion

In this paper, we proposed to use Web service mining to discover interesting compositions of existing Web services. We presented a novel Web service mining framework and algorithms that can be used to automatically screen for Web service compositions. We also presented the concept of interestingness of these compositions and proposed objective measures to evaluate it. Ongoing work includes verification and evaluation of pathway leads. We will also extend the framework to take advantage of a post-mining usefulness measure to help re-adjust the mining process and steer it along a more fruitful course.

References

[1] Aspirin. http://www3.interscience.wiley.com:8100/legacy/college/boyer/0471661791/cutting edge/aspirin/aspirin.htm.
[2] P. Ball. Designing the Molecular World - Chemistry at the Frontier. Princeton University Press, Princeton, New Jersey, 1994.
[3] B. Medjahed, A. Bouguettaya, and A. K. Elmagarmid. Composing web services on the semantic web. The VLDB Journal, September 2003.
[4] G. Zheng and A. Bouguettaya. Web service modeling for biological processes. In 3rd International Workshop on Biological Data Management (BIDM), Copenhagen, Denmark, August 2005.
