PhD Dissertation

International Doctorate School in Information and Communication Technologies

DISI - University of Trento

Query Answering over Contextualized RDF/OWL Knowledge with Expressive Bridge Rules: Decidable Classes

Mathew Joseph

Advisor: Dr. Luciano Serafini, Centre for Information Technology, Fondazione Bruno Kessler-IRST

April 2015

Defense Committee
Prof. Alex Borgida, Faculty of Computer Science, Rutgers University, NJ, USA
Prof. Paolo Bouquet, Department of Information Science & Engineering (DISI), University of Trento, TN, Italy
Dr. Jerome Euzenat, INRIA, Grenoble, Rhône Alpes, France
Prof. Enrico Franconi, Faculty of Computer Science, Free University of Bozen-Bolzano, Italy

Abstract

In this thesis, we study the problem of reasoning and query answering over contextualized knowledge in quad format augmented with expressive forall-existential bridge rules. Such bridge rules contain conjunctions and existentially quantified variables in the head, and are strictly more expressive than the bridge rules considered so far in similar settings. A set of quads together with forall-existential bridge rules is called a quad-system. We show that query answering over quad-systems in their unrestricted form is undecidable in general. We propose various subclasses of quad-systems for which query answering is decidable. Context-acyclic quad-systems do not allow the context dependency graph of the bridge rules to have cycles passing through triple-generating (value-generating) contexts, and hence guarantee that the chase (deductive closure) is finite. The csafe, msafe, and safe classes of quad-systems restrict the descendance graphs of the Skolem blank nodes generated during the chase process to be directed acyclic graphs (DAGs) of bounded depth, and hence have finite chases. RR and restricted RR quad-systems do not allow the creation of Skolem blank nodes, and hence restrict the chase to be of polynomial size. Besides the undecidability result for unrestricted quad-systems, tight complexity bounds have been established for each of the classes we introduce. We then compare the problems we address and the classes we derive for quad-systems with the analogous problems and classes in the realm of forall-existential rules. We show that the query answering problem over quad-systems is polynomially equivalent to the query answering problem over ternary forall-existential rules, and that the technique of safety we propose is strictly more expressive than existing well-known techniques, such as joint acyclicity and model faithful acyclicity, used for decidability guarantees in the realm of forall-existential rules.

Keywords [Contextualized RDF/OWL, Contextualized Knowledge Bases, Quads, Query Answering, Multi-Context Systems, Forall-Existential Rules, Datalog+-, Description Logics, Semantic Web, Knowledge Representation]


Acknowledgements

Firstly, I thank the Almighty for extending all these gifts in this life. I express my gratitude to the members of the thesis defence committee for their careful reading of the manuscript, and for all the criticism and comments that led me to improve the quality of this thesis. Also important was all the mentoring and personal advice received from Prof. Gabriel Kuper over these years. I would like to remember all the nostalgic moments spent with the former and current members of the DKM, Shell group in FBK, namely Loris Bozzato, Francesco Corcoglionitti, Chiara Ghidini, Chiara di Francesco Marino, Marco Rospocher, Martin Homola, Nahid Mahbub, Andrei Tamilin, Volha Bryl, Gaetano Calabrese, Tahir Khan, Zolzaya Dashdorj, Giulio Petrucci, and Roberto Tiella (SE group). The time spent at the University of Bremen, where I did my internship under the guidance of Prof. Till Mossakowski, and the time spent in close vicinity with Oliver Kutz and Christoph Lange in Spring 2012, were also invaluable. The night bashes with my most lovable friends Matteo Aluigi, Gideon Njarko, Guido Sbrogio, Paolo Calanca, Aurora Sartori, Elisa Abetini, and Orlazzo Orlazzi are unforgettable. I also gratefully acknowledge all of my friends from the Indian community in Trento, Anil Kumar, Pradeep Warrier, Ajay tripathy, Manish Jain, Nainesh, Rupali Patel, Rohan, Deepa Fernandez, Soudip, Niyati Roy Chowdhury, Tinku, Sajna Basheer, Swaytha Sasidharan, Lejo Joseph, and Rahul, with whom I organized all the Indian festivals, and cooked and shared so many recipes. I also remember my friends Anna and Adam from Wroclaw, and Christian and Lisa from Innsbruck, who often visited me and made my days in Trento a “gem of my life”. Most importantly, I would like to thank my scientific guru, Luciano Serafini, for all the encouragement and scientific guidance, for having always been open to discussions and personal advice, and for being such a super cool advisor over these years.

Above all, I am deeply indebted to my parents, especially my mother who passed away recently, whose overwhelming love, warmth, and advice have given me the strength to get through the thick and thin of this life. I also acknowledge my sister and family for the immense moral support in my low moments.


Contents

1 Introduction
  1.1 The Context
  1.2 The Problem and Solution Overview
    1.2.1 Thesis Applications and Similar Problem Formulations
    1.2.2 Publications
  1.3 Structure of the Thesis

2 Semantic Web Languages and Query Answering
  2.1 Semantic Web Languages
    2.1.1 RDF Preliminaries
    2.1.2 OWL Preliminaries
    2.1.3 OWL 2 RL Profile
    2.1.4 OWL 2 EL Profile
    2.1.5 OWL-Horst Extension to RDF
    2.1.6 Translations of OWL Statements to First Order Logic Statements
    2.1.7 Forall-Existential (∀∃) Rules
  2.2 Query Answering over Ontologies
    2.2.1 Chase of an Ontology
    2.2.2 Complexity Measures of Query Answering
  2.3 Computational Complexity Fundamentals

3 Contextual Representation and Reasoning for Semantic Web: A Review on Existing Frameworks
  3.1 Distributed Description Logics
  3.2 E-connections
  3.3 Contextualized Knowledge Repository
  3.4 Thesis Advancements
    3.4.1 Conjunctive Bridge Rules
    3.4.2 Heterogeneous Bridge Rules
    3.4.3 Value Inventing Bridge Rules
    3.4.4 Contextual Conjunctive Queries

4 Query Answering over Quad-Systems and its Undecidability
  4.1 Quad-Systems
  4.2 Query Answering on Quad-Systems
    4.2.1 Undecidability of Query Answering on Quad-Systems

5 Context Acyclic Quad-Systems: Decidability via Acyclicity
  5.1 Context Acyclic Quad-Systems: A Decidable Class
  5.2 Context Acyclic Quad-Systems: Computational Properties

6 Csafe, Msafe, and Safe Quad-Systems: Restricting the Descendency Structure of Skolem Blank-nodes
  6.1 Csafe, Msafe, and Safe Quad-Systems: Decidable Classes
  6.2 Csafe, Msafe, and Safe Quad-Systems: Computational Properties
  6.3 Procedure for Detecting Safe/Msafe/Csafe Quad-Systems

7 Range Restricted Quad-Systems
  7.1 Restricting to Range Restricted BRs
  7.2 Restricted RR Quad-Systems

8 Quad-Systems vs Forall-Existential rules
  8.1 Weak Acyclicity
  8.2 Joint Acyclicity
  8.3 Model Faithful Acyclicity (MFA)

9 Related work
  9.1 Contexts and Distributed Logics
  9.2 Temporal/Annotated RDF
  9.3 Description Logic Rules
  9.4 ∀∃ rules, Tuple Generating Dependencies, Datalog+- rules
  9.5 Data integration
  9.6 Distributed/Federated SPARQL Querying

10 Summary and Conclusion

Bibliography

A Appendix
  A.1 Appendix of Chapter 2
    A.1.1 RDF and RDFS Inference Rules
    A.1.2 Ontology with only Infinite Models
  A.2 Appendix of Chapter 4
  A.3 Appendix of Chapter 6

List of Tables

1.1 Domain expansion inference rules of CKR$_{RDF}$ [75]
2.1 Semantics of OWL constructs
2.2 First order translation of DL concepts
2.3 First order translation of DL statements for DLs with simple roles
8.1 Edges induced in the dependency graph due to OWL-Horst inferencing
10.1 Complexity info for various quad-system fragments
A.1 RDF rules
A.2 RDFS rules

List of Figures

1.1 Three different contexts resulting from three different viewpoints on the same object
1.2 Architecture of a Data Integration System
1.3 Architecture of a P2P Data Exchange System
1.4 CKR architecture
2.1 Visual rendering of a sample RDF graph
2.2 Finite and infinite model of an EL ontology
2.3 RDF graph translation of OWL ontology
4.1 A CCQ over quad-system
4.2 A sample CCQ: Intersecting objects in different contexts
5.1 Bridge rule: A mechanism for specifying propagation of knowledge between contexts
5.2 Context dependency graph
5.3 Saturation of contexts
6.1 Descendance graph of _:b4 in Example 7. Note: n.d. labels are not shown
6.2 Descendance graph of Fig. 6.1 unraveled into a tree. Note: n.d. labels are not shown
8.1 Dependency graph of the quad-system in Example 7 of Chapter 6
8.2 Context dependency graph of the quad-system in Example 7 of Chapter 6
10.1 Landscape of classes for quad-systems and ternary ∀∃ rules
A.1 An infinite model

Chapter 1

Introduction

1.1 The Context

In this thesis, we describe the challenges faced, techniques applied, and results obtained in an attempt to extend the frontiers of query answering, representation, and reasoning on contextually sensitive knowledge, with particular focus on applications in the realm of the Semantic Web (SW): the extension of the World Wide Web that adds the power of logical reasoning by leveraging knowledge representation (KR) [18] languages with formal semantics, such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL). With the proliferation of semantic knowledge on the web contributed by disparate users, pushing the envelope of contextual reasoning and query answering is key to the successful realization of the SW. With this goal in mind, we have applied techniques, tools, and established theories from disciplines such as artificial intelligence, knowledge representation, and databases.

1.2 The Problem and Solution Overview

Businesses and organizations that leverage the SW, its languages such as RDF and OWL, and its interlinked constellation of ontologies, and that exploit semantic technologies to provide a richer set of services to their consumers, are more numerous than ever before. One of the main reasons for this widespread acceptance of the SW is its ‘open’ model. The model is called open because it allows anyone, anywhere in the world, to freely publish knowledge artifacts, and allows web portals and repositories to open unrestricted access points to their semantic data. It is presumed that each knowledge contributor publishes his or her perception of a particular domain as an ontology in the SW, and this perception is very much relative to the contributor. Moreover, the SW imposes no arbitration mechanism or stipulated qualifying criteria on user-provided content. Hence, knowledge in the SW is referred to as ‘context-dependent’, as the truth value of any piece of knowledge is often associated with an implicit context in which that piece of knowledge is assumed to hold.

Example 1. Consider the triple (Bologna, rdf:type, Sausage) from the ontology available at http://athena.ics.forth.gr:9090/RDF/VRP/Examples/tap.rdf. Note that the truth value of this statement is relative, and depends on the viewpoint or socio-cultural background of its interpreter. For most people ‘Bologna’ is an Italian city, so for people who hail from Italy or are not aware of culinary jargon, the statement is absurd. Hence, the statement is meaningful only when its context is made explicit, as in Cookery : (Bologna, rdf:type, Sausage) or UScuisine : (Bologna, rdf:type, Sausage).

As a result, a large number of initiatives have already been taken to extend the SW and its languages with support for the explication of contexts. One of the major outcomes of such initiatives is a knowledge format called quads, which extends the standard RDF triple with a fourth component that indicates the context in which the triple holds. A quad is a tuple c : (s, p, o), where (s, p, o) is a standard RDF triple and c is the identifier of the context in which the triple holds.
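The quad format described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the representation used later in the thesis: the `Quad` type, the `group_by_context` helper, and the `Geography` context with its `City` triple are hypothetical names introduced here. It shows that a set of quads can be read as a family of RDF graphs indexed by context identifier.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    """A quad c : (s, p, o): an RDF triple plus the context it holds in."""
    context: str
    subject: str
    predicate: str
    obj: str


def group_by_context(quads):
    """View a collection of quads as a mapping: context id -> set of triples."""
    graphs = defaultdict(set)
    for q in quads:
        graphs[q.context].add((q.subject, q.predicate, q.obj))
    return dict(graphs)


quads = [
    Quad("Cookery", "Bologna", "rdf:type", "Sausage"),
    Quad("UScuisine", "Bologna", "rdf:type", "Sausage"),
    # Hypothetical context and triple, added only for contrast.
    Quad("Geography", "Bologna", "rdf:type", "City"),
]

graphs = group_by_context(quads)
for context, triples in sorted(graphs.items()):
    print(context, sorted(triples))
```

Keeping the context as the first component means the same triple can appear in several contexts without conflict, which is the point of Example 1.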
As a result, more and more triple-stores are becoming quad-stores. Some of the popular quad-stores are 4store (http://4store.org) and OpenLink Virtuoso (http://virtuoso.openlinksw.com/rdf-quad-store/), and some of the currently popular triple-stores, like Sesame (http://www.openrdf.org/) and AllegroGraph (http://www.franz.com/agraph/allegrograph/), internally keep track of the contexts of triples. Recent initiatives in this direction have also extended existing formats like N-Triples to N-Quads [28], which RDF 1.1 has introduced as a W3C recommendation. The latest Billion Triple Challenge datasets have all been released in the N-Quads format. The following example illustrates a situation in which quads are handy for modeling multiple aspects of the same world when viewed from different angles.

[Figure 1.1: Three different contexts resulting from three different viewpoints on the same object. Labels: top view ($c_{tv}$), side view ($c_{sv}$), front view ($c_{fv}$).]

Example 2. Consider three knowledge creators viewing the same 3D object A from three different sides, as shown in Fig. 1.1. Since these persons view the object from orthogonal angles, their perception of the object varies depending on the side from which it is viewed. Suppose that the knowledge corresponding to these different perceptions is encoded in three different contexts: $c_{fv}$, $c_{sv}$, and $c_{tv}$. The fact that the persons taking the front and side views perceive the object A as a triangle, while the person taking the top view perceives it as a rectangle, is captured by the following set of quads:

$c_{fv}$ : (A, rdf:type, Triangle), $c_{sv}$ : (A, rdf:type, Triangle), $c_{tv}$ : (A, rdf:type, Rectangle)

If we also need to enforce the fact that an object cannot simultaneously be a triangle and a rectangle in any of the views, then this can be done with the following quads:

$c_{fv}$ : (Triangle, owl:disjointWith, Rectangle), $c_{sv}$ : (Triangle, owl:disjointWith, Rectangle), $c_{tv}$ : (Triangle, owl:disjointWith, Rectangle)

Note that the above situation cannot be modeled compactly, without loss of information, using plain RDF triples.

Another benefit of quads over triples is that they allow knowledge creators to specify meta-knowledge (which, as described by Schueler et al. [85], Mylopoulus et al. [77], and Lenat et al. [36], consists of various attributes of contexts) that further qualifies knowledge [30], and also allow users to query this meta-knowledge [85]. These attributes, which explicate the various assumptions under which knowledge holds, are also called context dimensions [36]. Examples of context dimensions are provenance, creator, intended user, creation time, validity time, geo-location, and topic. Having defined contextualized knowledge, as in $c_1$ : (Renzi, primeMinsiterOf, Italy), one can now declare in a meta-context mc statements such as mc : ($c_1$, creator, John) and mc : ($c_1$, expiryTime, “jun-2016”), which qualify the knowledge in context $c_1$, in this case by its creator and expiry time.

Another benefit of quads is that they open up interesting ways of querying a contextualized knowledge base. For instance, if context $c_1$ contains knowledge about the football World Cup 2014 and context $c_2$ about the football Euro Cup 2012, then the query “who beat Italy in both World Cup 2014 and Euro Cup 2012” can be formalized as the conjunctive query $c_1 : (x, \text{beat}, \text{Italy}) \wedge c_2 : (x, \text{beat}, \text{Italy})$, where $x$ is a variable.

From a reasoning point of view, the contextual demarcation in a set of quads allows for context-wise grouping and division of knowledge into decoupled components that can simultaneously be fed to parallel reasoners. The approach thus increases both scalability and efficiency, enabling applications to do practical reasoning on the mammoth amount of knowledge in the SW [17]. Besides the above flexibility, bridge rules [14] can be provided for inter-operating the knowledge in different contexts. Such rules are primarily of the form:

$$c : \phi \rightarrow c' : \phi' \qquad (1.1)$$

where $\phi$, $\phi'$ are both atomic concept (role) symbols and $c$, $c'$ are contexts. The semantics of such a rule is that if, for any $\vec{a}$, $\phi(\vec{a})$ holds in context $c$, then $\phi'(\vec{a})$ should hold in context $c'$, where $\vec{a}$ is a unary (binary) vector when $\phi$, $\phi'$ are concept (role) symbols.

Example 2 (Contd.). Going back to our 3D object example, let $c_{actual}$ be the context that describes the object from an actual, side-independent perspective. Then the fact “if an object is a triangle from both the front view and the side view, then the object is actually a pyramid” can intuitively be specified using the following bridge rule:

$$c_{fv} : (x, \text{rdf:type}, \text{Triangle}) \wedge c_{sv} : (x, \text{rdf:type}, \text{Triangle}) \rightarrow c_{actual} : (x, \text{rdf:type}, \text{Pyramid})$$
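The conjunctive query and the bridge rule above can both be read operationally as pattern matching over quads. The following Python sketch illustrates that reading under simplifying assumptions: quads are plain 4-tuples, variables are strings starting with `?`, and the rule has no existential variables. The helper names (`match`, `answers`, `apply_rule`) and the teams `TeamA` and `TeamB` are hypothetical; this is not the semantics or the procedure defined later in the thesis.

```python
def is_var(term):
    """Variables are written with a leading '?', e.g. '?x'."""
    return term.startswith("?")


def match(pattern, quad, binding):
    """Extend `binding` so that `pattern` equals `quad`; None if impossible."""
    extended = dict(binding)
    for p, v in zip(pattern, quad):
        if is_var(p):
            if extended.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return extended


def answers(patterns, quads):
    """All variable bindings that satisfy every quad pattern (a conjunction)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for quad in quads:
                extended = match(pattern, quad, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


def apply_rule(body, head, quads):
    """One application step of a bridge rule without existential variables."""
    derived = set()
    for binding in answers(body, quads):
        for pattern in head:
            derived.add(tuple(binding.get(t, t) for t in pattern))
    return derived - set(quads)


# Example 2 (Contd.): triangle from the front and the side implies pyramid.
views = {
    ("c_fv", "A", "rdf:type", "Triangle"),
    ("c_sv", "A", "rdf:type", "Triangle"),
    ("c_tv", "A", "rdf:type", "Rectangle"),
}
body = [("c_fv", "?x", "rdf:type", "Triangle"),
        ("c_sv", "?x", "rdf:type", "Triangle")]
head = [("c_actual", "?x", "rdf:type", "Pyramid")]
new_quads = apply_rule(body, head, views)

# The "who beat Italy in both tournaments" query, over hypothetical data.
matches = {
    ("c1", "TeamA", "beat", "Italy"),
    ("c1", "TeamB", "beat", "Italy"),
    ("c2", "TeamB", "beat", "Italy"),
}
query = [("c1", "?x", "beat", "Italy"), ("c2", "?x", "beat", "Italy")]
query_answers = answers(query, matches)
print(new_quads, query_answers)
```

Note that the rule body spans two source contexts, which already goes beyond the single-source form (1.1).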

Research Objectives of the Thesis

Although bridge rules of the form (1.1) serve the purpose of specifying knowledge interoperability from a source context $c$ to a target context $c'$, in many practical situations (a) there is a need to interoperate multiple source contexts with multiple target contexts, for which bridge rules of the form (1.1) are inadequate. Besides, (b) one would also want bridge rules to be able to create new values in target contexts. Hence, more expressive bridge rules are required to address these issues. The main research focus of the thesis is the problem of (contextual) reasoning and query answering over contextualized RDF/OWL knowledge (generally, quads) in the presence of forall-existential bridge rules. We bring to the notice of the reader that although the contextual reasoning problem has, to some extent, been touched upon by works such as Distributed Description Logics (DDL) [14], Klarman et al. [59], McCarthy et al. [74], and Distributed First Order Logic (DFOL) [42] in the Description Logic (DL) [5] and first order logic (FOL) settings, the bridge rules we consider in this thesis are more expressive, with conjunctions and existential quantifiers in them, and satisfy requirements (a) and (b) mentioned above.

Overview of the Solution Approach

From a computer science perspective, as with any problem, one of the first questions that we posed on first encountering our problem was: “is the problem solvable at all?” That is, can we devise algorithms on a general purpose computer or a Turing machine for solving the problem? As we later show, query answering and reasoning over quad-systems (each of which is a set of quads plus forall-existential bridge rules) is undecidable. This means that there cannot exist an algorithm for the problem that is sound, complete, and terminating. Hence, one of the immediate questions that arises is whether one can find large, meaningful subclasses of quad-systems for which the reasoning and query answering problem is decidable. The bulk of this thesis describes and exemplifies such classes.
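To give some intuition for why termination is the sticking point, the following Python sketch runs a simplified, Skolem-style forward chaining loop over quads with a round limit. It is an illustration under stated assumptions, not the distributed chase defined later in the thesis: rules are hypothetical triples `(body, existential_vars, head)`, variables start with `?`, and the contexts `c1`, `c2` and the `hasParent` rules are invented for the example. With a value-generating rule on a cycle between contexts, every round invents a fresh blank node and no fixpoint is reached within the limit; hitting the limit is only a symptom, since no bound can decide termination in general.

```python
from itertools import count


def is_var(term):
    return term.startswith("?")


def match(pattern, quad, binding):
    """Extend `binding` so that `pattern` equals `quad`; None if impossible."""
    extended = dict(binding)
    for p, v in zip(pattern, quad):
        if is_var(p):
            if extended.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return extended


def bindings_for(body, quads):
    """All bindings satisfying the conjunction of quad patterns in `body`."""
    bindings = [{}]
    for pattern in body:
        bindings = [e for b in bindings for q in quads
                    if (e := match(pattern, q, b)) is not None]
    return bindings


def chase(quads, rules, max_rounds):
    """Return (quads, finished); finished is False if the round limit is hit."""
    quads = set(quads)
    fresh = count()
    skolem = {}  # (rule index, existential var, body binding) -> blank node
    for _ in range(max_rounds):
        derived = set()
        for i, (body, exist_vars, head) in enumerate(rules):
            for binding in bindings_for(body, quads):
                full = dict(binding)
                key = tuple(sorted(binding.items()))
                for var in exist_vars:
                    if (i, var, key) not in skolem:
                        skolem[(i, var, key)] = f"_:b{next(fresh)}"
                    full[var] = skolem[(i, var, key)]
                for pattern in head:
                    derived.add(tuple(full.get(t, t) for t in pattern))
        if derived <= quads:  # nothing new: a fixpoint has been reached
            return quads, True
        quads |= derived
    return quads, False


start = {("c1", "a", "rdf:type", "Person")}
# Every person in c1 has some (unnamed) parent in c2, who is a person in c2.
invent = ([("c1", "?x", "rdf:type", "Person")], ["?y"],
          [("c2", "?x", "hasParent", "?y"), ("c2", "?y", "rdf:type", "Person")])
# Persons in c2 are propagated back to c1, closing a cycle through c2.
copy_back = ([("c2", "?x", "rdf:type", "Person")], [],
             [("c1", "?x", "rdf:type", "Person")])

acyclic_closure, acyclic_done = chase(start, [invent], max_rounds=6)
cyclic_closure, cyclic_done = chase(start, [invent, copy_back], max_rounds=6)
print(acyclic_done, len(acyclic_closure), cyclic_done)
```

The first run stops after inventing a single blank node, while the second keeps producing new ones; this is the kind of behaviour that the restricted classes described in the thesis are designed to rule out.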

One of the first steps to be taken is to provide a semantics for interpreting a quad-system. The semantics should be broad enough to interpret arbitrary queries from a commonly accepted query language, such as conjunctive queries. As we focus on the decision version of the problem, once a semantics is fixed, the query answering problem is to decide, for a given query, a vector of fixed size, and a quad-system, whether the quad-system entails (w.r.t. the fixed semantics) the expression that results from substituting the vector for the variables of the query.

We now briefly outline our solution approach. We first formulate a basic semantics for interpreting and reasoning with knowledge in a quad-system. For this, we follow existing approaches such as Distributed Description Logics [14], CKR [86, 16], E-connections [64], and the two-dimensional logic of contexts [59], and use a set of interpretation structures as a model for contextualized knowledge. In this way, the knowledge in each context is interpreted separately in a different interpretation structure. Based on this semantics, we derive procedures for conjunctive query answering. For this, we formulate the notion of a distributed chase, which is an extension of the standard chase [56, 1] that is widely used in the KR and DB settings for similar purposes. The main contributions of this thesis work are:

1. We extend the standard RDF/OWL semantics to a context-based semantics that can be used for reasoning over contextualized RDF/OWL knowledge. Studying conjunctive query answering over quad-systems, we show that the entailment problem of conjunctive queries is undecidable for the most general class of quad-systems, called unrestricted quad-systems.

2. We define a class of quad-systems called context acyclic quad-systems, for which query answering is decidable and can be done by a forward chaining procedure. The quad-systems in this class have the property that the dependency graph of the set of bridge rules does not have cycles going through triple-generating contexts. We give both the data and the combined complexity of conjunctive query entailment for this class.

3. We further extend the class of context acyclic quad-systems to larger decidable classes called csafe, msafe, and safe quad-systems, for which we give both the data and the combined complexity of conjunctive query entailment. These classes are based on the constrained DAG structure of the Skolem blank nodes generated during the chase construction. We also provide decision procedures to decide whether an input quad-system is safe (csafe, msafe) or not. In this case too, a forward chaining procedure based on the restricted version of the standard chase is provided for checking entailment of queries.

4. Subsequently, we derive less expressive classes, RR and restricted RR quad-systems, for which no Skolem blank nodes are generated during the chase construction. These classes are characterized by the property that no quad-system in them contains existentially quantified variables in its bridge rules.

5. We also show that the class of unrestricted quad-systems is equivalent to the class of ternary ∀∃ rule sets, which are the ∀∃ rule sets whose predicates have arity less than or equal to three. We compare the derived classes of quad-systems with well-known subclasses of ∀∃ rule sets, such as weakly acyclic, jointly acyclic, and model faithful acyclic rule sets. An important result is that the technique of safety that we propose subsumes these other techniques in expressivity, and hence can be used in the ∀∃ setting to derive expressive recognizable classes.

1.2.1 Thesis Applications and Similar Problem Formulations

At a time when the computer science discipline is challenged with the problem of managing massive and continuous data generation rates, techniques for accessing, integrating, exchanging, and inferencing over this data are the key to the success of present-day information systems. Hence, the corresponding RDF variants of these problems are also key bottlenecks that need to be addressed by the SW community. We identify the following areas: (i) RDF data integration, (ii) RDF data exchange, and (iii) distributed and contextual RDF frameworks, which have similar problem formulations, and we exemplify below how the results we derive in this thesis are relevant in these domains.

[Figure 1.2: Architecture of a Data Integration System. Labels: Query, Mediated Global Schema, Source 1, Source 2, ..., Source n.]

RDF Data Integration

Data integration [65, 37, 25] is the problem of accessing and querying a set of heterogeneous distributed local data sources using an intermediate global schema that acts as the mediator of access. The schema of the local sources $\Sigma_l$, known as the local schema, and the global schema $\Sigma_g$ are mapped with integration rules. Fig. 1.2 shows the architecture of a typical data integration scenario. In the traditional version of the problem, both $\Sigma_l$ and $\Sigma_g$ are sets of relation symbols. A typical solution approach is to translate queries over $\Sigma_g$ to queries over $\Sigma_l$ that can be evaluated on the local sources. Yet another solution approach is to materialize the global schema so that queries can directly be executed on it. Of late, the variant of the problem pertaining to the Semantic Web has drawn significant attention [26]. In this case, $\Sigma_l$ is an (indexed) set of RDF/OWL graphs whose members represent the local sources, and $\Sigma_g$ is an indexed set of RDF/OWL graphs whose members represent the global data sources. Furthermore, the typical architecture of data integration can be of one of the following three types: (i) Global as view (GAV) architecture, (ii) Local as view (LAV) architecture, and (iii) Global and Local as view (GLAV) architecture. In the GAV type, each global RDF graph is mapped to a (conjunctive) query over the set of local RDF graphs, whereas in the LAV variant, each local RDF graph is mapped to a (conjunctive) query over the set of global RDF graphs. In a GLAV setting, which is a generalization of both GAV and LAV, (conjunctive) queries over the set of local graphs are mapped to (conjunctive) queries over the set of global RDF graphs. As a set of quads Q can be seen as a set of RDF graphs indexed by the context identifiers in Q, the correspondence between quad-systems and GLAV RDF data integration lies in the fact that both integration rules and bridge rules are implications from a set of quad-patterns (called the body) to another set of quad-patterns (called the head). Hence, we deem the outcomes of this thesis to be straightforwardly transferable to the problem of RDF data integration.

Peer-to-Peer RDF Data Exchange

The classical peer-to-peer (P2P) data exchange setting [40, 2] is a system of relational databases (called peers) interconnected using schema mappings that specify various dependency relations between the peer schemas. The typical schema mappings considered are the ones in which a conjunctive query over a set of peer schemas is mapped to a conjunctive query over another set of peer schemas. A typical architecture is depicted in Fig. 1.3. Its variant in the realm of the SW [10], called the P2P RDF Data Exchange setting, is a system of RDF graphs interconnected with schema mappings that map a conjunctive query over a set of peer graphs to a conjunctive query over another set of peer graphs. A user query is typically a conjunctive query on any of the peers. The answer to the query is computed taking into account not only the knowledge in the peer, but also the mappings to the other peers. Since a set of quads can be seen as an indexed set of peer RDF graphs, and the bridge rules map conjunctive queries over a set of peer graphs to conjunctive queries over another set of peer graphs, the results we have in this thesis for quad-systems are directly portable to the realm of P2P RDF data exchange.

[Figure 1.3: Architecture of a P2P Data Exchange System. Labels: Query, Peers, Mappings.]

[Figure 1.4: CKR architecture. Labels: Meta Knowledge Base, dimensions $D_1$, $D_2$, ..., $D_n$, contexts $c_1$, $c_2$, $c_3$, ..., $c_{m-1}$, $c_m$ ordered by $\prec$, and object knowledge $K(c_1)$, $K(c_2)$, $K(c_3)$, ..., $K(c_{m-1})$, $K(c_m)$.]

RDF based Contextualized Knowledge Repositories (CKR$_{RDF}$)

CKR_RDF [75] is a general-purpose, RDF-based framework for modelling, reasoning, indexing, searching, and querying over contextualized knowledge in the SW. It is the RDF version of the more general CKR_SROIQ framework [86], which has computationally attractive properties such as decidability, materializability, and easy implementability using minor extensions of existing triple-stores (a prototype is available at https://dkm.fbk.eu/technologies/ckr). Its main facet is the organization of contexts into hierarchies, which allows additional inferences to be made in individual contexts. This is quite intuitive, as contexts represent real-world domains, and real-world domains can naturally be ordered into hierarchies. Hence, the language of CKR contains a cover relation ≺ on contexts that is a strict partial order. In addition, contexts are associated with context dimensions {D_i = ⟨D_i, ≺_i⟩}, i = 1, ..., n, each of which is a strict poset, with a strict partial order ≺_i over the set of values D_i. Examples of context dimensions are time, topic, and geolocation. The architecture of CKR_RDF is given in Fig. 1.4. Any knowledge statement is defined w.r.t. a context, and hence belongs to a context. Hence, in CKR_RDF, the most atomic piece of knowledge is a quad. The main component of the CKR is the meta knowledge base cmk, which is itself a context. It contains definitions that relate other contexts to their dimension values, statements about the dimensions D_i, and the properties of their cover relations ≺_i. The strict partial order property of ≺_i can be imposed by the following bridge rules (BRs):

cmk : (x1, ≺_i, z1) ∧ cmk : (z1, ≺_i, x2) → cmk : (x1, ≺_i, x2)
cmk : (x1, ≺_i, x1) →

where the latter BR is a negative constraint that states the negation of its body. The above BRs are instantiated for ≺_i, for i = 1, ..., n. The cover relation ≺ of contexts is defined on top of the cover relations ≺_i of the dimensions:

⋀_{i=1,...,n} [cmk : (c, D_i, v_i) ∧ cmk : (c′, D_i, v_i′) ∧ cmk : (v_i, ≺_i, v_i′)] → cmk : (c, ≺, c′)
cmk : (x1, ≺, z1) ∧ cmk : (z1, ≺, x2) → cmk : (x1, ≺, x2)
cmk : (x1, ≺, x1) →

Note that the second and third of the above rules impose the strict partial order on the context cover relation ≺. Each triple (s, p, o) in the object knowledge K(c) of context c is defined as a quad c : (s, p, o). Table 1.1 encodes the

g : (a, rdf:type, Cd) ∧ cmk : (g, ≺, h)      →  h : (a, rdf:type, Cd)
g : (a, Rd, b) ∧ cmk : (g, ≺, h)             →  h : (a, Rd, b)
g : (_:m, rdf:type, Cd) ∧ cmk : (g, ≺, h)    →  ∃y h : (y, rdf:type, Cd)
g : (_:m, Rd, b) ∧ cmk : (g, ≺, h)           →  ∃y h : (y, Rd, b)
g : (a, Rd, _:m) ∧ cmk : (g, ≺, h)           →  ∃y h : (a, Rd, y)
g : (_:m, Rd, _:n) ∧ cmk : (g, ≺, h)         →  ∃y1, y2 h : (y1, Rd, y2)

Table 1.1: Domain expansion inference rules of CKR_RDF [75].

set of domain expansion inference rules of CKR_RDF from [75] as a set of bridge rules. We refer the reader to [75] for details on the semantics and the other sets of inference rules. Since any CKR_RDF inference rule in [75] can be encoded as a bridge rule, it is easy to see that a CKR_RDF can be simulated using a quad-system.

1.2.2

Publications

Besides the aforementioned contributions, my work as a PhD student has led to the following publications; the reader should note that some of the content of the following chapters has been borrowed from them:

1. A. Tamilin, B. Magnini, L. Serafini, C. Girardi, M. Joseph, R. Zanoli. Context-driven Semantic Enrichment of Italian News Archive. In Proceedings of Extended Semantic Web Conference (ESWC-2010), In-use track. Lecture Notes in Computer Science, Volume 6088, pp. 364-378, 2010.
2. M. Joseph. A Contextualized Knowledge Framework for Semantic Web. In Proceedings of Extended Semantic Web Conference (ESWC-2010), PhD symposium track. Lecture Notes in Computer Science, Volume 6089, pp. 467-471, 2010.
3. M. Joseph, L. Serafini. Simple Reasoning for Contextualized RDF Knowledge. In Proceedings of Workshop on Modular Ontologies (WOMO-2011). Volume 230 of Frontiers in Artificial Intelligence and Applications, IOS Press, pp. 79-93, 2011.
4. M. Joseph, G. Kuper, L. Serafini. Query Answering over Contextualized RDF Knowledge with Forall-Existential Bridge Rules: Attaining Decidability using Acyclicity. In Proceedings of Italian Conference in Computational Logic (CILC-2014). Volume 1195 of CEUR Workshop Proceedings, pp. 210-224, CEUR-WS.org, 2014.
5. M. Joseph, G. Kuper, L. Serafini. Query Answering over Contextualized RDF/OWL Knowledge with Forall-Existential Bridge Rules: Attaining Decidability using Acyclicity. In Proceedings of International Conference in Web Reasoning and Rule Systems (RR-2014). Springer Lecture Notes in Computer Science, Volume 8741, pp. 60-75, 2014.
6. M. Joseph, G. Kuper, T. Mossakowski, L. Serafini. Query Answering over Contextualized RDF/OWL Knowledge with Forall-Existential Bridge Rules: Decidable Finite Extension Classes. Semantic Web Journal (IOS Press). To appear, 2015.

1.3

Structure of the Thesis

The thesis is structured as follows. In Chapter 2, we review the state-of-the-art ontology languages relevant for the SW, covering languages such as RDF, OWL, and the forall-existential rule fragment of first-order logic. We also give an account of query answering over these languages, touching on notions such as the chase and its variants, and briefly go through the computational complexity fundamentals relevant for this thesis. In Chapter 3, we review the existing frameworks for contextual knowledge modelling relevant to the SW, and then give an account of the shortcomings of these frameworks that motivate this thesis and its contributions. In Chapter 4, we formally describe the main problem dealt with in this thesis, namely query answering over contextualized RDF knowledge with forall-existential bridge rules, introduce notions such as quad-graphs, bridge rules, and quad-systems, and show that query answering on quad-systems is undecidable. In Chapter 5, we describe a subclass of quad-systems for which the query answering problem is decidable. The class, which we call context-acyclic quad-systems, ensures decidability by not allowing cyclic paths that involve blank-node-generating (triple-generating) contexts (TGCs) in the context dependency graph. In Chapter 6, we give more expressive classes of quad-systems, namely the csafe, msafe, and safe classes, which strictly subsume the context-acyclic quad-systems and are based on a bounded-depth DAG structure of the Skolem blank nodes generated in the chase. For tractable reasoning, we subsequently derive, in Chapter 7, the less expressive RR and restricted RR quad-systems, for which the data complexity of query answering is tractable. Neither of these classes allows the generation of Skolem blank nodes in its chase. In Chapter 8, we compare the classes we derived with well-known decidable classes in the realm of forall-existential rules. In Chapter 9, we detail the related work relevant for this thesis, and in Chapter 10 we summarize the results obtained.

Chapter 2

Semantic Web Languages and Query Answering

In this chapter, we give an overview of the SW concepts relevant to this thesis, introducing well-known notation and terminology from the literature. We review some of the ubiquitous languages used for representing knowledge in the context of the SW. We also briefly discuss the topic of query answering over knowledge defined using these languages. A few well-known complexity classes relevant to the discussion in this thesis are also concisely reviewed.

2.1

Semantic Web Languages

SW languages are KR languages with particular emphasis on the representation of, and reasoning about, knowledge and resources on the (world wide) web. Apart from being a formal logical language with semantics, the idiosyncratic feature of such a language is the use of uniform resource identifiers (URIs) for the constants in the language. A URI specifies a resource by name in a particular namespace, and can be used to identify a resource without implying its location or how to access it. A URI can denote anything, such as a person, a place, or, in general, a logical or physical object in the universe or on the web. Though proposals exist for encoding uniquely identifying information of the resources represented by URIs in their syntax, currently available web standards are not adequate for this.

2.1.1

RDF Preliminaries

Let U be the set of uniform resource identifiers (URIs), B the set of blank nodes (a.k.a. labelled nulls), and L the set of literals. The set C = U ∪ B ∪ L is called the set of (RDF) constants. Any (s, p, o) ∈ C × C × C is called a generalized RDF triple (from now on, just triple). A generalized RDF graph (from now on, just graph) is defined as a set of triples. For any graph g, U(g), B(g), L(g), C(g) denote, respectively, the sets of URIs, blank nodes, literals, and constants in g. Some of the commonly known syntaxes for serializing graphs are RDF/XML, RDF/JSON, N-Triples, Turtle, etc.

Example 1. The following is an example of a graph in Turtle syntax. The graph encodes the following information: the URI geonames:germany is related to the URI geonames:argentina by the property :defeats, the former and the latter are instances of the classes dbpedia:Champion and dbpedia:RunnerUp, respectively, and both of these classes are in turn instances of the meta class rdfs:Class.

    @prefix rdf: .
    @prefix rdfs: .
    @prefix dbpedia: .
    @prefix geonames: .
    @base: .
    geonames:germany rdf:type dbpedia:Champion ;
                     :defeats geonames:argentina .
    geonames:argentina rdf:type dbpedia:RunnerUp .
    dbpedia:Champion rdf:type rdfs:Class .
    dbpedia:RunnerUp rdf:type rdfs:Class .
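As an illustration of the definitions above, the following is a minimal Python sketch that stores the graph of Example 1 as a set of triples and computes U(g), B(g), L(g) and C(g). The string encoding is an assumption made only for this sketch: blank nodes are strings starting with "_:", literals are strings starting with a double quote, and every other string is treated as a URI written as a prefixed name.

```python
# Assumed encoding (not from the thesis): "_:x" is a blank node,
# '"..."' is a literal, anything else is a URI (prefixed name).

def is_blank(term):
    return term.startswith("_:")

def is_literal(term):
    return term.startswith('"')

def constants(g):
    """C(g): every term occurring in some triple of g."""
    return {term for triple in g for term in triple}

def blank_nodes(g):
    """B(g)."""
    return {t for t in constants(g) if is_blank(t)}

def literals(g):
    """L(g)."""
    return {t for t in constants(g) if is_literal(t)}

def uris(g):
    """U(g): the constants that are neither blank nodes nor literals."""
    return constants(g) - blank_nodes(g) - literals(g)

# The graph of Example 1 as a set of (s, p, o) triples.
g = {
    ("geonames:germany", "rdf:type", "dbpedia:Champion"),
    ("geonames:germany", ":defeats", "geonames:argentina"),
    ("geonames:argentina", "rdf:type", "dbpedia:RunnerUp"),
    ("dbpedia:Champion", "rdf:type", "rdfs:Class"),
    ("dbpedia:RunnerUp", "rdf:type", "rdfs:Class"),
}

print(sorted(uris(g)))
print(blank_nodes(g), literals(g))
```

Since the example graph contains neither blank nodes nor literals, C(g) coincides with U(g) here.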

The visual representation of the graph is given in Fig. 2.1. Note that the graph refers to terms in other ontologies in the SW, namely dbpedia and geonames, and also uses terms from the well-known RDF and RDFS vocabularies, which are interpreted in a standard way. Since a graph represents the state of affairs of a domain of interest, it is often referred to as an ontology.

Figure 2.1: Visual rendering of a sample RDF graph

Predefined vocabularies with commonly understood semantics exist for interpreting graphs that contain terms from these vocabularies. Most of the terms in these vocabularies correspond to logical operators whose semantics are adopted from well-known logical languages, such as DLs and first-order logic. Some well-known examples of semantics for interpreting graphs are the simple [50], RDF [50], RDFS [50], and OWL [79] semantics. Fragments of graphs are also defined, based on restrictions on the terms of these vocabularies that are permitted in the graphs. Examples are the OWL 2 profiles [76] (OWL 2 EL, OWL 2 QL, OWL 2 RL), the OWL-Horst fragment [90], and so on. The most basic semantics for interpreting graphs is the simple semantics, which does not take into account any terms from the RDF, RDFS, or OWL vocabularies. The simple semantics is defined using a simple interpretation structure, which is defined as follows:

Definition 2 (Simple Interpretation). A simple interpretation (structure) of a signature ⟨U, B, L⟩ is a tuple I_simple = ⟨IR, IP, IC, IEXT, ICEXT, LV, IS⟩, where:
1. IR is a nonempty set of objects, called the domain of I_simple;
2. IP ⊆ IR is a set of objects denoting properties;
3. IC ⊆ IR is a distinguished subset of IR denoting classes;
4. IEXT : IP → 2^(IR×IR) is a mapping that assigns to each property object a set of pairs of domain objects;
5. ICEXT : IC → 2^IR is a mapping that assigns to each class a set of domain objects;
6. LV ⊆ IR is a set of literal values for the literals in L;
7. IS : U ∪ L → IR, the interpretation mapping, is a map that assigns an object in IR to each element of U ∪ L.

The class of RDF (resp. RDFS) interpretations is a subclass of the class of simple interpretations with additional constraints on the interpretation of the RDF (resp. RDFS) primitives. For instance, one of the constraints that needs to be satisfied by an RDF interpretation is the following:

x ∈ IP ⇐⇒ ⟨x, IS(rdf:Property)⟩ ∈ IEXT(IS(rdf:type))

We refer the interested reader to Hayes [50] for an exhaustive list of the constraints associated with RDF and RDFS interpretations. Tables A.1 and A.2 of Section A.1.1 in the Appendix list the sets of RDF and RDFS inference rules, respectively.
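To make the components of Definition 2 concrete, the following Python sketch represents a finite simple interpretation as a data class and checks the RDF constraint on rdf:Property quoted above. The class name, the method names, and the toy interpretation are assumptions of this sketch; they are not part of the thesis or of any RDF library.

```python
from dataclasses import dataclass

@dataclass
class SimpleInterpretation:
    IR: set      # domain of objects
    IP: set      # objects denoting properties
    IC: set      # objects denoting classes
    IEXT: dict   # property object -> set of (x, y) pairs over IR
    ICEXT: dict  # class object -> set of objects of IR
    LV: set      # literal values
    IS: dict     # URI or literal -> object of IR

    def is_well_formed(self):
        """Check the typing conditions 1-7 of Definition 2."""
        pairs_ok = all(x in self.IR and y in self.IR
                       for ext in self.IEXT.values() for (x, y) in ext)
        classes_ok = all(ext <= self.IR for ext in self.ICEXT.values())
        return (bool(self.IR)
                and self.IP <= self.IR and self.IC <= self.IR
                and self.LV <= self.IR
                and set(self.IEXT) <= self.IP and set(self.ICEXT) <= self.IC
                and pairs_ok and classes_ok
                and set(self.IS.values()) <= self.IR)

    def satisfies_property_condition(self):
        """x in IP  iff  (x, IS(rdf:Property)) in IEXT(IS(rdf:type))."""
        type_ext = self.IEXT.get(self.IS["rdf:type"], set())
        prop = self.IS["rdf:Property"]
        return all((x in self.IP) == ((x, prop) in type_ext) for x in self.IR)

# A toy interpretation with two properties (hypothetical object names).
I = SimpleInterpretation(
    IR={"type", "Property", "defeats", "de", "ar"},
    IP={"type", "defeats"},
    IC=set(),
    IEXT={"type": {("type", "Property"), ("defeats", "Property")},
          "defeats": {("de", "ar")}},
    ICEXT={},
    LV=set(),
    IS={"rdf:type": "type", "rdf:Property": "Property", ":defeats": "defeats",
        "geonames:germany": "de", "geonames:argentina": "ar"},
)

print(I.is_well_formed(), I.satisfies_property_condition())
```

Removing the pair ("defeats", "Property") from the extension of "type" leaves the structure a well-formed simple interpretation, but it no longer satisfies the RDF constraint.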

Definition 3 (Model of a Graph). A simple (resp. RDF, resp. RDFS) interpretation I_simple (resp. I_rdf, resp. I_rdfs) = ⟨IR, IP, IC, IEXT, ICEXT, LV, IS⟩ is a model of a graph g, in symbols I_simple |=simple g (resp. I_rdf |=rdf g, resp. I_rdfs |=rdfs g), if there is a map A : B(g) → IR s.t. for every triple (s, p, o) ∈ g, we have that ⟨IS+A(s), IS+A(o)⟩ ∈ IEXT(IS+A(p)), where IS+A(x) = IS(x) if x ∈ U(g) ∪ L(g), and IS+A(x) = A(x) otherwise.

Entailment from a graph g to a triple, or to another graph, is defined as follows:

Definition 4 (Simple, RDF, RDFS entailment). A graph g simple-entails (resp. RDF-entails, resp. RDFS-entails) a triple (s, p, o), in symbols g |=simple (s, p, o) (resp. g |=rdf (s, p, o), resp. g |=rdfs (s, p, o)), iff for any simple interpretation I_simple (resp. RDF interpretation I_rdf, resp. RDFS interpretation I_rdfs), if I_simple |=simple g (resp. I_rdf |=rdf g, resp. I_rdfs |=rdfs g), then I_simple |=simple (s, p, o) (resp. I_rdf |=rdf (s, p, o), resp. I_rdfs |=rdfs (s, p, o)). A graph g simple-entails (resp. RDF-entails, resp. RDFS-entails) another graph g′ iff g |=simple (s, p, o) (resp. g |=rdf (s, p, o), resp. g |=rdfs (s, p, o)) for every (s, p, o) ∈ g′.

2.1.2

OWL Preliminaries

Although the RDF and RDFS vocabularies and their semantics enabled the specification of non-trivial ontologies in the SW, the quest for a more expressive language to specify more complex ontology axioms led to the development of the OWL language. Consequently, the OWL vocabulary [11] and its semantics [79] were proposed. Its vocabulary contains terms that correspond to the logical constructs of DLs, and its syntax and semantics are largely adopted from DLs. The profiles OWL Lite and OWL DL, which were part of the initial release of OWL, are based on the DLs SHIF(D) and SHOIN(D), respectively. OWL 2 DL, the successor and extension of OWL DL, is based on the DL SROIQ(D) [53]. We start by describing the syntax of OWL 2 DL, and subsequently show how some of its fragments can be derived using syntactic restrictions.

An OWL signature is given by the 4-tuple ⟨ΣC, ΣP, ΣI, ΣL⟩, where ΣC is a set of atomic concepts, ΣP is a set of atomic roles, ΣI is a set of individuals, and ΣL is the set of literals. An OWL concept C over an OWL signature Sig = ⟨ΣC, ΣP, ΣI, ΣL⟩ is inductively defined as:

C := A | C ⊓ C | C ⊔ C | ¬C | ∃R.C | ∀R.C | ≥ nR.C | ≤ nR.C | ∃R.Self | {a1, a2, ..., am} | ⊤ | ⊥

where A ∈ ΣC, R is an OWL role (see below) over the signature Sig, a1, ..., am ∈ ΣI, n is a natural number, the top concept ⊤ represents all the objects in the domain, and the bottom concept ⊥ has no individuals. An OWL role R over the signature Sig is defined as:

R := P | R⁻ | R ◦ R ◦ ... ◦ R | U

where P ∈ ΣP, and U is the universal role, which is equivalent to ⊤ × ⊤. An OWL ontology over an OWL signature Sig is a triple O = ⟨T, R, A⟩, where T is a set of statements of the form C ⊑ D, where C, D are OWL concepts over Sig, and R is a set of statements of the form R ⊑ S or Disjoint(R, S), where R, S are OWL roles over Sig. Note that constructs such as Transitive(R), Symmetric(R), Reflexive(R), Irreflexive(R), and Functional(R) can be expressed as R ◦ R ⊑ R, R⁻ ⊑ R, ⊤ ⊑ ∃R.Self, ∃R.Self ⊑ ⊥, and ⊤ ⊑ ≤ 1R.⊤, respectively. OWL 2 DL [51] has further restrictions on concept constructors, role constructors, and assertions. It requires role hierarchies to be regular, i.e. it does not permit cyclic hierarchies of the form R ⊑ R1, R1 ⊑ R2, ..., Rn ⊑ R, and it further constrains the roles in cardinality restrictions to be simple (simple roles are roles that are not implied by the composition of other roles). A is a set of statements of the form:

C(a) | R(a, b) | ¬P(a, b) | a = b | a ≠ b

where C is an OWL concept over Sig, P ∈ ΣP, R is an OWL role over Sig, a is an individual over Sig, and b is an individual or a literal over Sig.

Construct type   Syntax      Semantics
Concept          A ∈ ΣC      A^I ⊆ Δ^I
                 ⊤           Δ^I
                 ⊥           ∅
                 ¬C          Δ^I \ C^I
                 ∃R.C        {x | (x, y) ∈ R^I & y ∈ C^I}
                 ∀R.C        {x | (x, y) ∈ R^I implies y ∈ C^I}
                 ≥ nR.C      {x | #{y | (x, y) ∈ R^I & y ∈ C^I} ≥ n}
                 ≤ nR.C      {x | #{y | (x, y) ∈ R^I & y ∈ C^I} ≤ n}
                 {a}         {a^I}
                 C ⊓ D       C^I ∩ D^I
                 C ⊔ D       C^I ∪ D^I
                 ∃R.Self     {x | (x, x) ∈ R^I}
Role             P ∈ ΣP      P^I ⊆ Δ^I × Δ^I
                 R⁻          {(y, x) | (x, y) ∈ R^I}
                 U           Δ^I × Δ^I
                 ¬R          Δ^I × Δ^I \ R^I
                 R ◦ S       R^I ◦ S^I
Individual       a ∈ ΣI      a^I ∈ Δ^I
Plain Literal    l ∈ ΣL      l^I = l ∈ Δ^I
Assertions       C ⊑ D       C^I ⊆ D^I
                 R ⊑ S       R^I ⊆ S^I
                 C(a)        a^I ∈ C^I
                 R(a, b)     (a^I, b^I) ∈ R^I
                 a = b       a^I = b^I
                 a ≠ b       a^I ≠ b^I

Table 2.1: Semantics of OWL constructs
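The concept rows of Table 2.1 can be read as a recursive evaluation procedure over a finite interpretation. The following Python sketch is one possible rendering under several assumptions that are not part of the thesis: concepts are encoded as nested tuples with the tags shown in the comments, only atomic roles are supported, and the interpretation is given by two dictionaries mapping concept and role names to their extensions. The toy interpretation at the end is hypothetical.

```python
def successors(x, rel):
    """All y with (x, y) in the role extension rel."""
    return {y for (u, y) in rel if u == x}

def ext(c, domain, concepts, roles):
    """Extension C^I of a concept c, following the rows of Table 2.1.

    Assumed encoding: ("top",), ("bottom",), ("atom", A), ("not", C),
    ("and", C, D), ("or", C, D), ("some", R, C), ("all", R, C),
    ("atleast", n, R, C), ("atmost", n, R, C), ("self", R).
    """
    tag = c[0]
    if tag == "top":
        return set(domain)
    if tag == "bottom":
        return set()
    if tag == "atom":
        return set(concepts.get(c[1], set()))
    if tag == "not":
        return set(domain) - ext(c[1], domain, concepts, roles)
    if tag == "and":
        return ext(c[1], domain, concepts, roles) & ext(c[2], domain, concepts, roles)
    if tag == "or":
        return ext(c[1], domain, concepts, roles) | ext(c[2], domain, concepts, roles)
    if tag == "self":
        return {x for x in domain if (x, x) in roles.get(c[1], set())}
    if tag in ("some", "all"):
        rel = roles.get(c[1], set())
        filler = ext(c[2], domain, concepts, roles)
        if tag == "some":
            return {x for x in domain if successors(x, rel) & filler}
        return {x for x in domain if successors(x, rel) <= filler}
    if tag in ("atleast", "atmost"):
        n, rel = c[1], roles.get(c[2], set())
        filler = ext(c[3], domain, concepts, roles)
        if tag == "atleast":
            return {x for x in domain if len(successors(x, rel) & filler) >= n}
        return {x for x in domain if len(successors(x, rel) & filler) <= n}
    raise ValueError(f"unknown construct: {tag}")

# Hypothetical finite interpretation.
domain = {"a", "b", "c"}
concepts = {"A": {"a", "b"}, "B": {"b"}}
roles = {"R": {("a", "b"), ("a", "c"), ("b", "b")}}

print(ext(("some", "R", ("atom", "B")), domain, concepts, roles))
print(ext(("all", "R", ("atom", "B")), domain, concepts, roles))
```

An axiom C ⊑ D then holds in the interpretation exactly when the computed extension of C is a subset of that of D.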

Definition 5 (OWL Interpretation). An OWL interpretation w.r.t. a signature ⟨ΣC, ΣP, ΣI, ΣL⟩ is a structure I_owl = (Δ^I, ·^I), where Δ^I is a set called the domain of I_owl, and ·^I is the valuation function s.t. for each A ∈ ΣC, A^I ⊆ Δ^I, for each P ∈ ΣP, P^I ⊆ Δ^I × Δ^I, for each a ∈ ΣI, a^I ∈ Δ^I, and for each l ∈ ΣL, l^I = l ∈ Δ^I.

Definition 6 (Model of an OWL Ontology). An OWL interpretation I_owl = (Δ^I, ·^I) w.r.t. a signature ⟨ΣC, ΣP, ΣI, ΣL⟩ is said to be a model of an OWL ontology O = ⟨T, R, A⟩, in symbols I_owl |=owl O, iff I_owl |=owl St for every assertion St ∈ T ∪ R ∪ A, as per the conditions in Table 2.1.

Definition 7 (OWL entailment). An OWL ontology O OWL-entails a statement St, in symbols O |=owl St, iff for any OWL interpretation I_owl, if I_owl |=owl O, then I_owl |=owl St. An OWL ontology O OWL-entails another OWL ontology O′ iff O |=owl St for every St ∈ O′.

Thanks to the popularity of the OWL language, a large number of the graphs published in the SW make extensive use of the OWL vocabulary. However, due to the high computational complexity of reasoning with expressive OWL ontologies, web-scale reasoning became impractical. Checking concept satisfiability in OWL 2 DL is 2NEXPTIME-hard, and no practical, sound, and complete algorithm is known yet for conjunctive query answering over OWL DL and OWL 2 DL. Consequently, less expressive profiles of the OWL language were derived.

2.1.3

OWL 2 RL Profile

The OWL 2 RL profile [76] is a part of the OWL 2 [51] standard that enables the use of a substantial part of the OWL vocabulary in an ontology, yet allows efficient reasoning and query answering. For simplicity, we exclude concrete domains, data properties, and key assertions. Given the set of concept names ΣC, role names ΣP, and individual names ΣI, OWL 2 RL concepts are defined by the following productions:

lc := A | {a} | lc ⊓ lc | lc ⊔ lc | ∃R.lc | ∃R.⊤
rc := A | ¬lc | rc ⊓ rc | ∀R.rc | ∃R.{a} | ≤ nR.lc | ≤ nR.⊤
mc := A | ∃P.{a} | mc ⊓ mc

where A ∈ ΣC, a ∈ ΣI, R is an OWL 2 RL role (see below), and n = 0, 1. OWL 2 RL concept axioms are of the form:

LC ⊑ RC

where LC := lc | mc and RC := rc | mc. OWL 2 RL roles are given by the production R := P | P⁻, where P ∈ ΣP. OWL 2 RL property axioms are of one of the following forms:

R ⊑ R | domain(R, rc) | range(R, rc) | disjoint(R, R) | functional(R) | inversefunctional(R) | symmetric(R) | asymmetric(R) | transitive(R) | irreflexive(R)

OWL 2 RL individual axioms are of the form C(a) or R(a, b), where a, b ∈ ΣI, C is an OWL 2 RL concept of the form rc, and R is an OWL 2 RL property. An OWL 2 RL ontology O is a triple ⟨T, R, A⟩, where T is a set of OWL 2 RL concept axioms, R is a set of OWL 2 RL property axioms, and A is a set of OWL 2 RL individual axioms. An OWL 2 RL graph is an OWL 2 RL ontology translated to an RDF graph using the standard translation defined in OWL 2 Mapping to RDF Graphs [80]. The OWL 2 RL RDF rules [76] are a partial axiomatization of OWL 2 RL. This set of rules provides axiomatizations for OWL constructs such as owl:intersectionOf, owl:unionOf, and owl:complementOf, which are not provided by OWL-Horst. Although the deductive closure of any graph g w.r.t. these rules can be computed in PTIME, Kroetzsch [61] showed that the set of rules is incomplete for the OWL 2 RL fragment of OWL for reasoning tasks such as computing subsumptions, which is co-NP-hard.

2.1.4

OWL 2 EL Profile

The EL fragment of OWL 2 is intended for applications that demand a fair amount of terminological expressivity, yet require tractable reasoning services for subsumption checking and instance checking. The description logic EL [4] is the foundation of the OWL 2 EL fragment. For simplicity, we exclude concrete domains, data properties, equality assertions, and R-boxes. Given the set of concept names ΣC, role names ΣP, and individual names ΣI, EL concepts can be described as follows:

C := A | C ⊓ C | ∃R.C | {a} | ⊤ | ⊥

[Figure omitted. Panels: (a) A finite model; (b) An infinite model. Node labels: a, Guard; edge label: shields.]

Figure 2.2: Finite and infinite model of an EL ontology

where A ∈ ΣC, R ∈ ΣP, a ∈ ΣI. An EL T-box consists of statements of the form C ⊑ D, where C, D are EL concepts.

Example 8. Consider the EL ontology with the following set of statements:

Guard ⊑ ∃shields.Guard
Guard(a)

One can see that the models depicted in Fig. 2.2a and Fig. 2.2b satisfy the above set of statements. The model in Fig. 2.2b has infinitely many objects in its domain, whereas the model in Fig. 2.2a has only a finite number of objects. A model that has a finite number of objects in its domain is called a finite model. An ontology is said to be finitely satisfiable if there exists a finite model that satisfies it. Calvanese [24] showed that there are DL ontologies that do not have any finite models. We refer the reader to Fig. A.1 in the appendix for a concrete example. It can be noted that the ontology discussed in Example 8 has an infinite chase (see further), and entails the conjunctive query ∃y1, ..., yn ⋀_{i=1,...,n−1} shields(yi, yi+1), for any n ∈ N.

2.1.5

OWL-Horst Extension to RDF

OWL-Horst [90] is an extension of RDFS: a fragment with graph-based semantics and a sound and complete axiomatization, which is nevertheless tractable for entailment problems such as subsumption checking and instance checking. The OWL-Horst semantics is an extension of the RDFS semantics that defines semantic conditions for a subset of the terms in the OWL vocabulary. These include class assertions such as OWL restrictions (universal, existential, and value restrictions), disjointness of classes and properties, property assertions such as symmetry, transitivity, functionality, and inverse relations of properties, and assertions involving owl:sameAs and owl:differentFrom. As with RDF(S), any ontology serialized as a graph can be interpreted using the OWL-Horst semantics. An OWL-Horst interpretation structure is an RDFS interpretation structure with additional semantic constraints [90]. The class of OWL-Horst interpretation structures is hence a subset of the class of RDFS interpretation structures. OWL-Horst has a set of inference rules that are sound and complete w.r.t. its semantics, s.t. for any OWL-Horst graph g, its deductive closure, owl-horst-closure(g), can be computed by repeatedly applying the set of OWL-Horst inference rules to g until a fix-point is reached, which is guaranteed to exist and is finite. OWL-Horst reasoning for a graph can be characterized with the help of an OWL-Horst canonical model, which is an OWL-Horst model that represents all the OWL-Horst models of a graph, and is defined as follows:

Definition 9 (OWL-Horst Canonical Model). For any OWL-Horst graph g, its canonical model can_owl-horst(g) = ⟨IR_can, IP_can, IC_can, IEXT_can, ICEXT_can, IS_can, LV_can⟩, where the subscript can abbreviates can_owl-horst(g), is an OWL-Horst interpretation structure constructed as follows:
• LV_can = {l | l is a plain literal and l occurs in owl-horst-closure(g)} ∪ {dv(l) | l is a datatyped literal occurring in owl-horst-closure(g), where dv(l) is the data value of l}
• IP_can = {P | (P, rdf:type, rdf:Property) ∈ owl-horst-closure(g)}
• IC_can = {C | (C, rdf:type, rdfs:Class) ∈ owl-horst-closure(g)}
• IR_can = LV_can ∪ IP_can ∪ IC_can ∪ {a | (a, rdf:type, rdfs:Resource) ∈ owl-horst-closure(g)}
• IS_can = {(a, a) | a is any URI, blank node, or plain literal that occurs in owl-horst-closure(g)} ∪ {(l, dv(l)) | l is a datatyped literal occurring in owl-horst-closure(g), whose data value is dv(l)}
• for every P ∈ IP_can, IEXT_can(P) = {(s, o) | (s, P, o) ∈ owl-horst-closure(g)}
• for every C ∈ IC_can, ICEXT_can(C) = {a | (a, rdf:type, C) ∈ owl-horst-closure(g)}

Consistency, as defined in [90] for an OWL-Horst graph, determines whether the graph has clashes or not. A clash, denoted by the symbol FALSE, can result from invalid datatyped literals, or from the simultaneous presence of conflicting statements such as (a, owl:sameAs, b) and (a, owl:differentFrom, b). A graph g is said to be OWL-Horst inconsistent if g |=owl-horst FALSE, and is otherwise said to be OWL-Horst consistent. For any two OWL-Horst consistent graphs g, h, the following are true:
• can_owl-horst(g) |=owl-horst g
• can_owl-horst(g) can be computed in PTIME
• g |=owl-horst h iff can_owl-horst(g) |=simple h.
The proofs of these facts can be found in Horst [90].

2.1.6

Translations of OWL Statements to First Order Logic Statements

Note that any OWL statement (or RDF statement) can be translated to a first-order logic sentence. The function T_x given in Table 2.2 defines the first-order translation of common DL constructs.

DL-construct   T_x
A              A(x)
∀R.C           ∀y.R(x, y) → T_y(C)
∃R.C           ∃y.R(x, y) ∧ T_y(C)
¬C             ¬T_x(C)
≥ nR.C         ∃y1, ..., yn. R(x, y1) ∧ ... ∧ R(x, yn) ∧ ⋀_{i=1,...,n} T_yi(C) ∧ ⋀_{1≤i≠j≤n} yi ≠ yj
≤ n−1 R.C      ∀y1, ..., yn. [R(x, y1) ∧ ... ∧ R(x, yn) ∧ ⋀_{i=1,...,n} T_yi(C)] → ⋁_{1≤i≠j≤n} yi = yj
∃R.Self        R(x, x)
Note: y is a fresh variable, and R is an atomic role.

Table 2.2: First order translation of DL concepts

C ⊑ D          ∀x.T_x(C) → T_x(D)
R ⊑ S          ∀x, y.R(x, y) → S(x, y)
C(a)           C(a)
R(a, b)        R(a, b)

Table 2.3: First order translation of DL statements for DLs with simple roles

Example 10. Consider the EL T-box statement A ⊓ ∃R.B ⊑ ∃S.C ⊓ D, where A, B, C, D ∈ ΣC and R, S ∈ ΣP. Using the translations given in Tables 2.2 and 2.3, one obtains:

∀x [A(x) ∧ ∃z (R(x, z) ∧ B(z)) → ∃y (S(x, y) ∧ C(y)) ∧ D(x)]

Since z does not occur in A(x) and y does not occur in D(x), extending the scope of the quantifiers results in

∀x [∃z (A(x) ∧ R(x, z) ∧ B(z)) → ∃y (S(x, y) ∧ C(y) ∧ D(x))]

which can be written as

∀x [¬∃z (A(x) ∧ R(x, z) ∧ B(z)) ∨ ∃y (S(x, y) ∧ C(y) ∧ D(x))]

which is equivalent to

∀x∀z [¬(A(x) ∧ R(x, z) ∧ B(z)) ∨ ∃y (S(x, y) ∧ C(y) ∧ D(x))]

which can be written in rule form as:

∀x∀z [(A(x) ∧ R(x, z) ∧ B(z)) → ∃y (S(x, y) ∧ C(y) ∧ D(x))]

Note that OWL profiles such as EL, QL, and RL are fragments of a larger family of DLs called Horn DLs [63]. Any Horn DL ontology has a unique chase, and can be translated to a semantically equivalent set of forall-existential rules [32].
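The first step of Example 10, the application of T_x, can be mechanized. The following Python sketch covers only the atomic, ¬, ∃, ∀ and ∃R.Self rows of Table 2.2, plus conjunction, which Table 2.2 does not list and which is assumed here to translate componentwise. The tuple encoding of concepts, the function names, and the variable naming scheme y1, y2, ... are assumptions of the sketch.

```python
import itertools

def translate(c, x, fresh):
    """T_x(c) as a string; fresh is an iterator of unused variable names."""
    tag = c[0]
    if tag == "atom":                      # A        ->  A(x)
        return f"{c[1]}({x})"
    if tag == "not":                       # ¬C       ->  ¬T_x(C)
        return f"¬{translate(c[1], x, fresh)}"
    if tag == "and":                       # assumed: T_x(C) ∧ T_x(D)
        left = translate(c[1], x, fresh)
        right = translate(c[2], x, fresh)
        return f"({left} ∧ {right})"
    if tag == "some":                      # ∃R.C     ->  ∃y.R(x,y) ∧ T_y(C)
        y = next(fresh)
        return f"∃{y}.({c[1]}({x},{y}) ∧ {translate(c[2], y, fresh)})"
    if tag == "all":                       # ∀R.C     ->  ∀y.R(x,y) → T_y(C)
        y = next(fresh)
        return f"∀{y}.({c[1]}({x},{y}) → {translate(c[2], y, fresh)})"
    if tag == "self":                      # ∃R.Self  ->  R(x,x)
        return f"{c[1]}({x},{x})"
    raise ValueError(f"unsupported construct: {tag}")

def translate_subsumption(c, d):
    """C ⊑ D  ->  ∀x.(T_x(C) → T_x(D)), as in Table 2.3."""
    fresh = (f"y{i}" for i in itertools.count(1))
    body = translate(c, "x", fresh)
    head = translate(d, "x", fresh)
    return f"∀x.({body} → {head})"

# The T-box statement of Example 10: A ⊓ ∃R.B ⊑ ∃S.C ⊓ D
lhs = ("and", ("atom", "A"), ("some", "R", ("atom", "B")))
rhs = ("and", ("some", "S", ("atom", "C")), ("atom", "D"))
formula = translate_subsumption(lhs, rhs)
print(formula)
```

The sketch stops at the direct translation; the quantifier movements that bring the formula into rule form are the manual steps shown in Example 10.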

2.1.7

Forall-Existential (∀∃) Rules

∀∃ rules (also known as Datalog± rules [20] or tuple generating dependencies (tgds) [12]), a fragment of first-order logic, are a popular language for describing ontologies in a rule-based format. This field, which is currently of extensive research interest, has given rise to a large number of ∀∃ rule classes of varying computational complexity. Besides, the RuleML initiative and its recently developed language RuleLog (http://ruleml.org/rif/rulelog/spec/Rulelog.html) are gaining popularity in the SW communities as a rule-based KR language, and have their foundations in ∀∃ rules. For any vector or sequence x⃗, we denote by ‖x⃗‖ the number of symbols in x⃗, and by {x⃗} the set of symbols in x⃗. A ∀∃ rule is a first-order formula of the form:

∀x⃗∀z⃗ [p1(x⃗, z⃗) ∧ ... ∧ pn(x⃗, z⃗) → ∃y⃗ p′1(x⃗, y⃗) ∧ ... ∧ p′m(x⃗, y⃗)]    (2.1)

where x⃗, y⃗, z⃗ are vectors of variables s.t. {x⃗}, {y⃗} and {z⃗} are pairwise disjoint, pi(x⃗, z⃗), for 1 ≤ i ≤ n, are predicate atoms whose variables are from x⃗ or z⃗, and p′i(x⃗, y⃗), for 1 ≤ i ≤ m, are predicate atoms whose variables are from x⃗ or y⃗. Sometimes, we write a ∀∃ rule r as φ(r)(x⃗, z⃗) → ψ(r)(x⃗, y⃗), or as φ(x⃗, z⃗) → ψ(x⃗, y⃗) when r is implicit from the context. Also note that φ(r)(x⃗, z⃗) = φ(x⃗, z⃗) = {p1(x⃗, z⃗), ..., pn(x⃗, z⃗)} and ψ(r)(x⃗, y⃗) = ψ(x⃗, y⃗) = {p′1(x⃗, y⃗), ..., p′m(x⃗, y⃗)}. A set of ∀∃ rules is called a ∀∃ rule set. Checking entailment over ∀∃ rule sets is undecidable in general [12]. Various decidable subclasses with associated entailment procedures have been derived lately. A few examples of these subclasses are linear ∀∃ rules [56], (weakly) guarded rules [21], (weakly) frontier guarded rules [6], jointly frontier guarded rules [60], 'sticky' rules [23], and weakly acyclic rules [39, 34].

2.2

Query Answering over Ontologies

Let V be the set of variables; any element of the set CV = V ∪ C is a term. Any (s, p, o) ∈ CV × CV × CV is called a triple pattern. A triple pattern t whose variables are elements of the vectors x⃗ or y⃗ is written as t(x⃗, y⃗). For any function f : A → B, the restriction of f to a set A′ is the mapping f|A′ from A′ ∩ A to B s.t. f|A′(a) = f(a), for each a ∈ A ∩ A′. For any triple pattern t = (s, p, o) and a function µ from V to a set A, t[µ] denotes (µ′(s), µ′(p), µ′(o)), where µ′ is an extension of µ to C s.t. µ′|C is the identity function. For any set of triple patterns G, G[µ] denotes ⋃_{t∈G} t[µ]. For any vector of constants a⃗ = ⟨a1, ..., a_‖a⃗‖⟩ and vector of variables x⃗ of the same length, x⃗/a⃗ is the function µ s.t. µ(xi) = ai, for 1 ≤ i ≤ ‖a⃗‖. We use the notation t(a⃗, y⃗) to denote t(x⃗, y⃗)[x⃗/a⃗]. In this discussion, we limit ourselves to the class of conjunctive queries (CQs), which are also called select-project-join queries. It is well known that most of the queries that users pose to DBs/knowledge bases (KBs) are CQs. The only logical operators in CQs are conjunctions and existential quantifiers; they do not contain negations, universal quantification, or function symbols. Since any OWL ontology serialized in a non-graphical syntax (for instance, in functional style) can be translated using the standard map provided in Patel-Schneider et al. [80] and represented as a graph in RDF/XML syntax, and any CQ over an OWL ontology can be translated to a graphical CQ (a conjunction of triple patterns) using the same map, we limit ourselves to graphical CQs.

Example 11. Consider an OWL ontology O whose statements in DL-style syntax are as follows:

Champion ⊑ ∃hasWon.(Tournament ⊔ Championship)
Champion(ferrari)

The mapping of O obtained by the standard OWL to RDF mapping M given in Patel-Schneider et al. [80] is shown in Fig. 2.3. Note that _:b1, _:b2, _:b3, _:b4 are auxiliary blank nodes introduced in the translation.

[Figure omitted. Node labels: ferrari, Champion, hasWon, Tournament, Championship, owl:Restriction, rdf:Seq, rdf:nil, _:b1, _:b2, _:b3, _:b4; edge labels: rdf:type, owl:subClassOf, owl:onProperty, owl:someValuesFrom, owl:UnionOf, rdf:first, rdf:rest.]

Figure 2.3: RDF graph translation of OWL ontology

Consider the following CQ Q over the ontology O:

∃z Champion(x) ∧ hasWon(x, z) ∧ Tournament(z)

It is easy to see that Q can be translated to the following graphical CQ using the map M in Patel-Schneider et al. [80]:

∃z (x, rdf:type, Champion) ∧ (x, hasWon, z) ∧ (z, rdf:type, Tournament)

Note that for any boolean CQ Q over an OWL ontology O, O |=owl Q iff M(O) |=owl M(Q).
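To illustrate how the substitution notation t[µ] is used to evaluate a graphical CQ, the following Python sketch enumerates all assignments µ for which every triple pattern of a query, after substitution, occurs in a given graph. It evaluates the query against one explicit set of triples only, with no OWL reasoning, so it does not decide entailment from O. The "?" prefix for variables, the function names, and the individual monza2014 with its triples are hypothetical and serve only this example.

```python
def is_var(term):
    return term.startswith("?")

def match(patterns, graph, mu=None):
    """Yield each assignment mu such that t[mu] is in graph for all patterns t."""
    mu = dict(mu or {})
    if not patterns:
        yield mu
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        nu = dict(mu)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # bind the variable, or check it against an earlier binding
                if nu.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, nu)

def answers(free_vars, patterns, graph):
    """Project the matching assignments onto the free variables."""
    return {tuple(mu[v] for v in free_vars) for mu in match(patterns, graph)}

# Hypothetical data graph; only the vocabulary is taken from Example 11.
graph = {
    ("ferrari", "rdf:type", "Champion"),
    ("ferrari", "hasWon", "monza2014"),
    ("monza2014", "rdf:type", "Tournament"),
}

# Graphical CQ of Example 11, with free variable ?x and quantified ?z.
query = [
    ("?x", "rdf:type", "Champion"),
    ("?x", "hasWon", "?z"),
    ("?z", "rdf:type", "Tournament"),
]

print(answers(["?x"], query, graph))
```

Over the graph M(O) alone the same query would have no explicit match, since O only states that ferrari has won something that is a Tournament or a Championship.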

Definition 12 (Conjunctive query (CQ)). A CQ Q(x⃗) is an expression of the form ∃y⃗ t1(x⃗, y⃗) ∧ ... ∧ tp(x⃗, y⃗), where ti(x⃗, y⃗), for i = 1, ..., p, are triple patterns over the vector of free variables x⃗ and the vector of quantified variables y⃗. A CQ is called a boolean CQ if it does not have any free variables. Let a⃗ be a vector such that ai ∈ U ∪ L and ‖x⃗‖ = ‖a⃗‖; for any query Q(x⃗), the result of applying x⃗/a⃗ to Q(x⃗) is denoted by Q(a⃗). For any CQ Q(x⃗) and a vector a⃗ with ‖x⃗‖ = ‖a⃗‖, Q(a⃗) is boolean. A vector a⃗ is an answer for Q(x⃗) w.r.t. an interpretation I = ⟨Δ^I, ·^I⟩ if there exists an assignment µ from the set of existential variables {y⃗} to the domain Δ^I s.t. I |= ti(a⃗, y⃗)[µ], for every ti(a⃗, y⃗) ∈ Q(a⃗). A vector a⃗ is called a certain answer for Q(x⃗) w.r.t. a graph g iff I |= Q(a⃗) for every model I of g. For any graph g, CQ Q(x⃗), and vector a⃗, the decision problem (DP) of checking whether g |= Q(a⃗) is called the CQ entailment problem. The CQ entailment problem is NP-complete for RDFS [82], whereas the complexity of CQ answering is still an open problem for OWL 1 DL and OWL 2 DL [51], and no sound and complete algorithms are known yet for deciding CQ entailment for them.

2.2.1

Chase of an Ontology

In the literature, query answering over an ontology is often done by computing the chase [27, 56, 1] of the ontology. A chase of an ontology is a deductive closure of the ontology, and the algorithm that computes the chase is often referred to as the chase algorithm. For any ontology O, its chase chase(O) is a universal model [35] of the ontology, i.e. chase(O) |= O and, for any model I of O, there exists a homomorphism h from chase(O) to I. Hence, for any boolean CQ Q(), O |= Q() iff chase(O) |= Q(). In the following, we show how the chase of an ontology can be constructed for a ∀∃ rules ontology. The technique can be straightforwardly extended to DLs such as OWL 2 EL, OWL 2 QL, and other fragments of Horn DLs (DLs for which a unique Herbrand model exists). For disjunctive ∀∃ rules (the extension of ∀∃ rules with disjunctive heads) and DLs that permit disjunctions on the right-hand side of subsumptions, Deutsch et al. [35] showed that a chase set [35, 70] can be devised for deciding CQ entailment. Various versions of the chase, adequate for different scenarios, have been derived for ∀∃ rule sets. We now summarize each of these.

Oblivious chase

For any ∀∃ rule r of the form (2.1), with a slight abuse of (Datalog) notation we write r as:

\[ p_1(\vec{x}, \vec{z}), \dots, p_n(\vec{x}, \vec{z}) \rightarrow p'_1(\vec{x}, \vec{y}), \dots, p'_m(\vec{x}, \vec{y}) \tag{2.2} \]

Let $B_{sk}$ be a fresh set of blank nodes called Skolem blank nodes. For any ∀∃ rule r of the form (2.2) and an assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \to C$, the function apply(r, µ) is defined as follows:

\[ apply(r, \mu) = head(r)[\mu^{ext(\vec{y})}] \]

where $\mu^{ext(\vec{y})}$ is an extension of µ s.t. $\mu^{ext(\vec{y})}(y_i)$ is a distinct fresh Skolem blank node from $B_{sk}$, for each $y_i \in \{\vec{y}\}$. For any ∀∃ rule r of the form (2.2), a set of instances A, and an assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \to C$, the boolean function Oapplicable(r, µ, A) is defined as follows:

\[ Oapplicable(r, \mu, A) = \begin{cases} True, & \text{if } body(r)[\mu] \subseteq A; \\ False, & \text{otherwise.} \end{cases} \]

For any ∀∃ rule set R and a set of instances A, let

\[ O\Sigma(R, A) = \{ (r, \mu) \mid Oapplicable(r, \mu, A) = True \} \]

Let $Ochase_0(R) = \{ \psi(\vec{x}, \vec{y}) \mid r = (\rightarrow \psi(\vec{x}, \vec{y})) \in R \}$; for $i \in \mathbb{N}$,

\[ Ochase_{i+1}(R) = Ochase_i(R) \cup \bigcup_{(r, \mu) \in O\Sigma(R, Ochase_i(R))} apply(r, \mu) \]

The oblivious chase of R, denoted Ochase(R), is given as:

\[ Ochase(R) = \bigcup_{i \in \mathbb{N}} Ochase_i(R) \]
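The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's algorithm: atoms are tuples, variables are strings starting with `?`, and the helper names (`is_var`, `substitute`, `matches`, `oblivious_chase`) and the example predicates are hypothetical. Unlike the definition of $Ochase_{i+1}$, the sketch remembers which pairs (r, µ) have already fired, a common simplification that avoids the homomorphic-equivalence test as a stopping criterion; `max_rounds` guards against non-terminating rule sets.

```python
from itertools import count

def is_var(term):
    """Variables are strings starting with '?'; other terms are constants or blank nodes."""
    return isinstance(term, str) and term.startswith("?")

def substitute(atom, mu):
    """Apply the assignment mu to an atom (predicate, arg1, ..., argk)."""
    return (atom[0],) + tuple(mu.get(t, t) for t in atom[1:])

def matches(body, facts, mu=None):
    """Yield every assignment mu such that body[mu] is a subset of facts."""
    mu = mu or {}
    if not body:
        yield mu
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(mu)
        if all(new.setdefault(p, v) == v if is_var(p) else p == v
               for p, v in zip(first[1:], fact[1:])):
            yield from matches(rest, facts, new)

def oblivious_chase(rules, facts, max_rounds=10):
    """Each rule is (body, head, existential_vars); every trigger (r, mu) fires once."""
    facts = set(facts)
    fresh = count()          # source of fresh Skolem blank node ids
    fired = set()            # triggers already applied (assumption of this sketch)
    for _ in range(max_rounds):
        derived = set()
        for idx, (body, head, ex_vars) in enumerate(rules):
            for mu in matches(body, facts):
                trigger = (idx, tuple(sorted(mu.items())))
                if trigger in fired:
                    continue
                fired.add(trigger)
                ext = dict(mu)
                for y in ex_vars:
                    ext[y] = f"_:b{next(fresh)}"   # distinct fresh blank node
                derived |= {substitute(a, ext) for a in head}
        if derived <= facts:
            return facts
        facts |= derived
    return facts

# employee(x) -> worksFor(x, y), company(y)
rules = [([("employee", "?x")],
          [("worksFor", "?x", "?y"), ("company", "?y")],
          ["?y"])]
result = oblivious_chase(rules, {("employee", "alice"), ("employee", "bob")})
```

Each of the two employees triggers the rule once, so `result` holds the two input facts plus one `worksFor` and one `company` fact per employee.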

We say that two sets of instances A and B are equivalent, denoted A ≡ B, iff there exist homomorphisms $h_1$ and $h_2$ s.t. $A[h_1] \subseteq B$ and $B[h_2] \subseteq A$. Intuitively, $Ochase_i(R)$ can be thought of as the state of Ochase(R) at the end of iteration i. In the oblivious case, the termination condition is given by: if $\exists i$ s.t. $Ochase_i(R) \equiv Ochase_{i+1}(R)$, then $Ochase(R) = Ochase_i(R)$. Hence, an algorithm that computes the oblivious chase incurs, at each iteration, the overhead of checking the equivalence of the current chase state with the previous chase state. Note that the complexity of checking the equivalence of two sets of instances is worst-case exponential in the size of the instances.

Skolem chase

We now show how the Skolem chase given in works such as Marnette [69] and Cuenca Grau et al. [32] is constructed. For any ∀∃ rule r of the form (2.1), the skolemization sk(r) is the result of replacing each $y_i \in \{\vec{y}\}$ with a globally unique Skolem function $f_i^r$, s.t. $f_i^r : C^{\|\vec{x}\|} \to B_{sk}$. Intuitively, for every distinct vector $\vec{a}$ of constants with $\|\vec{a}\| = \|\vec{x}\|$, $f_i^r(\vec{a})$ is a fresh blank node whose node id is a hash of $\vec{a}$. Let $\vec{f}^r = \langle f_1^r, \dots, f_{\|\vec{y}\|}^r \rangle$ be a vector of distinct Skolem functions. For any ∀∃ rule r of the form (2.1), with a slight abuse of notation we write its skolemization sk(r) as follows:

\[ p_1(\vec{x}, \vec{z}), \dots, p_n(\vec{x}, \vec{z}) \rightarrow p'_1(\vec{x}, \vec{f}^r), \dots, p'_m(\vec{x}, \vec{f}^r) \tag{2.3} \]

Moreover, any skolemized ∀∃ rule r of the form (2.3) can be replaced by the following equivalent set of formulas, whose size is worst-case quadratic w.r.t. the size of r:

\[ \{\, p_1(\vec{x}, \vec{z}), \dots, p_n(\vec{x}, \vec{z}) \rightarrow p'_1(\vec{x}, \vec{f}^r),\ \dots,\ p_1(\vec{x}, \vec{z}), \dots, p_n(\vec{x}, \vec{z}) \rightarrow p'_m(\vec{x}, \vec{f}^r) \,\} \tag{2.4} \]

Note that each BR in the above set has exactly one predicate atom, with optional function symbols, in the head. Also note that a ∀∃ rule without function symbols can be replaced with a set of ∀∃ rules with single-atom heads. Hence, w.l.o.g., we assume that any ∀∃ rule in a skolemized set sk(R) of ∀∃ rules is of the form (2.4). For any set of instances A and a skolemized ∀∃ rule r of the form (2.4), the application of r on A, denoted by r(A), is given as:

\[ r(A) = \bigcup_{\mu \in V \to C} \{\, p'_1(\vec{x}, \vec{f}^r)[\mu] \mid p_1(\vec{x}, \vec{z})[\mu] \in A, \dots, p_n(\vec{x}, \vec{z})[\mu] \in A \,\} \]

For any set of skolemized ∀∃ rules R, the application of R on A is given by:

\[ R(A) = \bigcup_{r \in R} r(A) \]

For any ∀∃ rule set R, the generating BRs $R_F$ are the BRs in sk(R) with function symbols, and the non-generating BRs are the set $R_I = sk(R) \setminus R_F$. Let $Schase_0(R) = \{ \psi(\vec{x}, \vec{f}) \mid r = (\rightarrow \psi(\vec{x}, \vec{f})) \in sk(R) \}$; for $i \in \mathbb{N}$,

\[ Schase_{i+1}(R) = \begin{cases} Schase_i(R) \cup R_I(Schase_i(R)), & \text{if } R_I(Schase_i(R)) \not\subseteq Schase_i(R); \\ Schase_i(R) \cup R_F(Schase_i(R)), & \text{otherwise.} \end{cases} \]

The Skolem chase of R, denoted Schase(R), is given as:

\[ Schase(R) = \bigcup_{i \in \mathbb{N}} Schase_i(R) \]

Intuitively, $Schase_i(R)$ can be thought of as the state of Schase(R) at the end of iteration i. In the Skolem case, the termination condition is simpler and is given by: if $\exists i$ s.t. $Schase_i(R) = Schase_{i+1}(R)$, then $Schase(R) = Schase_i(R)$. Note that if the Skolem chase of an ontology terminates, then so do the restricted chase and the core chase of the ontology [32].
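The effect of skolemization on termination checking can be illustrated with a short Python sketch. It is an illustration under assumptions, not the construction of [69] or [32]: a Skolem term $f_i^r(\vec{a})$ is encoded as the tuple `("sk", rule_index, variable, frontier_values)`, all rules are applied in every round (the priority that the definition gives to non-generating rules is not reproduced), and the helper names and example predicates are hypothetical. Because re-applying a rule to the same frontier values yields the same term, plain set equality suffices as a stopping test.

```python
def is_var(term):
    """Variables are strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def substitute(atom, mu):
    return (atom[0],) + tuple(mu.get(t, t) for t in atom[1:])

def matches(body, facts, mu=None):
    """Yield every assignment mu such that body[mu] is a subset of facts."""
    mu = mu or {}
    if not body:
        yield mu
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(mu)
        if all(new.setdefault(p, v) == v if is_var(p) else p == v
               for p, v in zip(first[1:], fact[1:])):
            yield from matches(rest, facts, new)

def variables(atoms):
    return {t for a in atoms for t in a[1:] if is_var(t)}

def skolem_chase(rules, facts, max_rounds=20):
    """Return (facts, terminated); each rule is (body, head, existential_vars)."""
    facts = set(facts)
    for _ in range(max_rounds):
        derived = set()
        for idx, (body, head, ex_vars) in enumerate(rules):
            frontier = sorted(variables(body) & variables(head))
            for mu in matches(body, facts):
                args = tuple(mu[x] for x in frontier)
                ext = dict(mu)
                for y in ex_vars:
                    ext[y] = ("sk", idx, y, args)   # deterministic Skolem term
                derived |= {substitute(a, ext) for a in head}
        if derived <= facts:                        # syntactic fixpoint reached
            return facts, True
        facts |= derived
    return facts, False

terminating = [
    ([("employee", "?x")], [("worksFor", "?x", "?y"), ("company", "?y")], ["?y"]),
    ([("company", "?c")], [("organisation", "?c")], []),
]
closure, finished = skolem_chase(terminating, {("employee", "alice")})

# person(x) -> hasParent(x, y), person(y) generates ever deeper Skolem terms.
cyclic = [([("person", "?x")], [("hasParent", "?x", "?y"), ("person", "?y")], ["?y"])]
partial, cyclic_finished = skolem_chase(cyclic, {("person", "alice")}, max_rounds=5)
```

The first rule set reaches a fixpoint; the second never does, so the sketch stops at the round limit and reports `False`.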

Core chase

The core chase is a slight variant of the oblivious chase in which the core of the chase result is computed at each iteration. For a set of instances A, its core core(A) is a minimal subset of A that is equivalent to A [6]. Note that multiple cores of a set of instances are (homomorphically) equivalent [35]. Let $Cchase_0(R) = core(Ochase_0(R))$; for $i \in \mathbb{N}$,

\[ Cchase_{i+1}(R) = core\Big( Cchase_i(R) \cup \bigcup_{(r, \mu) \in O\Sigma(R, Cchase_i(R))} apply(r, \mu) \Big) \]

The core chase of R, denoted Cchase(R), is given as:

\[ Cchase(R) = \bigcup_{i \in \mathbb{N}} Cchase_i(R) \]
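To make the notion of core(A) concrete, the following Python sketch computes a core of a small set of instances by brute force. It is only an illustration under assumptions: blank nodes are strings starting with `_:`, constants are mapped to themselves, and the function names and example facts are hypothetical. The search enumerates all mappings of blank nodes to terms, so it is exponential and suitable only for tiny inputs.

```python
from itertools import product

def is_null(term):
    """Blank nodes are encoded as strings starting with '_:' (assumption of this sketch)."""
    return isinstance(term, str) and term.startswith("_:")

def core(facts):
    """Shrink facts while some endomorphism maps it onto a proper subset of itself."""
    facts = set(facts)
    while True:
        terms = sorted({t for f in facts for t in f[1:]})
        nulls = [t for t in terms if is_null(t)]
        for image in product(terms, repeat=len(nulls)):
            h = dict(zip(nulls, image))              # constants stay fixed
            mapped = {(f[0],) + tuple(h.get(t, t) for t in f[1:]) for f in facts}
            if mapped < facts:                       # proper retraction found
                facts = mapped
                break
        else:
            return facts                             # no proper retraction exists

instances = {
    ("worksFor", "alice", "_:b0"), ("company", "_:b0"),
    ("worksFor", "alice", "acme"), ("company", "acme"),
}
minimal = core(instances)
```

Here the blank node `_:b0` can be mapped to `acme`, so the two facts mentioning it are redundant and are dropped from `minimal`.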

Intuitively, $Cchase_i(R)$ can be thought of as the state of Cchase(R) at the end of iteration i. The termination condition is given by: if $\exists i$ s.t. $Cchase_i(R) \equiv Cchase_{i+1}(R)$, then $Cchase(R) = Cchase_i(R)$. An algorithm that computes the core chase incurs, at each iteration, the overhead of checking the equivalence of the current chase state with the previous chase state. Note that, for any rule set R and each $i \in \mathbb{N}$, $Cchase_i(R) = core(Ochase_i(R))$.

Non-oblivious/Restricted chase

The restricted chase (also called non-oblivious chase) given in Fagin et al. [39] is a version of the chase in which a redundancy check is performed before rule application. A rule is applied only if the rule application is not redundant, i.e. the application of the rule does not lead to an equivalent set. Assume that there exists a strict linear order ≺ on the set of all instance sets. Cali et al. [21] give one such order based on the lexicographic order of the constants. Also, for any two rules r, r' and assignments µ, µ', let (r, µ) ≺ (r', µ') iff φ(r)[µ] ≺ φ(r')[µ'].

Given a ∀∃ rule set R, for any rule $r = \phi(r)(\vec{x}, \vec{z}) \rightarrow \psi(r)(\vec{x}, \vec{y}) \in R$ of the form (2.2), an assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \to C$, and a set of instances A, let $Napplicable_R$ be the least predicate inductively defined as follows: $Napplicable_R(r, \mu, A)$ holds if $\phi(r)[\mu] \subseteq A$, $\psi(r)[\mu''] \not\subseteq A$ for all $\mu'' \supseteq \mu$, and there exist no $r' \in R$ and $\mu'$, with $r' \neq r$ or $\mu' \neq \mu$, s.t. $(r', \mu') \prec (r, \mu)$ and $Napplicable_R(r', \mu', A)$.

Let $Nchase_0(R) = \{ \psi(\vec{x}, \vec{y}) \mid r = (\rightarrow \psi(\vec{x}, \vec{y})) \in R \}$; for $i \in \mathbb{N}$,

\[ Nchase_{i+1}(R) = \begin{cases} Nchase_i(R) \cup apply(r, \mu), & \text{if } Napplicable_R(r, \mu, Nchase_i(R)) \text{ holds for some } r \in R \text{ and assignment } \mu; \\ Nchase_i(R), & \text{otherwise.} \end{cases} \]

The non-oblivious chase of R, denoted Nchase(R), is given as:

\[ Nchase(R) = \bigcup_{i \in \mathbb{N}} Nchase_i(R) \]
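The redundancy check that distinguishes the restricted chase can be sketched in Python as follows. This is a sketch under assumptions rather than the procedure of [39] or [21]: one trigger is applied per step, the order ≺ is replaced by the first applicable trigger in rule order over lexicographically sorted facts, and the helper names and example predicates are hypothetical.

```python
from itertools import count

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def substitute(atom, mu):
    return (atom[0],) + tuple(mu.get(t, t) for t in atom[1:])

def matches(body, facts, mu=None):
    """Yield every extension of mu that maps all atoms of body into facts."""
    mu = mu or {}
    if not body:
        yield mu
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(mu)
        if all(new.setdefault(p, v) == v if is_var(p) else p == v
               for p, v in zip(first[1:], fact[1:])):
            yield from matches(rest, facts, new)

def head_satisfied(head, facts, mu):
    """True if some extension of mu already maps the head into facts."""
    return next(matches(head, facts, mu), None) is not None

def restricted_chase(rules, facts, max_steps=100):
    """Each rule is (body, head, existential_vars); one trigger is applied per step."""
    facts = set(facts)
    fresh = count()
    for _ in range(max_steps):
        trigger = next(((head, ex_vars, mu)
                        for body, head, ex_vars in rules
                        for mu in matches(body, sorted(facts))
                        if not head_satisfied(head, facts, mu)), None)
        if trigger is None:          # no non-redundant application is left
            return facts
        head, ex_vars, mu = trigger
        ext = {**mu, **{y: f"_:n{next(fresh)}" for y in ex_vars}}
        facts |= {substitute(a, ext) for a in head}
    return facts

rules = [([("employee", "?x")],
          [("worksFor", "?x", "?y"), ("company", "?y")],
          ["?y"])]
start = {("employee", "alice"), ("employee", "bob"),
         ("worksFor", "bob", "acme"), ("company", "acme")}
result = restricted_chase(rules, start)
```

The head is already satisfied for `bob` (with y mapped to `acme`), so only the trigger for `alice` creates a blank node; an oblivious chase would create one for each employee.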

Intuitively, $Nchase_i(R)$ can be thought of as the state of Nchase(R) at the end of iteration i. In the non-oblivious case, the termination condition is given by: if $\exists i$ s.t. $Nchase_i(R) = Nchase_{i+1}(R)$, then $Nchase(R) = Nchase_i(R)$. Hence, an algorithm that computes the non-oblivious chase just needs to detect, at each iteration, whether any new instances were added; if not, the computation of Nchase can be stopped.

2.2.2

Complexity Measures of Query Answering

Given a ∀∃ rule set R, it is common in practice to distinguish the instance part of R from the terminological part, and to study the complexity with respect to each of these two aspects independently. Hence, we distinguish the set of assertions $R_A$, given by $R_A = \{ \psi(\vec{x}, \vec{y}) \mid r = (\rightarrow \psi(\vec{x}, \vec{y})) \in R \}$, from the terminological part $R_T$, given by $R_T = \{ r \in R \mid \phi(r) \text{ is non-empty} \}$. Given a query Q over such a rule set R, the following three kinds of complexity measures are commonly used to evaluate the performance of query answering:

Query complexity: the complexity of query answering when the size of the ontology/KB (both the terminological part and the assertional part) over which the query is evaluated is fixed to a constant, and only the size of the query varies. Hence, while evaluating query complexity, we fix the size of R to be a constant, and the final complexity result is a function of the size of Q. In the context of DLs, query complexity is the complexity of query answering when both the T-box and the A-box are assumed to be of constant size.

Data complexity: in contrast to query complexity, data complexity is the complexity of query answering when only the instance part (assertions) $R_A$ varies, while both the schema (terminology) part $R_T$ and the query Q are assumed to be fixed to a constant. Hence, the complexity measure is computed as a function of the size of $R_A$. In the context of DLs, data complexity is the complexity of query answering when both the T-box and the query are assumed to be of constant size, and the A-box is the variable part of the final complexity function.

Combined complexity: nothing is fixed, hence the complexity measure is a function of all the components: the schema $R_T$, the instances $R_A$ and the query Q. In the case of DLs, all the components (T-box, A-box and query) are considered variable when analyzing combined complexity.

2.3

Computational Complexity Fundamentals

In the following, we give an overview of the basic notions of computational complexity necessary for grasping the complexity results of this thesis. For details on these topics, we refer the reader to books such as Goldreich [44] and Arora et al. [3].

Decision vs Search Problems

From the computational complexity point of view, it is very important to distinguish between yes/no problems and search/find problems. Decision problems (DPs) commonly occur in real-world, scientific, and industrial scenarios where the solver needs to find a boolean Yes/No answer. Well-known examples of DPs are:

Satisfiability problem: Given a set of propositional formulas, decide whether there exists an assignment of true/false values to the variables in the formulas for which every formula evaluates to true.

Hamiltonian path problem: Given a graph $G = \langle V, E \rangle$, decide whether there exists a path $p = \langle v_1, v_2, \dots, v_{|V|} \rangle$ s.t. $\{p\} = V$ and $(v_i, v_{i+1}) \in E$, for $i = 1, \dots, |V| - 1$. Intuitively, an instance of the problem asks for the existence of a path that passes through every vertex of G exactly once.

Prime problem: Decide whether a given natural number is prime or not.

Any DP P is represented by a set $S_P \subseteq \{0, 1\}^*$ that contains the Yes instances of the problem. Hence, given an instance $p \in \{0, 1\}^*$, the decision problem asks whether $p \in S_P$. An algorithm $A : \{0, 1\}^* \to \{true, false\}$ is said to solve the DP P iff, for any instance $p \in \{0, 1\}^*$,

\[ A(p) = \begin{cases} true, & \text{if } p \in S_P; \\ false, & \text{otherwise.} \end{cases} \]
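As a small illustration of an algorithm A that solves a DP, the following Python sketch decides two of the problems listed above by exhaustive search. It is a naive sketch meant only to show the true/false interface; the function names and the edge-list encoding of graphs are assumptions, and the Hamiltonian path check takes time factorial in the number of vertices.

```python
from itertools import permutations

def has_hamiltonian_path(vertices, edges):
    """Decide whether some ordering of all vertices follows the (directed) edges."""
    edge_set = set(edges)
    return any(
        all((p[i], p[i + 1]) in edge_set for i in range(len(p) - 1))
        for p in permutations(vertices)
    )

def is_prime(n):
    """Decide primality by trial division."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

chain = has_hamiltonian_path([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])   # a path 1-2-3-4 exists
star = has_hamiltonian_path([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])    # no path visits all vertices
```

Both functions return only a boolean: they answer the membership question $p \in S_P$ without producing a witness.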

Search problems are also common in real-world, scientific, and industrial scenarios. For any given instance of the problem, a solver needs to find a string that is an answer for the instance. Well-known examples are:

Shortest path problem: Given a weighted graph $G = \langle V, E, \lambda \rangle$, a source node $s \in V$, and a target node $t \in V$, find a path of minimal weight from s to t, or report that no path exists.

Prime factorization problem: Given a natural number N, find prime numbers $n_1, \dots, n_k$ s.t. $n_1 * \dots * n_k = N$.

A search problem R is often defined as a binary relation $R \subseteq \{0, 1\}^* \times \{0, 1\}^*$. For any instance $p \in \{0, 1\}^*$, $R(p) = \{ w \in \{0, 1\}^* \mid (p, w) \in R \}$ represents the set of solutions for p. An algorithm $A : \{0, 1\}^* \to \{0, 1\}^* \cup \{\bot\}$ is said to solve the search problem R iff, for any instance $p \in \{0, 1\}^*$,

\[ A(p) \begin{cases} \in R(p), & \text{if } R(p) \neq \emptyset; \\ = \bot, & \text{otherwise.} \end{cases} \]

Note that $\bot \notin \{0, 1\}^*$ is a distinguished symbol returned to indicate the absence of solutions. In practical cases, it is customary to assume that, for any search problem, the answer for any problem instance is of reasonable size, i.e. not extremely large. A search problem R is called polynomially bounded if $(p, w) \in R$ implies that the size of w is polynomially bounded in the size of p.

P vs NP question

One of the fundamental problems of computer science that has received widespread attention is the relation between the classes P and NP. The class P is the class of DPs that can be decided in polynomial time by a deterministic Turing machine (DTM). The class NP is the class of DPs that can be decided in polynomial time by a non-deterministic Turing machine (NTM). Since a DTM is a special kind of NTM, the relation P ⊆ NP obviously holds. Whether this containment is strict is still an open problem, often referred to as the P vs NP question. According to an alternative, equivalent definition of NP [44], NP is the class of decision problems for which there exists a polynomial-time proof procedure. That is, for any DP P in NP, there exists a polynomial-time procedure A s.t. p is a Yes instance of P, i.e. $p \in S_P$, iff there exists a polynomial-sized string w, called the NP-proof, s.t. A(p, w) = true. A DP P is in the class P iff there exists a polynomial-time procedure A s.t. for any instance p, p is a Yes instance of P, i.e. $p \in S_P$, iff A(p) = true. Hence, the P vs NP question also asks whether or not the existence of (reasonably sized) proofs adds to the efficiency of computation. Unsettled variations of the question also arise when considering higher classes of computation, and give rise to: EXPTIME vs NEXPTIME, 2EXPTIME vs 2NEXPTIME, and so on.

Undecidability/Unsolvability of Problems

A decision problem P is called decidable iff there exists an algorithm A that decides membership in the set $S_P$, i.e. for any instance $p \in \{0, 1\}^*$,

\[ A(p) = \begin{cases} true, & \text{if } p \in S_P; \\ false, & \text{otherwise.} \end{cases} \]

A problem is undecidable iff it is not decidable. Every problem instance can be encoded as a natural number, so any decision problem can be seen as a function that takes a natural number as input and returns 0 or 1. The set of decision problems therefore corresponds to the set of all (boolean) functions $\mathbb{N} \to \{0, 1\}$, whose cardinality is $2^{|\mathbb{N}|}$. Since any program has a description that can be seen as a natural number, the set of programs corresponds to a subset of the natural numbers. It is well known that $2^{|\mathbb{N}|}$ is strictly greater than $|\mathbb{N}|$, so the cardinality of the set of decision problems is strictly higher than the cardinality of the set of programs. Hence, there must be DPs for which there cannot exist programs that decide them.

Example 13. An example of an undecidable problem is the halting problem. The halting problem H is the following two-argument function:

\[ H(n_1, n_2) = \begin{cases} true, & \text{if program } n_1 \text{ on input } n_2 \text{ halts;} \\ false, & \text{otherwise,} \end{cases} \]

where $n_1, n_2 \in \mathbb{N}$. Turing showed, with a diagonalization argument in the style of Cantor, that the halting problem is undecidable.

Complexity Classes

Below, we briefly overview some of the well-known complexity classes, some of which are used henceforth. The following containment relations between classes are well known:

\[ AC^0 \subsetneq LOGSPACE \subseteq NLOGSPACE \subseteq PTIME \subseteq NP \subseteq EXPTIME \subseteq NEXPTIME \subseteq 2EXPTIME \]

Also, the following strict containment relations are known:

\[ PTIME \subsetneq EXPTIME \subsetneq 2EXPTIME \]

The class $AC^0$ is based on the circuit model of complexity. A (decision) problem belongs to $AC^0$ if it can be decided by a family of circuits of constant depth (with gates of unbounded fan-in) that have a polynomial number of gates w.r.t. the size of the input. An example of a problem in $AC^0$ from the DB/KR context is answering first-order queries over relational DBs, measured in data complexity. The class LOGSPACE is the class of (decision) problems that can be decided by a 2-tape DTM that receives its input string on a read-only input tape and uses an amount of space on the read-write work tape that is at most logarithmic in the input size. The class NLOGSPACE is similar, except that a non-deterministic Turing machine is assumed instead of a deterministic one. A typical problem that is in LOGSPACE (but not in $AC^0$) is undirected-graph reachability, i.e. the problem of determining whether a target node is reachable from a source node in an undirected graph. Similarly, a typical problem that is in NLOGSPACE is directed-graph reachability.

A problem is called hard for a class if every other problem in the class is reducible to it in a reasonably small amount of time. Polynomial-time reductions and logspace reductions are the ones commonly considered. A problem is called complete for a class if the problem is hard for the class and is also a member of the class. Well-known problems that are complete for the class NP are the boolean satisfiability problem, the graph homomorphism problem, and the three-colorability problem of undirected graphs. A well-known problem that is complete for the class P is Horn-SAT, the satisfiability problem of propositional Horn clauses.
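Membership of Horn-SAT in P can be seen from a simple forward-chaining procedure, sketched below in Python. The clause encoding is an assumption of this sketch: a clause is a pair `(body, head)`, where `body` is a set of atoms and `head` is an atom, or `None` for a clause without a positive literal. The loop is quadratic rather than linear, which is enough to show polynomial time.

```python
def horn_sat(clauses):
    """Return True iff the set of propositional Horn clauses is satisfiable."""
    true_atoms = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if set(body) <= true_atoms:
                if head is None:          # a goal clause is violated
                    return False
                if head not in true_atoms:
                    true_atoms.add(head)  # the head is forced to be true
                    changed = True
    return True

facts_and_rules = [
    (set(), "a"),            # a
    ({"a"}, "b"),            # a -> b
    ({"a", "b"}, "c"),       # a and b -> c
]
satisfiable = horn_sat(facts_and_rules)
unsatisfiable = horn_sat(facts_and_rules + [({"c"}, None)])   # adds: not c
```

The procedure computes the least set of atoms forced to be true, which is the propositional analogue of the chase constructions of Section 2.2.1.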


Chapter 3 Contextual Representation and Reasoning for Semantic Web: A Review on Existing Frameworks

In this chapter, we review some of the well-known existing frameworks in the SW area for reasoning with contextualized knowledge. An attempt to formalize contexts was made by McCarthy [55] in the realm of AI as early as the 80s. The main solution proposed by McCarthy was to consider contexts as first-class objects, apart from standard logical primitives; his proposal consisted of a special predicate ist, using which one could specify axioms such as ist(c, ∀x.person(x) → smart(x)), to intuitively mean that “every person in the scope of context c is smart”. Lifting rules were used to import/inter-operate axioms between contexts. As pointed out by Guha et al. [45] and Bouquet et al. [81], these and the other intricate context mechanisms that existed in AI were not directly applicable to SW applications. As a result, a number of works appeared in the 2000s that focus particularly on problems related to contexts from the SW perspective. In the following, we review some of the important ones.

3.1

Distributed Description Logics

Distributed Description Logic (DDL) [14], proposed by Borgida and Serafini, was one of the pioneering frameworks for the representation of contextualized knowledge in the SW setting. The original work, proposed as an extension to description logics, was motivated by the need to reason with a distributed set of information sources, s.t. each of these sources could assimilate knowledge from the other sources. In DDL, each of these information sources is a DL KB, and since the information sources model domains that can have possible interconnections, bridge rules and individual correspondences are provided for interoperability of the distributed KBs. Given two DL languages $L_i$ and $L_j$, a bridge rule from i to j is an expression of one of the following forms:

$i : A \xrightarrow{\sqsubseteq} j : B$, called into-bridge rule

$i : A \xrightarrow{\sqsupseteq} j : B$, called onto-bridge rule

Intuitively, the into-bridge rule states that, according to information source j, the objects of type A in information source i are of type B in information source j. An onto-bridge rule, in contrast, states that, according to information source j, every object of type B is also of type A in information source i. An individual correspondence is an expression of the form $i : a \mapsto j : b$, where a and b are instances of the DL languages $L_i$ and $L_j$, respectively; it intuitively means that, according to information source j, its object b is the same as the object a in information source i.

Definition 1. Given a set I of indices, let $\{L_i\}_{i \in I}$ be a collection of DL languages. A distributed T-box $\mathbf{T} = \langle \{T_i\}_{i \in I}, B \rangle$ consists of a set of ordinary DL T-boxes $\{T_i\}_{i \in I}$ and a set $B = \{B_{ij}\}_{i \neq j \in I}$ of bridge rules. For every $k \in I$, all the assertions in $T_k$ should be in the corresponding DL language $L_k$. And, for every bridge rule $i : A \xrightarrow{\sqsubseteq} j : B$ or $i : A \xrightarrow{\sqsupseteq} j : B$ in $B_{ij}$, the concepts A and B must be in $L_i$ and $L_j$, respectively. A distributed A-box $\mathbf{A} = \langle \{A_i\}_{i \in I}, C \rangle$ consists of a set of A-boxes $\{A_i\}_{i \in I}$ and a set $C = \{C_{ij}\}_{i \neq j \in I}$ of individual correspondences. For every $k \in I$, all descriptions in $A_k$ must be in the corresponding language $L_k$, and for every correspondence of the form $i : a \mapsto j : b$, the individuals a and b must be in the languages $L_i$ and $L_j$, respectively. A DDL KB is a pair $\langle \mathbf{T}, \mathbf{A} \rangle$, consisting of a distributed T-box $\mathbf{T}$ and a distributed A-box $\mathbf{A}$.

DDL semantics is defined on top of a distributed interpretation (structure), which is a set of local DL interpretation structures, one for each individual information system, which are further connected by domain relation mappings.

Definition 2. A distributed interpretation structure $\mathfrak{I} = \langle \{I_i\}_{i \in I}, r \rangle$ consists of DL interpretation structures $I_i = \langle \Delta^{I_i}, \cdot^{I_i} \rangle$ and a set of relations $r = \{r_{ij}\}_{i \neq j \in I}$, where $r_{ij} \subseteq \Delta^{I_i} \times \Delta^{I_j}$.

Satisfaction of distributed T-box statements is defined as follows:

Definition 3. A distributed interpretation $\mathfrak{I} = \langle \{I_i\}_{i \in I}, r \rangle$ d-satisfies (elements of) a distributed T-box $\mathbf{T} = \langle \{T_i\}_{i \in I}, B \rangle$ (written $\mathfrak{I} \models_d \mathbf{T}$) as per the following conditions:

• $\mathfrak{I} \models_d i : A \xrightarrow{\sqsubseteq} j : B$, iff $r_{ij}(A^{I_i}) \subseteq B^{I_j}$,

• $\mathfrak{I} \models_d i : A \xrightarrow{\sqsupseteq} j : B$, iff $r_{ij}(A^{I_i}) \supseteq B^{I_j}$,

• $\mathfrak{I} \models_d i : A \sqsubseteq B$, iff $I_i \models_{DL} A \sqsubseteq B$,

• $\mathfrak{I} \models_d T_i$, iff $I_i \models_{DL} T_i$,

• $\mathfrak{I} \models_d \mathbf{T}$, iff $\mathfrak{I} \models_d T_i$, for every $i \in I$, and $\mathfrak{I}$ d-satisfies every bridge rule in B,

where $i \neq j \in I$ and $\models_{DL}$ is the classical DL satisfaction relation. Satisfaction of a distributed A-box is defined as follows:

Definition 4. A distributed interpretation $\mathfrak{I} = \langle \{I_i\}_{i \in I}, r \rangle$ d-satisfies (elements of) a distributed A-box $\mathbf{A} = \langle \{A_i\}_{i \in I}, C \rangle$ as per the following conditions:

• $\mathfrak{I} \models_d i : a \mapsto j : b$, iff $b^{I_j} \in r_{ij}(a^{I_i})$,

• $\mathfrak{I} \models_d i : C(a)$, iff $I_i \models_{DL} C(a)$,

• $\mathfrak{I} \models_d i : P(a, b)$, iff $I_i \models_{DL} P(a, b)$,

• $\mathfrak{I} \models_d A_i$, iff $\mathfrak{I} \models_d st$, for every $st \in A_i$,

• $\mathfrak{I} \models_d \mathbf{A}$, iff $\mathfrak{I} \models_d A_i$, for every $i \in I$, and $\mathfrak{I}$ d-satisfies every individual correspondence in C,

where $i \neq j \in I$ and $\models_{DL}$ is the classical DL satisfaction relation. A DDL KB KB d-entails a DDL axiom st iff every distributed model of KB satisfies st. A DDL KB $KB_1$ d-entails a DDL KB $KB_2$ iff $KB_1$ d-entails st, for every $st \in KB_2$.

The wide reach of DDL in the SW community is manifested by numerous implementation, application, and extension attempts. C-OWL by Bouquet et al. [15] proposes an extension of the OWL language using the DDL semantics that enables the creation of ontologies with multiple local contexts. The authors also show how hole interpretations, which are standard OWL interpretations in which every concept and role is empty, can be used to satisfy inconsistent contexts and prevent inconsistency propagation from a locally inconsistent context to a locally consistent context, in spite of the existence of mapping bridge rules. The authors further demonstrate how directionality of mappings can be achieved using the into/onto bridge rules of DDL and the domain relations in the DDL semantics. DRAGO [87] is a robust extension of the Pellet DL reasoner [88], based on Tableaux calculus, that enables reasoning with DDL semantics over distributed ontologies. Homola et al. [52] proposed an extension to DDL semantics with the compositionality property of subsumption axioms. Given

$i : C \xrightarrow{\sqsubseteq} j : E$ and $i : D \xrightarrow{\sqsubseteq} j : F$,

the compositionality constraint (which plain DDL does not possess) ensures that

$i : C \odot D \xrightarrow{\sqsubseteq} j : E \odot F$,

where i, j are DL KBs, C, D are DL-concepts of language $L_i$, E, F are DL-concepts of language $L_j$, and $\odot$ is any DL-connective.
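The d-satisfaction conditions of Definitions 3 and 4 for bridge rules and individual correspondences can be checked directly on a finite distributed interpretation. The Python sketch below assumes that concept extensions are given as sets and the domain relation $r_{ij}$ as a set of pairs; the function names and the example data are hypothetical.

```python
def image(r_ij, xs):
    """r_ij(X): the elements of domain j related to some element of X in domain i."""
    return {y for (x, y) in r_ij if x in xs}

def satisfies_into(r_ij, a_ext, b_ext):
    """Into-bridge rule i:A -> j:B holds iff r_ij(A) is a subset of B."""
    return image(r_ij, a_ext) <= b_ext

def satisfies_onto(r_ij, a_ext, b_ext):
    """Onto-bridge rule i:A -> j:B holds iff r_ij(A) is a superset of B."""
    return image(r_ij, a_ext) >= b_ext

def satisfies_correspondence(r_ij, a, b):
    """Individual correspondence i:a -> j:b holds iff b is in r_ij({a})."""
    return b in image(r_ij, {a})

# Hypothetical data: source 1 knows professors, source 2 knows academics.
r_12 = {("p1", "q1"), ("p2", "q2")}
professor_1 = {"p1", "p2"}
academic_2 = {"q1", "q2", "q3"}

into_holds = satisfies_into(r_12, professor_1, academic_2)
onto_holds = satisfies_onto(r_12, professor_1, academic_2)
same_person = satisfies_correspondence(r_12, "p1", "q1")
```

In this example the into-bridge rule holds, while the onto-bridge rule fails because `q3` is an academic in source 2 that is not the image of any professor of source 1.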

3.2

E-connections

E-connections [64] is a methodology for connecting multiple ontologies that represent multiple contexts of a domain via the concept of ontology linking. These multiple ontologies could possibly be defined using multiple logical languages. An example of an E-connection is a domain $D_1$ of companies and locations, connected to a domain $D_2$ of people using the set of links E = {L, W}, where $L, W \subseteq D_1 \times D_2$. A pair (x, y) ∈ L intuitively represents the fact that an individual y of $D_2$ lives in a location x of $D_1$, and a pair (x, y) ∈ W intuitively represents the fact that an individual y of $D_2$ works in a company x of $D_1$. The component domains are represented using the notion of an abstract description system. Common language variants for describing systems, such as temporal logics, spatial logics and description logics, can be represented using an abstract description system.

Abstract Description System (ADS)

An abstract description language (ADL) $\mathcal{L}$ is determined by a countably infinite set of set variables $\mathcal{V}$, a countably infinite set of object variables $\mathcal{X}$, a finite set of relation symbols $\mathcal{R}$, and a finite set of function symbols $\mathcal{F}$. For any $R \in \mathcal{R}$ and $f \in \mathcal{F}$, let ar(R) and ar(f) denote the arity of R and f, respectively. The terms $t_j$ of $\mathcal{L}$ are inductively built as follows:

\[ t_j := x \mid \neg t_1 \mid t_1 \wedge t_2 \mid f(t_1, \dots, t_{ar(f)}) \]

where $x \in \mathcal{V}$ and $f \in \mathcal{F}$. The term assertions of $\mathcal{L}$ are of the form $t_1 \sqsubseteq t_2$, where $t_1, t_2$ are terms, and the object assertions are of the form:

• $R(a_1, \dots, a_{ar(R)})$, for $a_1, \dots, a_{ar(R)} \in \mathcal{X}$, $R \in \mathcal{R}$;

• $t(a)$, for $a \in \mathcal{X}$ and t a term.

The set of term assertions and object assertions together form the set of $\mathcal{L}$-assertions. The semantics of ADLs is defined via abstract description models. Given an ADL $\mathcal{L} = \langle \mathcal{V}, \mathcal{X}, \mathcal{R}, \mathcal{F} \rangle$, an abstract description model (ADM) for $\mathcal{L}$ is a structure of the form:

\[ M = \langle W,\ \mathcal{V}^M = \{v^M\}_{v \in \mathcal{V}},\ \mathcal{X}^M = \{a^M\}_{a \in \mathcal{X}},\ \mathcal{R}^M = \{R^M\}_{R \in \mathcal{R}},\ \mathcal{F}^M = \{f^M\}_{f \in \mathcal{F}} \rangle \]

where W is a non-empty set, $v^M \subseteq W$, $a^M \in W$, $f^M$ is a function mapping ar(f)-tuples $\langle X_1, \dots, X_{ar(f)} \rangle$ of subsets of W to a subset of W, and the $R^M$ are ar(R)-ary relations on W. The value $t^M \subseteq W$ of an $\mathcal{L}$-term t is defined inductively as:

• $(\neg t)^M = W \setminus t^M$,

• $(t_1 \wedge t_2)^M = t_1^M \cap t_2^M$,

• $(f(t_1, \dots, t_{ar(f)}))^M = f^M(t_1^M, \dots, t_{ar(f)}^M)$.

The satisfaction relation $M \models \phi$ of an $\mathcal{L}$-assertion φ is defined in the obvious way:

• $M \models R(a_1, \dots, a_{ar(R)})$, iff $R^M(a_1^M, \dots, a_{ar(R)}^M)$,

• $M \models t(a)$, iff $a^M \in t^M$,

• $M \models t_1 \sqsubseteq t_2$, iff $t_1^M \subseteq t_2^M$.

For a set γ of assertions, $M \models \gamma$ iff $M \models \phi$, for all φ ∈ γ.

Definition 5. An ADS is a pair $\langle \mathcal{L}, \mathcal{M} \rangle$, where $\mathcal{L}$ is an abstract description language and $\mathcal{M}$ is a class of ADMs for $\mathcal{L}$.

E-connections of Abstract Description Systems

Suppose we want to connect n ADSs $S_1, \dots, S_n$, with $S_i = \langle \mathcal{L}_i, \mathcal{M}_i \rangle$ for $1 \leq i \leq n$. In order to connect $S_1, \dots, S_n$, the following additional constructors are used:

1. a non-empty set of n-ary relational symbols $\mathcal{E} = \{E_j\}_{j \in J}$,

2. for $1 \leq i \leq n$ and each $j \in J$, function symbols $\langle E_j \rangle^i$ of arity n − 1 that are distinct from the function symbols of $S_1, \dots, S_n$.

The elements of $\mathcal{E}$ are called link relations (or links, for short), and the function symbols $\langle E_j \rangle^i$ are called link operators. The definition of the E-connection $C^{\mathcal{E}}(S_1, \dots, S_n)$ of $S_1, \dots, S_n$, following the definition of an ADS, contains a set of terms of $C^{\mathcal{E}}(S_1, \dots, S_n)$, a set of assertions, and finally a class of models and a satisfaction relation between these models and assertions. The set of $C^{\mathcal{E}}(S_1, \dots, S_n)$-terms is partitioned into n sets, each of which contains the i-terms, for $1 \leq i \leq n$. Intuitively, i-terms are the terms of $\mathcal{L}_i$ enriched with the new function symbols $\langle E_j \rangle^i$, for each $j \in J$. They are defined inductively as:

• every set variable of $\mathcal{L}_i$ is an i-term;

• the set of i-terms is closed under ∧, ¬ and the function symbols of $\mathcal{L}_i$;

• if $(t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n)$ is a sequence of k-terms $t_k$, for $k \neq i$, then $\langle E_j \rangle^i(t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n)$ is an i-term, for every $j \in J$.

There are three types of assertions for $C^{\mathcal{E}}(S_1, \dots, S_n)$. Two of these types are the term assertions and the object assertions of the component ADSs. Additionally, to be able to speak about the new ingredients of E-connections, the link relations, link assertions are used. The set of assertions of $C^{\mathcal{E}}(S_1, \dots, S_n)$ is defined as per the following rules. For $1 \leq i \leq n$,

• the i-term assertions are of the form $t_1 \sqsubseteq t_2$, where both $t_1$ and $t_2$ are i-terms;

• the i-object assertions are of the form $t(a)$ or $R(a_1, \dots, a_{ar(R)})$, where a and $a_1, \dots, a_{ar(R)}$ are object variables of $\mathcal{L}_i$, t is an i-term, and R is a relational symbol of $\mathcal{L}_i$;

• the link assertions are of the form $E_j(a_1, \dots, a_n)$, where $a_i$ is an object variable of $\mathcal{L}_i$, $1 \leq i \leq n$, and $j \in J$.

Taken together, the term assertions, object assertions and link assertions form the set of assertions of the E-connection $C^{\mathcal{E}}(S_1, \dots, S_n)$. A finite set of assertions is called a knowledge base of $C^{\mathcal{E}}(S_1, \dots, S_n)$. The semantics of $C^{\mathcal{E}}(S_1, \dots, S_n)$ is defined using a structure of the form:

\[ M = \langle \{M_i\}_{1 \leq i \leq n},\ \mathcal{E}^M = \{E_j^M\}_{j \in J} \rangle \]

where $M_i \in \mathcal{M}_i$, for $1 \leq i \leq n$, and $E_j^M \subseteq W_1 \times \dots \times W_n$, for each $j \in J$. The extension $t^M$ of an i-term t is defined inductively as per the following rules. For a set variable X and an object variable a of $\mathcal{L}_i$, $X^M = X^{M_i}$ and $a^M = a^{M_i}$. For the boolean and function symbols of $\mathcal{L}_i$:

• $(\neg t_1)^M = W_i \setminus t_1^M$, $(t_1 \wedge t_2)^M = t_1^M \cap t_2^M$,

• $(f(t_1, \dots, t_{ar(f)}))^M = f^{M_i}(t_1^M, \dots, t_{ar(f)}^M)$.

Now let $\bar{t}_i = (t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n)$ be a sequence of l-terms $t_l$, $l \neq i$. Then

\[ (\langle E_j \rangle^i(\bar{t}_i))^M = \{\, x \in W_i \mid \exists_{l \neq i}\, x_l \in t_l^M .\ (x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_n) \in E_j^M \,\} \]

Finally, the extension $R^M$ of a relational symbol R of $\mathcal{L}_i$ is just $R^{M_i}$. The truth relation ⊨ between models M for $C^{\mathcal{E}}(S_1, \dots, S_n)$ and assertions is defined in the obvious way:

• $M \models t_1 \sqsubseteq t_2$ iff $t_1^M \subseteq t_2^M$;

• $M \models t(a)$ iff $a^M \in t^M$;

• $M \models R(a_1, \dots, a_{ar(R)})$ iff $R^M(a_1^M, \dots, a_{ar(R)}^M)$;

• $M \models E_j(a_1, \dots, a_n)$ iff $E_j^M(a_1^M, \dots, a_n^M)$.

Example 6 (Description Logic-Spatial Logic). A description logic language $\mathcal{L}_1$ talks about a domain $D_1$ of objects, and a spatial logic language $\mathcal{L}_2$ talks about a spatial domain $D_2$. An E-connection is a relation $E \subseteq D_1 \times D_2$ defined by taking (x, y) ∈ E iff y belongs to the spatial extension of x, whenever x occupies some space. Given an $\mathcal{L}_1$ concept, say University, the operator $\langle E \rangle^2(\text{University})$ provides us with the spatial extension of all universities. Conversely, given a spatial region, say Italy, $\langle E \rangle^1(\text{Italy})$ provides us with the concept comprising all the objects whose spatial extension has a non-empty intersection with Italy. So the concept $\text{University} \sqcap \langle E \rangle^1(\text{Italy})$ will then denote all the Italian universities.

Several extensions to the E-connections framework have been proposed and implemented. One of the notable ones, by Parsia et al. [78], extends the framework with link properties that are transitive or that hold between multiple pairs of ADSs/components. The authors also provide a decision procedure based on Tableaux calculus for reasoning with E-connected DL systems for expressive DLs such as SHIQ and SHOQ.
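For the binary case of Example 6, the semantics of the two link operators reduces to taking pre-images and images under the link relation E. The Python sketch below illustrates this with finite sets; the individuals, region cells and function names are hypothetical and serve only as an illustration.

```python
def link_op_1(link, t2_ext):
    """<E>^1(t2): objects of domain 1 linked to at least one element of t2."""
    return {x for (x, y) in link if y in t2_ext}

def link_op_2(link, t1_ext):
    """<E>^2(t1): elements of domain 2 linked from at least one element of t1."""
    return {y for (x, y) in link if x in t1_ext}

# Hypothetical link relation: (object, spatial cell occupied by the object).
E = {
    ("unitn", "trento_cell_1"),
    ("fbk", "trento_cell_2"),
    ("mit", "cambridge_cell"),
}
university = {"unitn", "mit"}
italy = {"trento_cell_1", "trento_cell_2", "rome_cell"}

spatial_extension_of_universities = link_op_2(E, university)
objects_in_italy = link_op_1(E, italy)
italian_universities = university & objects_in_italy   # University AND <E>^1(Italy)
```

The intersection in the last line corresponds to the conjunction of the concept University with the term $\langle E \rangle^1(\text{Italy})$.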

3.3

Contextualized Knowledge Repository

Contextualized Knowledge Repository (CKR) [86] is a framework for contextualized knowledge representation, which allows a set of knowledge statements to be qualified with dimension values that indicate the modality of truth of the knowledge statements. The system implements the classical context-as-a-box metaphor proposed in [71], which says that a context is a set of logical statements, the content of the box, qualified by a set of dimensional values delimiting the boundaries of the box. For instance, the context of the current (at the time of writing of this thesis) Italian political scenario, with identifier c, can be graphically represented as follows:

c = duration(c, 22/2/2014-now), location(c, Italy), topic(c, politics)
    head of state(Giorgio Napolitano)
    prime minister(Matteo Renzi)
    is the ministry of(PierCarlo Padoan, Economy and finance)
    ...

Popular dimensions are time, geo-location, topic, speaker, provenance URL, etc.

Definition 7 (Context). Let ∆ and Σ be two (not necessarily disjoint) DL vocabularies, called meta-vocabulary and object-vocabulary, respectively; a context is a triple ⟨c, dim(c), K(c)⟩ s.t.:

1. c is an individual of ∆,

2. dim(c) is a set of assertions of the form A(c, v) on the meta-vocabulary ∆, where A is called a dimensional attribute and v is called a dimensional value,

3. K(c) is a DL knowledge base in SROIQ or one of its sublanguages over the object-vocabulary Σ.

CKR supports the mechanism of contextual qualification, a.k.a. context push-pop [72]. By means of this operation, a statement within a context can be popped out from the context, preserving its meaning, by modifying it to make explicit

the contextual parameters. CKR imposes that for every class (resp. relational) symbol σ of the object-alphabet Σ and for every dimensional value d in the range of a meta-attribute A, the object alphabet contains a class (resp. relation) symbol denoted by $\sigma_{A=d}$. The set of all such concepts (roles) are called qualified concepts (roles). For instance, if σ is the concept President and Italy is a constant of the meta-alphabet in the range of the meta-attribute location, then the qualified concept $\text{President}_{location=Italy}$ can be used to denote the object-class PresidentsOfTheItalianRepublic. Whenever it is clear from the context, the name of the attribute is skipped, so $\sigma_{Italy}$ is written for $\sigma_{location=Italy}$. In this case $\sigma_{Italy}$ is a qualified class/role, and σ is called the base class/role. A symbol is unqualified if it is not qualified.

The CKR framework, motivated by works such as [72, 67], also supports the coverage relation among contexts. Intuitively, a context covers another if the point of view of the former is broader than the point of view of the latter. For instance, the context $c_1$ of European politics covers the context $c_2$ of the Italian Economical Politics and that of European contemporary economical politics. The coverage relation can be determined by formalizing a partial order between the values of contextual attributes. For instance, by means of the meta-assertions covers(Italy, Europe) and covers(Economical Politics, politics), we could express the fact that Europe is wider than Italy, and that the topic of Politics also includes Economical Politics. These relations impose the desired coverage relation between $c_1$ and $c_2$. To represent coverage, we require that the meta-vocabulary contains one special role $covers_A$, for each attribute $A \in \mathbf{A}$.

Definition 8 (Contextualized Knowledge Repository). Given a pair of meta/object-alphabets ⟨∆, Σ⟩, a contextualized knowledge repository (CKR) over ⟨∆, Σ⟩ is a pair $K = \langle D, C \rangle$, where

1. D is a DL KB on ∆ that contains: (a) n distinct roles $\mathbf{A} = \{A_1, \dots, A_n\}$ called dimensions (or dimensional attributes); (b) for every dimension $A \in \mathbf{A}$, a finite set $D_A$ of constant symbols called the dimension values of A; (c) for every context $c \in C$ and every attribute $A \in \mathbf{A}$, an assertion A(c, v), with $v \in D_A$; (d) for every attribute $A \in \mathbf{A}$, a role $covers_A$;

2. $D_\Delta$, the dimensional space of ∆, is the set of all full dimensional vectors $\{d_{A_1}, \dots, d_{A_n}\}$, $d_{A_i} \in D_{A_i}$, for each $1 \leq i \leq n$;

3. the transitive closure of the relation $\{ \langle d, d' \rangle \mid D \models covers_A(d, d') \}$ is denoted by $\prec_A$;

4. C is a set of contexts, s.t. for every ⟨c, dim(c), K(c)⟩ ∈ C, dim(c) = {A(c, v) ∈ D} and K(c) is over Σ.

Notation: For brevity, $dim(c) = \{ A(c, d_A) \mid A \in \mathbf{A} \}$ is alternatively denoted by $\{d_A\}_{A \in \mathbf{A}}$. For every tuple $\mathbf{d} = \{d_A \in D_A\}_{A \in \mathbf{A}}$ and any subset $\mathbf{B} \subseteq \mathbf{A}$, $\mathbf{d}_{\mathbf{B}} = \{d_B\}_{B \in \mathbf{B}}$ is the projection of $\mathbf{d}$ on the subset of attributes $\mathbf{B}$. For any tuples $\mathbf{d} = \{d_A\}_{A \in \mathbf{A}}$ and $\mathbf{d}' = \{d'_A\}_{A \in \mathbf{A}}$, $\mathbf{d} \prec \mathbf{d}'$ iff $d_A \prec_A d'_A$, for all $A \in \mathbf{A}$. Similarly, $\mathbf{d}_{\mathbf{B}} \prec \mathbf{d}'_{\mathbf{B}}$ iff $d_B \prec_B d'_B$, for all $B \in \mathbf{B}$. For any pair $\mathbf{d}_{\mathbf{B}}$ and $\mathbf{d}'_{\mathbf{C}}$, we define $\mathbf{d}_{\mathbf{B}} + \mathbf{d}'_{\mathbf{C}} = \mathbf{d}_{\mathbf{B}} \cup \{ d'_C \mid C \notin \mathbf{B} \}$.

Definition 9 (Translation $(\cdot)^{+d_{\mathbf{B}}}$). For any set of dimensional values $d_{\mathbf{B}}$ and any complex concept $X$, $(X)^{+d_{\mathbf{B}}}$ is obtained by simultaneously applying the following substitutions to the individuals ($a$), atomic concepts ($A$), and atomic roles ($R$) that occur in $X$:

$(a)^{+d_{\mathbf{B}}} \to a$

$(A_{d'_{\mathbf{B}'}})^{+d_{\mathbf{B}}} \to A_{d'_{\mathbf{B}'} + d_{\mathbf{B}}}$

$(R_{d'_{\mathbf{B}'}})^{+d_{\mathbf{B}}} \to R_{d'_{\mathbf{B}'} + d_{\mathbf{B}}}$

Intuitively, the operator $(\cdot)^{+d_{\mathbf{B}}}$ makes explicit the contextual dimension values $d_{\mathbf{B}}$ in a concept. For instance:

$(Professor_{Italy})^{+2010} = Professor_{Italy,2010}$

$(Professor_{Italy})^{+France} = Professor_{Italy}$
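The two examples above can be reproduced with a minimal sketch of the translation operator. The representation of a qualified symbol as a base name plus an attribute-to-value map, the function names, and the attribute names `location` and `time` are assumptions made for illustration only; they are not part of the CKR formalization.

```python
from typing import Dict, NamedTuple, Tuple


class Qualified(NamedTuple):
    """A qualified symbol: base name plus sorted (attribute, value) pairs."""
    base: str
    dims: Tuple[Tuple[str, str], ...]


def qualify(base: str, dims: Dict[str, str]) -> Qualified:
    # Sorting makes two symbols with the same qualification compare equal.
    return Qualified(base, tuple(sorted(dims.items())))


def plus(d: Dict[str, str], e: Dict[str, str]) -> Dict[str, str]:
    """d + e: keep every value of d, add values of e only for attributes absent from d."""
    merged = dict(e)
    merged.update(d)  # values already fixed in d take precedence
    return merged


def translate(symbol: Qualified, d_b: Dict[str, str]) -> Qualified:
    """Apply (.)^{+d_B} to a qualified concept or role symbol."""
    return qualify(symbol.base, plus(dict(symbol.dims), d_b))


# Hypothetical attribute names: "location" for Italy/France, "time" for 2010.
professor_italy = qualify("Professor", {"location": "Italy"})
with_year = translate(professor_italy, {"time": "2010"})
with_france = translate(professor_italy, {"location": "France"})
```

In this sketch, `with_france` equals `professor_italy` because the location value is already fixed to Italy, which mirrors the second example above.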

The semantics of a CKR is defined as follows:

Definition 10 (Model of a CKR). A model of a contextualized knowledge repository $K = \langle D, \mathcal{C} \rangle$ over $\langle \Delta, \Sigma \rangle$ is a class of DL interpretations $I_{\mathcal{C}} = \{I_d\}_{d \in D_\Delta}$, $I_d = \langle \Delta_d, \cdot^{I_d} \rangle$, for which the following conditions are satisfied ($a$ denotes an individual, $C$ an unqualified concept, $R$ an unqualified role, and $X$ either an unqualified concept or a role):
1. $\Delta_d \subseteq \Delta_e$, if $d \prec e$;
2. $a^{I_d} = a^{I_e}$, for every individual symbol $a \in \Sigma$;
3. $(\top_d)^{I_f} \subseteq (\top_e)^{I_f}$, if $d \prec e$;
4. $(X_{d_{\mathbf{B}}})^{I_e} = (X_{d_{\mathbf{B}} + e})^{I_e}$;
5. $(X_d)^{I_e} = (X_d)^{I_d}$, if $d \prec e$;
6. $(C_f)^{I_d} = (C_f)^{I_e} \cap \Delta_d$, if $d \prec e$;
7. $(C_f)^{I_d} \subseteq (\top_f)^{I_d}$;
8. $(R_f)^{I_d} = (R_f)^{I_e} \cap \Delta_d^2$, if $d \prec e$;
9. $(R_f)^{I_d} \subseteq (\top_f)^{I_d} \times (\top_f)^{I_d}$;
10. $I_d \models K(C)$ if $dim(C) = d$, for every $d \in D_\Delta$.

Reasoning in the CKR. The main reasoning task considered in [86] for a CKR over $\langle \Delta, \Sigma \rangle$ is the decision problem of checking entailment of formulas over the object alphabet $\Sigma$. Note that for such a reasoning task, the reference context w.r.t. which the object formula is considered needs to be made explicit.

Definition 11 (d-entailment and d-satisfiability). Given a CKR $K$ over $\langle \Delta, \Sigma \rangle$, $d \in D_\Delta$, and any DL formula $\phi$, we say that $\phi$ is d-entailed by $K$, in symbols $K \models d : \phi$, iff for every CKR model of $K$, its component $I_d = \langle \Delta_d, \cdot^{I_d} \rangle$ is such that $I_d \models_{DL} \phi$, where $\models_{DL}$ is the classical satisfaction relation between a DL model and a DL formula.

In [86], the authors give a set of inference rules, in the spirit of natural deduction calculus, that is sound and complete w.r.t. the semantics described above. Each such rule is of the form:

$$\frac{d : A \sqsubseteq B \qquad D : d \prec e}{e : A_d \sqsubseteq B_d}$$

Intuitively, the rule means that when $A \sqsubseteq B$ holds in context $d$ and $d \prec e$ holds according to the meta knowledge, then $A_d \sqsubseteq B_d$ should hold in context $e$. For a set of rules based on the Tableaux calculus, we refer the reader to the work of Bozzato et al. An RDF formulation of the CKR, in which the local semantics for each context is defined using an RDF(S) interpretation, is given in Serafini et al. [75]. The authors provide a sound and complete set of inference rules, using which a finite deductive closure can be computed in polynomial time. Recently, an extension of the CKR with the capability of defeasible reasoning was provided by Bozzato et al. [16].

3.4 Thesis Advancements

In this section, we give an account of some of the novelties of this thesis w.r.t. the existing contextual frameworks described in the previous sections: DDL, E-connections, and CKR. We describe the main merits below, classified under the following headings.

3.4.1 Conjunctive Bridge Rules

As we noticed, there is a natural requirement to specify rules such as:

$c_1 : X_1 \sqcap \ldots \sqcap c_n : X_n \xrightarrow{\sqsubseteq} c'_1 : X'_1 \sqcap \ldots \sqcap c'_m : X'_m$    (3.1)

where $X_i$, $1 \le i \le n$, and $X'_j$, $1 \le j \le m$, are concept (resp. role) symbols. Such a bridge rule establishes an inclusion relation from the intersection of $n$ concepts (resp. roles) $X_i$ in contexts $c_i$ to the intersection of $m$ concepts (resp. roles) $X'_j$ in contexts $c'_j$. Note that the bridge rules in the framework of quad-systems, which we introduce in this thesis, are adequate for such cases.

Though DDL employs bridge rules and individual correspondences to establish relations between objects in the domains of two contexts, the relation established between objects is always a binary relation, given by the domain relation $r_{ij}$ that maps objects in the context $c_i$ to objects in the context $c_j$. Also, a DDL bridge rule establishes an inclusion relation from a set of objects in context $c_i$ to a set of objects in context $c_j$, and a DDL individual correspondence maps an object in context $c_i$ to a set of objects in context $c_j$, via the domain relation $r_{ij}$. Hence, a formula/rule that serves the purpose of the rule (3.1) cannot be specified in DDL.

Also, in an E-connection of $n$ contexts, a link relation $E \subseteq \Delta_1 \times \ldots \times \Delta_n$ relates the objects in the domains of the contexts, and the $(n-1)$-ary function symbol $\langle E \rangle_i$, given $n-1$ concept symbols $C_k$ of contexts $c_k$, $k \ne i$, defines a concept in context $c_i$, given by $\langle E \rangle_i(C_1, \ldots, C_n)$. Consequently, one cannot mix symbols from the languages of different contexts in a single (subsumption) formula, as in $c_1 : C_1 \sqsubseteq c_2 : C_2$. Hence, a formula/rule that serves the purpose of the rule (3.1) cannot be specified in E-connections either.

Different from DDL and E-connections, the CKR framework does not have bridge rules. Hence, a formula/rule that serves the purpose of the rule (3.1) cannot be specified in CKR. Despite this, the reader should note that conjunctive bridge rules of the form (3.1), where no $X_i$, $X'_j$ is a role, were supported in Bao et al. [9] and in a few early knowledge-based systems such as Tropes [38], which allowed concepts in multiple source contexts to be related to a concept in a destination context.

3.4.2 Heterogeneous Bridge Rules

One might also want to establish that the nodes of a cross-contextual complex role path are members of a certain concept in a context. For instance, one may need a rule like the following:

$\forall x_1 \forall x_2 \forall x_3 \; c_i : R(x_1, x_2) \wedge c_j : R(x_2, x_3) \wedge c_k : R(x_1, x_3) \to c_l : C(x_2)$

Or one may need to form the product of two concepts $C_1$, $C_2$ in two contexts $c_1$, $c_2$, respectively, as a role $R$ in another context $c_3$. Such a constraint can naturally be established by a rule of the following form:

$\forall x_1 \forall x_2 \; c_1 : C_1(x_1) \wedge c_2 : C_2(x_2) \to c_3 : R(x_1, x_2)$

The framework of quad-systems, introduced in this thesis, supports the specification of such bridge rules, which allow the simultaneous occurrence of concepts and roles. Note that bridge rules of this kind cannot be expressed in DDL, as a bridge rule in DDL can only be an inclusion mapping between a pair of concepts. Also, in the framework of E-connections of $n$ contexts, a link operator only allows the creation of a concept term $C_i$ in a context $c_i$, using concept terms $C_k$ in the other $n-1$ contexts $c_k$, $k \ne i$. There is no mechanism that allows the specification of a formula that serves the purpose of the aforementioned rules. The same holds for CKR, which does not have the mechanism of bridge rules.

3.4.3 Value Inventing Bridge Rules

Another desirable feature for a contextual framework is the support for bridge rules that enable value/blank node invention. For instance, one would want to state assertions such as:

$c_1 : C(a) \to \exists y \; c_2 : C(y)$    (3.2)

which intuitively states that if an object denoted by $a$ is of type $C$ in context $c_1$, then there exists an anonymous object $o$ that is also of type $C$ in context $c_2$. The framework of quad-systems supports the specification of such bridge rules with existential quantifiers in the head, and hence a rule that serves the purpose of (3.2) can be specified. Note that DDL bridge rules do not support existential quantification, and hence do not support value invention. The same holds for E-connections. Note that Package-based Description Logics (PDL) by Bao et al. [9] and the CKR allow one to import a concept $C$ from a context $d$ to a context $c$. After such an import to context $c$, one could use the qualifying syntax $C_d$ in CKR (resp. $d : C$ in PDL) to refer to a concept $C$ in context $d$ from context $c$. Subsequently, $C_d$ can be used like any other ordinary concept symbol in context $c$ in order to refer to the extension of $C_d$ in the domain of context $c$. This allows one to state axioms such as $C(a) \to \exists y \; C_d(y)$ in context $c$, and serves in a limited way the function of value-inventing bridge rules.

3.4.4 Contextual Conjunctive Queries

Suppose that a knowledge base $K$ contains two contexts $c_1$ and $c_2$, and the following axioms in their A-boxes:

$c_1 : C_1(a)$, $c_2 : C_2(a)$

where $C_1$, $C_2$ are concepts, and $a$ is an individual. Suppose that $Q()$, a query that spans multiple contexts, is given as:

$\exists y \; c_1 : C_1(y), c_2 : C_2(y)$

One would expect $K$ to entail $Q()$. Such queries that span multiple contexts, called contextual conjunctive queries, are described in detail later in this thesis. Note that according to the semantics of the framework of quad-systems, $Q()$ is entailed by $K$. The same is the case for the CKR semantics. This is because in both these frameworks, a constant $a$ represents the same object irrespective of the context in which it occurs. This property is commonly referred to by the KR and database communities as the rigid constant property. A shortcoming, in this respect, of both DDL and E-connections is that, according to their semantics, $Q()$ is not entailed by $K$. This is because, in both these frameworks, $a$ in $c_1$ and $a$ in $c_2$ are interpreted as arbitrary objects $o_1$ and $o_2$ in $c_1$ and $c_2$, respectively. Though individual correspondences (resp. link assertions) can be used to map $o_1$ to $o_2$ and vice versa, using the domain relations $r_{12}$ and $r_{21}$ in DDL (resp. to link $o_1$ and $o_2$ in E-connections), there is no way by which one can establish that $o_1$ and $o_2$ are the same object. This is undesirable in a typical SW scenario, where $a$ is a URI and one would want $a$ to represent the same object, irrespective of the context in which it appears.


Chapter 4
Query Answering over Quad-Systems and its Undecidability

In this chapter, we formally introduce notions such as quads, quad-systems, contextual queries, and the problem of query answering over quad-systems. We then establish the undecidability result of query answering.

4.1 Quad-Systems

For any sets $A$ and $B$, $A \to B$ denotes the set of all functions from set $A$ to set $B$. A quad is a tuple of the form $c : (s, p, o)$, where $(s, p, o)$ is a triple and $c$ is a URI$^1$, called the context identifier, that denotes the context of the RDF triple. A quad-graph is defined as a set of quads. For any quad-graph $Q$ and any context identifier $c$, we denote by $graph_Q(c)$ the set $\{(s, p, o) \mid c : (s, p, o) \in Q\}$. We denote by $Q_{\mathcal{C}}$ the quad-graph whose set of context identifiers is $\mathcal{C}$. The set of constants occurring in $Q_{\mathcal{C}}$ is given as $\mathbf{C}(Q_{\mathcal{C}}) = \{c, s, p, o \mid c : (s, p, o) \in Q_{\mathcal{C}}\}$. The set of URIs in $Q_{\mathcal{C}}$ is given by $\mathbf{U}(Q_{\mathcal{C}}) = \mathbf{C}(Q_{\mathcal{C}}) \cap \mathbf{U}$. The set of blank nodes $\mathbf{B}(Q_{\mathcal{C}})$ and the set of literals $\mathbf{L}(Q_{\mathcal{C}})$ are similarly defined. An expression of the form $c : (s, p, o)$, where $(s, p, o)$ is a triple pattern and $c$ is a context identifier, is called a quad pattern. A quad pattern $q$ whose variables are elements of the vector $\vec{x}$ or elements of the vector $\vec{y}$ is written as $q(\vec{x}, \vec{y})$; $Q(\vec{x}, \vec{y})$ denotes a set of quad patterns whose variables are from $\vec{x}$ or $\vec{y}$, and $Q(\vec{a}, \vec{y})$ is written for $Q(\vec{x}, \vec{y})[\vec{x}/\vec{a}]$.

$^1$ Although, in general, a context identifier can be any constant, for ease of notation we restrict it to be a URI.

Bridge rules (BRs). For the sake of interoperating knowledge in different contexts, bridge rules need to be provided.

Formally, a BR is of the form:

$\forall \vec{x} \forall \vec{z} \, [c_1 : t_1(\vec{x}, \vec{z}) \wedge \ldots \wedge c_n : t_n(\vec{x}, \vec{z}) \to \exists \vec{y} \; c'_1 : t'_1(\vec{x}, \vec{y}) \wedge \ldots \wedge c'_m : t'_m(\vec{x}, \vec{y})]$    (4.1)

where $c_1, \ldots, c_n, c'_1, \ldots, c'_m$ are context identifiers, and $\vec{x}$, $\vec{y}$, $\vec{z}$ are vectors of variables s.t. $\{\vec{x}\}$, $\{\vec{y}\}$, and $\{\vec{z}\}$ are pairwise disjoint. $t_1(\vec{x}, \vec{z}), \ldots, t_n(\vec{x}, \vec{z})$ are triple patterns that do not contain blank nodes, and whose variables are from $\vec{x}$ or $\vec{z}$. $t'_1(\vec{x}, \vec{y}), \ldots, t'_m(\vec{x}, \vec{y})$ are triple patterns whose variables are from $\vec{x}$ or $\vec{y}$, and which also do not contain blank nodes. For any BR $r$ of the form (4.1), $body(r)$ is the set of quad patterns $\{c_1 : t_1(\vec{x}, \vec{z}), \ldots, c_n : t_n(\vec{x}, \vec{z})\}$, $head(r)$ is the set of quad patterns $\{c'_1 : t'_1(\vec{x}, \vec{y}), \ldots, c'_m : t'_m(\vec{x}, \vec{y})\}$, and the frontier of $r$ is $fr(r) = \{\vec{x}\}$. Occasionally, we also write the BR $r$ above as $body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y})$. The set of terms in a BR $r$ is:

$\mathbf{CV}(r) = \{c, s, p, o \mid c : (s, p, o) \in body(r) \cup head(r)\}$

The set of terms for a set of BRs $R$ is $\mathbf{CV}(R) = \bigcup_{r \in R} \mathbf{CV}(r)$. The URIs, blank nodes, literals, and variables of a BR $r$ (resp. set of BRs $R$) are similarly defined, and are denoted as $\mathbf{U}(r)$, $\mathbf{B}(r)$, $\mathbf{L}(r)$, $\mathbf{V}(r)$ (resp. $\mathbf{U}(R)$, $\mathbf{B}(R)$, $\mathbf{L}(R)$, $\mathbf{V}(R)$), respectively.

Definition 1 (Quad-System). A quad-system $QS_{\mathcal{C}}$ is defined as a pair $\langle Q_{\mathcal{C}}, R \rangle$, where $Q_{\mathcal{C}}$ is a quad-graph whose set of context identifiers is $\mathcal{C}$, and $R$ is a set of BRs.
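As an illustration of the definitions above, the following minimal sketch shows one possible in-memory representation of quads, BRs, and quad-systems. The encoding of variables as strings starting with `?`, the class and method names, and the sample rule are assumptions made for this sketch; the frontier is computed as the variables shared by body and head, which coincides with $\{\vec{x}\}$ when every variable of $\vec{x}$ occurs on both sides.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple

# A quad (pattern) is (context, subject, predicate, object).
QuadPattern = Tuple[str, str, str, str]


def is_var(term: str) -> bool:
    # Assumed convention: variables are written with a leading "?".
    return term.startswith("?")


def variables(patterns: Iterable[QuadPattern]) -> Set[str]:
    return {term for pattern in patterns for term in pattern if is_var(term)}


@dataclass(frozen=True)
class BridgeRule:
    body: FrozenSet[QuadPattern]
    head: FrozenSet[QuadPattern]

    def frontier(self) -> Set[str]:
        # Variables shared between body and head.
        return variables(self.body) & variables(self.head)

    def existential(self) -> Set[str]:
        # Head variables that do not occur in the body.
        return variables(self.head) - variables(self.body)


@dataclass(frozen=True)
class QuadSystem:
    quads: FrozenSet[QuadPattern]
    rules: FrozenSet[BridgeRule]

    def contexts(self) -> Set[str]:
        return {quad[0] for quad in self.quads}


# Hypothetical BR: c1 : (x, rdf:type, C) -> exists y. c2 : (x, P, y)
rule = BridgeRule(
    body=frozenset({("c1", "?x", "rdf:type", "C")}),
    head=frozenset({("c2", "?x", "P", "?y")}),
)
system = QuadSystem(
    quads=frozenset({("c1", "a", "rdf:type", "C")}),
    rules=frozenset({rule}),
)
```

Frozen dataclasses are used so that rules are hashable and can themselves be collected in a set, matching the set-based definitions.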

For any quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, the set of constants in $QS_{\mathcal{C}}$ is given by $\mathbf{C}(QS_{\mathcal{C}}) = \mathbf{C}(Q_{\mathcal{C}}) \cup \mathbf{C}(R)$. The sets $\mathbf{U}(QS_{\mathcal{C}})$, $\mathbf{B}(QS_{\mathcal{C}})$, $\mathbf{L}(QS_{\mathcal{C}})$, and $\mathbf{V}(QS_{\mathcal{C}})$ are similarly defined for any quad-system $QS_{\mathcal{C}}$. For any quad-graph $Q_{\mathcal{C}}$ (BR $r$), its symbol size $\|Q_{\mathcal{C}}\|$ ($\|r\|$) is the number of symbols required to print $Q_{\mathcal{C}}$ ($r$). Hence, $\|Q_{\mathcal{C}}\| \approx 4 * |Q_{\mathcal{C}}|$, where $|Q_{\mathcal{C}}|$ denotes the cardinality of the set $Q_{\mathcal{C}}$, i.e. the number of quads in $Q_{\mathcal{C}}$. For a BR $r$, $\|r\| \approx 4 * k$, where $k$ is the number of quad patterns in $r$. For a set of BRs $R$, $\|R\|$ is given as $\sum_{r \in R} \|r\|$. For any quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, its size is $\|QS_{\mathcal{C}}\| = \|Q_{\mathcal{C}}\| + \|R\|$.

Semantics. In order to provide a semantics for enabling reasoning over a quad-system, we need to use a local semantics for each context to interpret the knowledge pertaining to it. Since one of the goals of this thesis is to derive a decision procedure for query answering over quad-systems based on forward chaining, we consider the following desiderata for the choice of the local semantics and its deductive machinery:
• there exists a set LIR of inference rules and an operation lclosure() that computes the deductive closure of a graph w.r.t. the local semantics using the inference rules in LIR,
• each inference rule in LIR is range restricted, i.e. non value-generating,
• given a finite graph as input, the lclosure() operation terminates in polynomial time with a finite graph as output, whose size is polynomial w.r.t. the input.

Some of the alternatives for the local semantics satisfying the above-mentioned criteria are Simple, RDF, RDFS [50], OWL-Horst [90], etc. Assuming that a local semantics has been fixed, for any context $c$, we denote by $I^c = \langle \Delta^c, \cdot^c \rangle$ an interpretation structure for the local semantics, where $\Delta^c$ is the interpretation domain and $\cdot^c$ the corresponding interpretation function. Also, $\models_{local}$ denotes the

local satisfaction relation between a local interpretation structure and a graph. Given a quad-graph $Q_{\mathcal{C}}$, a distributed interpretation structure is an indexed set $I^{\mathcal{C}} = \{I^c\}_{c \in \mathcal{C}}$, where $I^c$ is a local interpretation structure, for each $c \in \mathcal{C}$. We define the satisfaction relation $\models$ between a distributed interpretation structure $I^{\mathcal{C}}$ and a quad-system $QS_{\mathcal{C}}$ as:

Definition 2 (Model of a Quad-System). A distributed interpretation structure $I^{\mathcal{C}} = \{I^c\}_{c \in \mathcal{C}}$ satisfies a quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, in symbols $I^{\mathcal{C}} \models QS_{\mathcal{C}}$, iff all the following conditions are satisfied:
1. $I^c \models_{local} graph_{Q_{\mathcal{C}}}(c)$, for each $c \in \mathcal{C}$;
2. $a^{c_i} = a^{c_j}$, for any constant $a \in \mathbf{C}$ and any $c_i, c_j \in \mathcal{C}$;
3. for each BR $r \in R$ of the form (4.1) and for each $\sigma \in \mathbf{V} \to \Delta^{\mathcal{C}}$, where $\Delta^{\mathcal{C}} = \bigcup_{c \in \mathcal{C}} \Delta^c$, if $I^{c_1} \models_{local} t_1(\vec{x}, \vec{z})[\sigma], \ldots, I^{c_n} \models_{local} t_n(\vec{x}, \vec{z})[\sigma]$, then there exists a function $\sigma' \supseteq \sigma$, s.t. $I^{c'_1} \models_{local} t'_1(\vec{x}, \vec{y})[\sigma'], \ldots, I^{c'_m} \models_{local} t'_m(\vec{x}, \vec{y})[\sigma']$.

Condition 1 in the above definition ensures that for any model $I^{\mathcal{C}}$ of a quad-system, each $I^c \in I^{\mathcal{C}}$ is a local model of the set of triples in context $c$. Condition 2 ensures that any constant $a$ is rigid, i.e. represents the same resource across a quad-graph, irrespective of the context in which it occurs. Condition 3 ensures that any model of a quad-system satisfies each BR in it. Any $I^{\mathcal{C}}$ s.t. $I^{\mathcal{C}} \models QS_{\mathcal{C}}$ is said to be a model of $QS_{\mathcal{C}}$. A quad-system $QS_{\mathcal{C}}$ is said to be consistent if there exists a model $I^{\mathcal{C}}$ s.t. $I^{\mathcal{C}} \models QS_{\mathcal{C}}$, and is otherwise said to be inconsistent. For a quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, it can be the case that $graph_{Q_{\mathcal{C}}}(c)$ is locally consistent for each $c \in \mathcal{C}$, i.e. there exists an $I^c$ s.t. $I^c \models_{local} graph_{Q_{\mathcal{C}}}(c)$, whereas $QS_{\mathcal{C}}$ is not consistent. This is because the set of BRs $R$ adds more knowledge to the quad-system, and restricts the set of models that satisfy the quad-system.

Definition 3 (Quad-system entailment). (a) A quad-system QSC entails a quad c : (s, p, o), in symbols QSC |= c : (s, p, o), iff for any distributed interpretation structure I C , if I C |= QSC then I C |= h{c : (s, p, o)}, ∅i. (b) A quad-system QSC entails a quad-graph Q0C 0 , in symbols QSC |= Q0C 0 iff QSC |= c : (s, p, o) for every c : (s, p, o) ∈ Q0C 0 . (c) A quad-system QSC entails a BR r iff for any I C , if I C |= QSC then I C |= h∅, {r}i. (d) For a set of BRs R, QSC |= R iff QSC |= r, for every r ∈ R. (e) Finally, a quad-system QSC entails another quad-system QSC0 0 = hQ0C 0 , R0 i, in symbols QSC |= QSC0 0 iff QSC |= Q0C 0 and QSC |= R0 . We call the DPs corresponding to the entailment problems (EPs) in (a), (b), (c), (d), and (e) as quad EP, quad-graph EP, BR EP, BRs EP, and quad-system EP, respectively.

4.2 Query Answering on Quad-Systems

In the realm of quad-systems, the classical conjunctive queries or select-project-join queries are slightly extended to what we call Contextualized Conjunctive Queries (CCQs). A CCQ $CQ(\vec{x})$ is an expression of the form:

$\exists \vec{y} \; q_1(\vec{x}, \vec{y}) \wedge \ldots \wedge q_p(\vec{x}, \vec{y})$    (4.2)

where $q_i$, for $i = 1, \ldots, p$, are quad patterns over vectors of free variables $\vec{x}$ and quantified variables $\vec{y}$. A CCQ is called a boolean CCQ if it does not have any free variables. With some abuse, we sometimes discard the logical symbols in a CCQ and consider it as a set of quad patterns. For any CCQ $CQ(\vec{x})$ and a vector $\vec{a}$ of constants s.t. $\|\vec{x}\| = \|\vec{a}\|$, $CQ(\vec{a})$ is boolean. A vector $\vec{a}$ is an answer for a CCQ $CQ(\vec{x})$ w.r.t. structure $I^{\mathcal{C}}$, in symbols $I^{\mathcal{C}} \models CQ(\vec{a})$, iff there exists an assignment $\mu : \{\vec{y}\} \to \mathbf{B}$ s.t. $I^{\mathcal{C}} \models \bigcup_{i=1,\ldots,p} q_i(\vec{a}, \vec{y})[\mu]$. A vector $\vec{a}$ is a certain answer for a CCQ $CQ(\vec{x})$ over a quad-system $QS_{\mathcal{C}}$ iff $I^{\mathcal{C}} \models CQ(\vec{a})$, for every model $I^{\mathcal{C}}$ of $QS_{\mathcal{C}}$. The problem of entailment of CCQs over quad-systems is defined as follows:

Figure 4.1: A CCQ over quad-system

Definition 4 (CCQ EP). Given a quad-system $QS_{\mathcal{C}}$, a CCQ $CQ(\vec{x})$, and a vector $\vec{a}$, the decision problem of determining whether $QS_{\mathcal{C}} \models CQ(\vec{a})$ is called the CCQ EP.

It can be noted that the other DPs over quad-systems, namely Quad/Quad-graph EP, BR(s) EP, and Quad-system EP, are reducible to the CCQ EP (see Property 7 of Chapter 8). Hence, in this dissertation, we primarily focus on the CCQ EP.

[Figure: contexts c1 and c2 with nodes a, b, d, e, f]

Figure 4.2: A sample CCQ: Intersecting objects in different contexts
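To make the notion of a CCQ match concrete, the following minimal sketch evaluates a conjunction of quad patterns against a finite quad-graph by backtracking. It only matches patterns against explicitly given quads (it does not compute entailment under any local semantics), and the sample data, in particular the direction of the `edge` triples, is an assumption made for illustration in the spirit of Figure 4.2.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (context, subject, predicate, object)


def is_var(term: str) -> bool:
    # Assumed convention: variables are written with a leading "?".
    return term.startswith("?")


def match(pattern: Quad, quad: Quad, binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` equals `quad`, or return None."""
    extended = dict(binding)
    for p, v in zip(pattern, quad):
        if is_var(p):
            if extended.setdefault(p, v) != v:
                return None  # variable already bound to a different constant
        elif p != v:
            return None  # constants must coincide (rigid constants)
    return extended


def answers(query: List[Quad], graph: Set[Quad],
            binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Enumerate all substitutions mapping every query pattern into `graph`."""
    binding = {} if binding is None else binding
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for quad in graph:
        extended = match(first, quad, binding)
        if extended is not None:
            yield from answers(rest, graph, extended)


# Hypothetical data: one triangle per context, sharing the node f.
graph = {
    ("c1", "a", "edge", "b"), ("c1", "a", "edge", "f"), ("c1", "b", "edge", "f"),
    ("c2", "f", "edge", "d"), ("c2", "f", "edge", "e"), ("c2", "d", "edge", "e"),
}
query = [
    ("c1", "?u", "edge", "?v"), ("c1", "?u", "edge", "?w"), ("c1", "?v", "edge", "?w"),
    ("c2", "?w", "edge", "?x"), ("c2", "?w", "edge", "?y"), ("c2", "?x", "edge", "?y"),
]
results = list(answers(query, graph))
```

The join on `?w` across the two contexts succeeds only because the constant `f` is treated as the same object in both contexts.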

Example 5. If $c_1$ and $c_2$ are two different contexts about geometric shapes, then the query:

$c_1 : (u, edge, v) \wedge c_1 : (u, edge, w) \wedge c_1 : (v, edge, w) \wedge c_2 : (w, edge, x) \wedge c_2 : (w, edge, y) \wedge c_2 : (x, edge, y)$

intuitively returns three nodes each from $c_1$ and $c_2$ that participate in a triangle, such that the third node of the two triangles coincides. The snapshot in Fig. 4.2 gives a scenario in which nodes $a$, $b$ (bound to variables $u$, $v$) participate in a triangle in $c_1$ and nodes $d$, $e$ (bound to variables $x$, $y$) participate in a triangle in $c_2$, with $f$ (bound to the variable $w$) as the common third node of the two triangles. Note that such queries are expressible thanks to condition 2 of Definition 2, which gives the same denotation to a constant (in this case $f$) if it occurs in two different contexts of a quad-system; this is known as the rigid constant property in KR.

In the realm of quad-systems, we extend (see forthcoming chapters) the standard chase to a distributed chase, abbreviated dChase.

4.2.1 Undecidability of Query Answering on Quad-Systems

The following proposition reveals that for the class of quad-systems whose BRs are of the form (4.1), which we call unrestricted quad-systems, the dChase can be infinite.

Proposition 6. There exist unrestricted quad-systems whose dChase is infinite.

Proof. Consider an example of a quad-system $QS_c = \langle Q_c, \{r\} \rangle$, where $Q_c = \{c : (a, \text{rdf:type}, C)\}$, and the BR $r = c : (x, \text{rdf:type}, C) \to \exists y \; c : (x, P, y), c : (y, \text{rdf:type}, C)$. The dChase computation starts with $dChase_0(QS_c) = \{c : (a, \text{rdf:type}, C)\}$. Now the rule $r$ is applicable, and its application leads to $dChase_1(QS_c) = \{c : (a, \text{rdf:type}, C), c : (a, P, \_{:}b_1), c : (\_{:}b_1, \text{rdf:type}, C)\}$, where $\_{:}b_1$ is a fresh Skolem blank node. It can be noted that $r$ is yet again applicable on $dChase_1(QS_c)$, for $c : (\_{:}b_1, \text{rdf:type}, C)$, which leads to the generation of another Skolem blank node, and so on. Hence, the dChase computation does not reach a finite fix-point, and $dChase(QS_c)$ is infinite.

A class $\mathbb{C}$ of quad-systems is called a finite extension class (FEC) iff for every member $QS_{\mathcal{C}} \in \mathbb{C}$, $dChase(QS_{\mathcal{C}})$ is a finite set. Therefore, the class of unrestricted quad-systems is not a FEC. This raises the question of whether other approaches can be used. For instance, a similar problem of non-finite chase is manifested in description logics (DLs) with value creation, due to the presence of existential quantifiers, and approaches such as the ones in Calvanese et al. [27], Glimm et al. [43], and Lutz et al. [68] provide algorithms for CQ entailment based on query rewriting. Theorem 7 below, however, establishes the fact that the CCQ EP for unrestricted quad-systems is undecidable.

Theorem 7. The CCQ entailment problem over unrestricted quad-systems is undecidable.

Proof. (sketch) We show that the well-known undecidable problem of non-emptiness of the intersection of the languages generated by two context-free grammars (CFGs) is reducible to the CCQ entailment problem. Consider two CFGs, $G_1 = \langle V_1, T, S_1, P_1 \rangle$ and $G_2 = \langle V_2, T, S_2, P_2 \rangle$, where $V_1$, $V_2$, with $V_1 \cap V_2 = \emptyset$, are the sets of variables, and $T$, with $T \cap (V_1 \cup V_2) = \emptyset$, is the set of terminals. $S_1 \in V_1$ is the start symbol of $G_1$, and $P_1$ is the set of production rules (PRs) of the form $v \to \vec{w}$, where $v \in V_1$ and $\vec{w}$ is a sequence of the form $w_1 \ldots w_n$, with $w_i \in V_1 \cup T$. $S_2$ and $P_2$ are defined similarly. Deciding whether the languages $L(G_1)$ and $L(G_2)$ generated by the two grammars have a non-empty intersection is known to be undecidable [48].

Given two CFGs $G_1 = \langle V_1, T, S_1, P_1 \rangle$ and $G_2 = \langle V_2, T, S_2, P_2 \rangle$, we encode the grammars $G_1$, $G_2$ into a quad-system $QS_c = \langle Q_c, R \rangle$, with only a single context identifier $c$. Each PR $r = v \to \vec{w} \in P_1 \cup P_2$, with $\vec{w} = w_1 w_2 w_3 \ldots w_n$, is encoded as a BR of the form:

$c : (x_1, w_1, x_2), c : (x_2, w_2, x_3), \ldots, c : (x_n, w_n, x_{n+1}) \to c : (x_1, v, x_{n+1})$

where $x_1, \ldots, x_{n+1}$ are variables. For each terminal symbol $t_i \in T$, $R$ contains a BR of the form:

$c : (x, \text{rdf:type}, C) \to \exists y \; c : (x, t_i, y), c : (y, \text{rdf:type}, C)$

and $Q_c$ is the singleton $\{c : (a, \text{rdf:type}, C)\}$. It can be proven that:

$QS_c \models \exists y \; c : (a, S_1, y) \wedge c : (a, S_2, y) \Leftrightarrow L(G_1) \cap L(G_2) \ne \emptyset$

We refer the reader to the Appendix for the complete proof.

Having shown the undecidability of query answering over unrestricted quad-systems, the rest of the thesis focuses on defining subclasses of unrestricted quad-systems for which query answering is decidable, and establishing

their relationships with similar classes in the realm of ∀∃ rules. While defining decidable classes for quad-systems, one mainly has two fundamentally distinct options: (i) to define notions that solely use the structure/properties of the BR part, ignoring the quad-graph part, or (ii) to define notions that take into account both the BR and the quad-graph part. The decidability notions which we define in Chapter 6, namely safety, msafety, and csafety, belong to type (ii), as these techniques take into account the properties of the dChase of a quad-system, which is determined by both the quad-graph and the BRs of the quad-system. The ones which we define in Chapters 5 and 7, namely context acyclic, RR, and restricted RR quad-systems, fall into type (i), as the properties of the BRs alone are used. With an analogy between a set of BRs and a set of ∀∃ rules, and between a quad-graph and a set of ∀∃ instances, the reader should note that such distinctions can also be made for the decidability notions in the realm of ∀∃ rule sets. Techniques such as Weak acyclicity [39], Joint acyclicity [60], and Acyclic graph of rule dependencies [6] belong to type (i), as these notions ignore the instance part, whereas techniques such as model faithful acyclicity [32] and model summarizing acyclicity [32] are of type (ii), as both the rules and the instance part are considered.
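The encoding used in the proof sketch of Theorem 7 can be illustrated by the following minimal sketch, which only constructs the quad-graph, the BRs, and the boolean CCQ from two grammars; it does not (and, by the theorem, cannot in general) decide the resulting entailment. The function names, the `?`-prefixed variable convention, and the sample grammars are assumptions made for illustration, and right-hand sides are assumed to be non-empty.

```python
from typing import List, Sequence, Set, Tuple

Quad = Tuple[str, str, str, str]  # (context, subject, predicate, object)
Rule = Tuple[List[Quad], List[Quad]]  # (body, head)


def encode_production(lhs: str, rhs: Sequence[str], ctx: str = "c") -> Rule:
    """Encode v -> w1...wn as c:(x1,w1,x2), ..., c:(xn,wn,xn+1) -> c:(x1,v,xn+1)."""
    if not rhs:
        raise ValueError("this sketch assumes non-empty right-hand sides")
    body = [(ctx, f"?x{i}", symbol, f"?x{i + 1}")
            for i, symbol in enumerate(rhs, start=1)]
    head = [(ctx, "?x1", lhs, f"?x{len(rhs) + 1}")]
    return body, head


def encode_terminal(terminal: str, ctx: str = "c") -> Rule:
    """Encode c:(x,rdf:type,C) -> exists y. c:(x,t,y), c:(y,rdf:type,C)."""
    body = [(ctx, "?x", "rdf:type", "C")]
    head = [(ctx, "?x", terminal, "?y"), (ctx, "?y", "rdf:type", "C")]
    return body, head


def encode_grammars(productions: Sequence[Tuple[str, Sequence[str]]],
                    terminals: Sequence[str],
                    ctx: str = "c") -> Tuple[Set[Quad], List[Rule]]:
    """Build the quad-graph and BRs for the productions of both grammars."""
    rules = [encode_production(lhs, rhs, ctx) for lhs, rhs in productions]
    rules += [encode_terminal(t, ctx) for t in terminals]
    quads = {(ctx, "a", "rdf:type", "C")}
    return quads, rules


def intersection_query(start1: str, start2: str, ctx: str = "c") -> List[Quad]:
    """The boolean CCQ: exists y. c:(a,S1,y) and c:(a,S2,y)."""
    return [(ctx, "a", start1, "?y"), (ctx, "a", start2, "?y")]


# Hypothetical grammars over terminals t1, t2 with start symbols S1 and S2.
productions = [("S1", ["t1", "V1"]), ("V1", ["t2"]), ("S2", ["t1", "t2"])]
quads, rules = encode_grammars(productions, ["t1", "t2"])
query = intersection_query("S1", "S2")
```

Intuitively, the terminal rules generate arbitrarily long paths labelled with terminals starting from `a`, and the production rules label a path with a grammar variable whenever the path spells a word derivable from it.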


Chapter 5
Context Acyclic Quad-Systems: Decidability via Acyclicity

In the previous chapter, we saw that the dChase of unrestricted quad-systems can be infinite and that query answering is undecidable, in general. In this chapter, we define a class of quad-systems for which query entailment is decidable. The class is also recognizable [7, 66], i.e. there exists an algorithm that decides, for a given quad-system, whether the quad-system is a member of the class or not. The class has the property that the dChase is finite for any member of the class, and hence algorithms based on forward chaining, for deciding query entailment, can straightforwardly be implemented. It should be noted that the technique we propose is reminiscent of the Weak acyclicity [39, 34] technique used in the realm of Datalog±. Before we give the description of our class, we first adapt and reformulate the Skolem variant of the chase given in Marnette [69] and Cuenca Grau et al. [32] to the quad-system setting. We call the reformulated Skolem version the Skolem dChase (abbreviated SdChase). The reader should note that the material of this chapter is taken from the conference papers [57] and [58].

For any BR $r$ of the form (4.1), the skolemization $sk(r)$ is the result of replacing each $y_i \in \{\vec{y}\}$ with a globally unique Skolem function $f_i^r$, s.t. $f_i^r : \mathbf{C}^{\|\vec{x}\|} \to \mathbf{B}_{sk}$. Intuitively, for every distinct vector $\vec{a}$ of constants, with $\|\vec{a}\| = \|\vec{x}\|$, $f_i^r(\vec{a})$ is a fresh blank node, whose node id is a hash of $\vec{a}$. Let $\vec{f}^r = \langle f_1^r, \ldots, f_{\|\vec{y}\|}^r \rangle$ be a vector of distinct Skolem functions; for any BR $r$ of the form (4.1), with slight abuse we write its skolemization $sk(r)$ as follows:

$c_1 : t_1(\vec{x}, \vec{z}), \ldots, c_n : t_n(\vec{x}, \vec{z}) \to c'_1 : t'_1(\vec{x}, \vec{f}^r), \ldots, c'_m : t'_m(\vec{x}, \vec{f}^r)$    (5.1)

Moreover, a skolemized BR $r$ of the form (5.1) can be replaced by the following equivalent set of formulas, whose symbol size is worst case quadratic w.r.t. $\|r\|$:

$\{c_1 : t_1(\vec{x}, \vec{z}), \ldots, c_n : t_n(\vec{x}, \vec{z}) \to c'_1 : t'_1(\vec{x}, \vec{f}^r),$
$\ldots,$
$c_1 : t_1(\vec{x}, \vec{z}), \ldots, c_n : t_n(\vec{x}, \vec{z}) \to c'_m : t'_m(\vec{x}, \vec{f}^r)\}$    (5.2)

Note that each BR in the above set has exactly one quad pattern, with optional function symbols, in the head. Also note that a BR without function symbols can be replaced with a set of BRs with single quad-pattern heads. Hence, w.l.o.g., we assume that any BR in a skolemized set $sk(R)$ of BRs is of the form (5.2). For any quad-graph $Q_{\mathcal{C}}$ and a skolemized BR $r$ of the form (5.2), the application of $r$ on $Q_{\mathcal{C}}$, denoted by $r(Q_{\mathcal{C}})$, is given as:

$r(Q_{\mathcal{C}}) = \bigcup_{\mu \in \mathbf{V} \to \mathbf{C}} \{ c'_1 : t'_1(\vec{x}, \vec{f}^r)[\mu] \mid c_1 : t_1(\vec{x}, \vec{z})[\mu] \in Q_{\mathcal{C}}, \ldots, c_n : t_n(\vec{x}, \vec{z})[\mu] \in Q_{\mathcal{C}} \}$

For any set of skolemized BRs $R$, the application of $R$ on $Q_{\mathcal{C}}$ is given by $R(Q_{\mathcal{C}}) = \bigcup_{r \in R} r(Q_{\mathcal{C}})$. For any quad-graph $Q_{\mathcal{C}}$, we define:

$lclosure(Q_{\mathcal{C}}) = \bigcup_{c \in \mathcal{C}} \{c : (s, p, o) \mid (s, p, o) \in lclosure(graph_{Q_{\mathcal{C}}}(c))\}$

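For any quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, the generating BRs $R_F$ are the BRs in $sk(R)$ with function symbols, and the non-generating BRs are the set $R_I = sk(R) \setminus R_F$. Let $SdChase_0(QS_{\mathcal{C}}) = lclosure(Q_{\mathcal{C}})$; for $i \in \mathbb{N}$,

$SdChase_{i+1}(QS_{\mathcal{C}}) = \begin{cases} lclosure(SdChase_i(QS_{\mathcal{C}}) \cup R_I(SdChase_i(QS_{\mathcal{C}}))), & \text{if } R_I(SdChase_i(QS_{\mathcal{C}})) \not\subseteq SdChase_i(QS_{\mathcal{C}}); \\ lclosure(SdChase_i(QS_{\mathcal{C}}) \cup R_F(SdChase_i(QS_{\mathcal{C}}))), & \text{otherwise.} \end{cases}$

The Skolem dChase of $QS_{\mathcal{C}}$, denoted $SdChase(QS_{\mathcal{C}})$, is given as:

$SdChase(QS_{\mathcal{C}}) = \bigcup_{i \in \mathbb{N}} SdChase_i(QS_{\mathcal{C}})$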

Intuitively, $SdChase_i(QS_{\mathcal{C}})$ can be thought of as the state of $SdChase(QS_{\mathcal{C}})$ at the end of iteration $i$. It can be noted that, if there exists $i$ s.t. $SdChase_i(QS_{\mathcal{C}}) = SdChase_{i+1}(QS_{\mathcal{C}})$, then $SdChase(QS_{\mathcal{C}}) = SdChase_i(QS_{\mathcal{C}})$. An iteration $i$ s.t. $SdChase_i(QS_{\mathcal{C}})$ is computed by the application of the set of (resp. non-)generating BRs $R_F$ (resp. $R_I$) on $SdChase_{i-1}(QS_{\mathcal{C}})$ is called a (resp. non-)generating iteration. A model $I^{\mathcal{C}}$ of a quad-system $QS_{\mathcal{C}}$ is called universal [35] iff for any model $I'^{\mathcal{C}}$ of $QS_{\mathcal{C}}$ there exists a homomorphism from $I^{\mathcal{C}}$ to $I'^{\mathcal{C}}$.

Theorem 1. For any consistent quad-system $QS_{\mathcal{C}}$, the following holds: (i) $SdChase(QS_{\mathcal{C}})$ is a universal model of $QS_{\mathcal{C}}$,$^1$ and (ii) for any boolean CCQ $CQ()$, $QS_{\mathcal{C}} \models CQ()$ iff there exists a map $\mu : \mathbf{V}(CQ) \to \mathbf{C}$ such that $\{CQ()\}[\mu] \subseteq SdChase(QS_{\mathcal{C}})$.

$^1$ Though $SdChase(QS_{\mathcal{C}})$ is not an interpretation in a strict model-theoretic sense, one can easily create the corresponding interpretation $I^{SdChase(QS_{\mathcal{C}})} = \{I^c = \langle \Delta^c, \cdot^c \rangle\}_{c \in \mathcal{C}}$, s.t. for every $c \in \mathcal{C}$, $\Delta^c$ is equal to the set of constants in $graph_{SdChase(QS_{\mathcal{C}})}(c)$, and $\cdot^c$ is s.t. $(s, p, o) \in graph_{SdChase(QS_{\mathcal{C}})}(c)$ iff $(s^c, o^c) \in p^c$.

An analog of the above theorem for DLs and databases is stated and proved in [27]. Since the proof in [27] can easily be adapted to our case, we refer the reader to [27] for the proof. We call the sequence $SdChase_0(QS_{\mathcal{C}}), SdChase_1(QS_{\mathcal{C}}), \ldots$ the Skolem dChase sequence of $QS_{\mathcal{C}}$. The following lemma shows that in a dChase sequence of a quad-system, the result of a single generating iteration and any number of subsequent non-generating iterations causes only an exponential blow-up in size.

Lemma 2. For a quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, the following holds: (i) if $i \in \mathbb{N}$ is a generating iteration, then $\|SdChase_i(QS_{\mathcal{C}})\| = O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$; (ii) suppose $i \in \mathbb{N}$ is a generating iteration, and for any $j \ge 1$, $i+1, \ldots, i+j$ are non-generating iterations, then $\|SdChase_{i+j}(QS_{\mathcal{C}})\| = O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$; (iii) for any iteration $k$, $SdChase_k(QS_{\mathcal{C}})$ can be computed in time $O(\|SdChase_{k-1}(QS_{\mathcal{C}})\|^{\|R\|})$.

Proof. (Sketch) (i) $R$ can be applied on $SdChase_{i-1}(QS_{\mathcal{C}})$ by grounding $R$ to the set of constants in $SdChase_{i-1}(QS_{\mathcal{C}})$. The number of such groundings is of the order $O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$, and hence $\|R(SdChase_{i-1}(QS_{\mathcal{C}}))\| = O(\|R\| * \|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$. Since lclosure only increases the size polynomially, $\|SdChase_i(QS_{\mathcal{C}})\| = O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$.

(ii) From (i) we know that $\|R(SdChase_{i-1}(QS_{\mathcal{C}}))\| = O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$. Since no new constant is introduced in any subsequent non-generating iteration, and since any quad contains only four constants, the set of constants in any subsequent dChase iteration is $O(4 * \|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$. Since only these many constants can appear in positions $c$, $s$, $p$, $o$ of any quad generated in the subsequent iterations, the size of $SdChase_{i+j}(QS_{\mathcal{C}})$ can only increase polynomially, which means that $\|SdChase_{i+j}(QS_{\mathcal{C}})\| = O(\|SdChase_{i-1}(QS_{\mathcal{C}})\|^{\|R\|})$.

(iii) Any dChase iteration $k$ involves the following two operations: (a) lclosure(), and (b) computing $R(SdChase_{k-1}(QS_{\mathcal{C}}))$. (a) can be done in PTIME w.r.t. its input. (b) can be done in the following manner: ground $R$ to the set of constants in $SdChase_{k-1}(QS_{\mathcal{C}})$; then for each grounding $g$, if $body(g) \subseteq SdChase_{k-1}(QS_{\mathcal{C}})$, then add $head(g)$ to $R(SdChase_{k-1}(QS_{\mathcal{C}}))$. Since the number of such groundings is of the order $O(\|SdChase_{k-1}(QS_{\mathcal{C}})\|^{\|R\|})$, and checking whether each grounding is contained in $SdChase_{k-1}(QS_{\mathcal{C}})$ can be done in time polynomial in $\|SdChase_{k-1}(QS_{\mathcal{C}})\|$, the time taken for (b) is $O(\|SdChase_{k-1}(QS_{\mathcal{C}})\|^{\|R\|})$. Consequently, any iteration $k$ can be done in time $O(\|SdChase_{k-1}(QS_{\mathcal{C}})\|^{\|R\|})$.
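The iteration scheme of the Skolem dChase can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure analysed above: lclosure() is taken to be the identity, rules are matched directly instead of being grounded, Skolem blank nodes are given readable names of the form `_:f<rule>_<var>(<args>)` instead of hashes, and the number of iterations is bounded by a hypothetical `max_iterations` parameter so that the function always terminates.

```python
from typing import Dict, Iterator, Optional, Sequence, Set, Tuple

Quad = Tuple[str, str, str, str]  # (context, subject, predicate, object)
Rule = Tuple[Tuple[Quad, ...], Tuple[Quad, ...]]  # (body, head)


def is_var(term: str) -> bool:
    return term.startswith("?")  # assumed convention for variables


def variables(patterns: Sequence[Quad]) -> Set[str]:
    return {t for q in patterns for t in q if is_var(t)}


def bindings(body: Sequence[Quad], graph: Set[Quad],
             binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Enumerate substitutions mapping every body pattern into `graph`."""
    binding = {} if binding is None else binding
    if not body:
        yield binding
        return
    for quad in graph:
        extended = dict(binding)
        if all(extended.setdefault(p, v) == v if is_var(p) else p == v
               for p, v in zip(body[0], quad)):
            yield from bindings(body[1:], graph, extended)


def apply_rule(rule_id: int, rule: Rule, graph: Set[Quad]) -> Set[Quad]:
    """Apply the skolemized rule: existential variables become Skolem terms."""
    body, head = rule
    frontier = sorted(variables(body) & variables(head))
    fresh = variables(head) - variables(body)
    produced: Set[Quad] = set()
    for binding in bindings(body, graph):
        args = ",".join(binding[x] for x in frontier)
        full = dict(binding)
        for y in fresh:
            full[y] = f"_:f{rule_id}_{y[1:]}({args})"
        produced.update(tuple(full.get(t, t) for t in quad) for quad in head)
    return produced


def sd_chase(graph: Set[Quad], rules: Sequence[Rule], max_iterations: int) -> Set[Quad]:
    indexed = list(enumerate(rules))
    generating = [(i, r) for i, r in indexed if variables(r[1]) - variables(r[0])]
    non_generating = [(i, r) for i, r in indexed if not variables(r[1]) - variables(r[0])]
    state = set(graph)
    for _ in range(max_iterations):
        # Non-generating rules first; generating rules only when they add nothing.
        new = set().union(*(apply_rule(i, r, state) for i, r in non_generating))
        if new <= state:
            new = set().union(*(apply_rule(i, r, state) for i, r in generating))
        if new <= state:
            break  # fix-point reached
        state |= new
    return state


# The quad-system from the proof of Proposition 6.
rule = ((("c", "?x", "rdf:type", "C"),),
        (("c", "?x", "P", "?y"), ("c", "?y", "rdf:type", "C")))
graph = {("c", "a", "rdf:type", "C")}
sizes = [len(sd_chase(graph, [rule], k)) for k in range(4)]
```

On this quad-system every generating iteration adds two new quads, so `sizes` grows with the iteration bound and no fix-point is reached.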


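[Figure 5.1 (image not reproduced): the BR $c_1 : t_1(\vec{x}, \vec{z}), c_2 : t_2(\vec{x}, \vec{z}) \rightarrow \exists \vec{y}\ c_3 : t_3(\vec{x}, \vec{y}), c_4 : t_4(\vec{x}, \vec{y})$ drawn over the contexts $c_1$, $c_2$, $c_3$, $c_4$.]

Figure 5.1: Bridge rule: A mechanism for specifying propagation of knowledge between contexts.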

5.1 Context Acyclic Quad-Systems: A Decidable Class

Before we introduce our subclass of unrestricted quad-systems, we introduce some necessary notation. Consider a BR $r$ of the form: $c_1 : t_1(\vec{x}, \vec{z}), c_2 : t_2(\vec{x}, \vec{z}) \rightarrow \exists \vec{y}\ c_3 : t_3(\vec{x}, \vec{y}), c_4 : t_4(\vec{x}, \vec{y})$. Such a rule triggers propagation of knowledge in a quad-system, specifically of triples from the source contexts $c_1, c_2$ to the target contexts $c_3, c_4$. As shown in Fig. 5.1, we can view a BR as a propagation rule across distinct compartments of knowledge, divided as contexts. For any BR of the form (4.1), each context in the set $\{c'_1, \ldots, c'_m\}$ is said to depend on the set of contexts $\{c_1, \ldots, c_n\}$. In a quad-system $QS_C = \langle Q_C, R \rangle$, for any $r \in R$ of the form (4.1), any context whose identifier is in the set $\{c \mid c : (s, p, o) \in head(r),\ s$ or $p$ or $o$ is an existentially quantified variable$\}$ is called a triple generating context (TGC).

One can analyze the set of BRs in a quad-system $QS_C$ using a context dependency graph, which is a directed graph whose nodes are the context identifiers in $C$, s.t. the nodes corresponding to TGCs are marked with a $*$, and whose edges are constructed as follows: for each BR $r$ of the form (4.1), there exists an edge from each $c_i$ to each $c'_j \neq c_i$, for $i = 1, \ldots, n$, $j = 1, \ldots, m$, and for any $c \in \{c_1, \ldots, c_n\} \cap \{c'_1, \ldots, c'_m\}$ there is an edge from $c$ to $c$ iff there exists $c : (s, p, o) \in head(r)$ s.t. $s$ or $p$ or $o$ is an existentially quantified variable. A quad-system is said to be context acyclic iff its context dependency graph does not contain cycles involving TGCs.
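The construction of the context dependency graph and the context acyclicity test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the formal development: a BR is encoded as a pair (set of body contexts, list of (head context, has-existential-variable) pairs), and the names `dependency_graph`, `reaches` and `is_context_acyclic` are hypothetical.

```python
def dependency_graph(rules):
    """Return (edges, tgcs) for BRs given as (body_contexts, head) pairs,
    where head is a list of (context, has_existential_variable) pairs."""
    edges, tgcs = set(), set()
    for body_ctxs, head in rules:
        generating = {ctx for ctx, has_existential in head if has_existential}
        tgcs |= generating
        for src in body_ctxs:
            for dst, _ in head:
                # A self-loop is added only if the rule generates in that context.
                if src != dst or src in generating:
                    edges.add((src, dst))
    return edges, tgcs


def reaches(edges, start, goal):
    """True iff there is a path with at least one edge from start to goal."""
    succs = {}
    for src, dst in edges:
        succs.setdefault(src, set()).add(dst)
    seen, stack = set(), list(succs.get(start, ()))
    while stack:
        cur = stack.pop()
        if cur == goal:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(succs.get(cur, ()))
    return False


def is_context_acyclic(rules):
    """No cycle of the dependency graph may pass through a TGC."""
    edges, tgcs = dependency_graph(rules)
    return not any(reaches(edges, t, t) for t in tgcs)


# Hypothetical rule set: the first BR puts an existential variable into c2,
# the second feeds c2 back into c1, the third feeds c3 back into c1.
rules = [
    ({"c1"}, [("c2", True), ("c3", False)]),
    ({"c2"}, [("c1", False)]),
    ({"c3"}, [("c1", False)]),
]
print(is_context_acyclic(rules))                  # cycle c1 -> c2 -> c1 contains the TGC c2
print(is_context_acyclic([rules[0], rules[2]]))   # only the cycle c1 -> c3 -> c1 remains
```

Note that the check inspects only the BRs and never the quad-graph, which is why it can be run before any chase computation.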

Example 3. Consider a quad-system whose set of BRs $R$ is:

$c_1 : (x_1, x_2, U_1) \rightarrow \exists y_1\ c_2 : (x_1, x_2, y_1), c_3 : (x_2, \text{rdf:type}, \text{rdf:Property})$ (5.3)

$c_2 : (x_1, x_2, z_1) \rightarrow c_1 : (x_1, x_2, U_1)$ (5.4)

$c_3 : (x_1, x_2, x_3) \rightarrow c_1 : (x_1, x_2, x_3)$

where $U_1$ is a URI. The dependency graph of the quad-system is shown in Fig. 5.2. Note that the node corresponding to the triple generating context $c_2$ is marked with a '∗' symbol. Since the cycle $(c_1, c_2, c_1)$ in the quad-system contains $c_2$, which is a TGC, the quad-system is not context acyclic.

[Figure 5.2 (image not reproduced): nodes $c_1$, $c_2$ ∗, $c_3$.]

Figure 5.2: Context dependency graph

In a context acyclic quad-system $QS_C$, since there exists no cyclic path through any TGC node in the context dependency graph, there exists a set of TGCs $C' \subseteq C$ s.t. for any $c \in C'$, there exists no incoming path$^2$ from a TGC to $c$. We call such TGCs level-1 TGCs. In other words, a TGC $c$ is a level-1 TGC if, for any $c' \in C$, the existence of an incoming path from $c'$ to $c$ implies that $c'$ is not a TGC. For $l \geq 1$, a level-$(l+1)$ TGC $c$ is a TGC that has an incoming path from a level-$l$ TGC, and s.t. every incoming path from a level-$l'$ TGC to $c$ has $l' \leq l$. Extending the notion of level also to the non-TGCs, we say that any non-TGC that does not have any incoming path from a TGC is at level-0; we say that any non-TGC $c \in C$ is at level-$l$ if there exists an incoming path from a level-$l$ TGC to $c$, and every incoming path from a level-$l'$ TGC to $c$ has $l' \leq l$. Hence, the set of contexts in a context acyclic quad-system can be partitioned using the above notion of levels.
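The level assignment described above can be computed directly from the dependency graph. The following Python sketch is an illustration only: the function names `ancestors` and `context_levels` are hypothetical, the input is assumed to be context acyclic (otherwise the recursion would not terminate), and the small graph at the bottom is an assumed example in which two non-TGCs feed a TGC that in turn feeds a second TGC.

```python
def ancestors(edges, node):
    """All nodes that have a path of at least one edge to `node`."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
    seen, stack = set(), list(preds.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(preds.get(cur, ()))
    return seen


def context_levels(contexts, edges, tgcs):
    """Map every context to its level; assumes no cycle passes through a TGC."""
    memo = {}

    def tgc_level(t):
        # A TGC sits one level above the highest TGC that can reach it.
        if t not in memo:
            above = [tgc_level(a) for a in ancestors(edges, t) if a in tgcs]
            memo[t] = 1 + max(above, default=0)
        return memo[t]

    levels = {}
    for c in contexts:
        if c in tgcs:
            levels[c] = tgc_level(c)
        else:
            # A non-TGC inherits the highest level among the TGCs reaching it.
            reaching = [tgc_level(a) for a in ancestors(edges, c) if a in tgcs]
            levels[c] = max(reaching, default=0)
    return levels


# Assumed example graph: c2 -> c1, c4 -> c1, c1 -> c3, with TGCs c1 and c3.
example_edges = [("c2", "c1"), ("c4", "c1"), ("c1", "c3")]
example_levels = context_levels(["c1", "c2", "c3", "c4"], example_edges, {"c1", "c3"})
print(example_levels)
```

Cycles that involve only non-TGCs are harmless here, because `ancestors` is a plain reachability search and the recursion in `tgc_level` only follows TGCs.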

Definition 4. For a quad-system $QS_C$, a context $c \in C$ is said to be saturated in an iteration $i$, iff for any quad of the form $c : (s, p, o)$, $c : (s, p, o) \in SdChase(QS_C)$ implies $c : (s, p, o) \in SdChase_i(QS_C)$. Intuitively, context $c$ is saturated in the SdChase iteration $i$, if no new quad of the form $c : (s, p, o)$ will be generated in any $SdChase_k(QS_C)$, for any $k > i$.

$^2$ We assume that paths have at least one edge.

5.2 Context Acyclic Quad-Systems: Computational Properties

In this section, we describe some essential computational properties of the class of quad-systems we defined in the previous section. The following lemma gives the relation between the saturation of a context and the required number of SdChase iterations, for a context acyclic quad-system.

Lemma 5. For any context acyclic quad-system, the following holds: (i) any level-0 context is saturated before the first generating iteration, (ii) any level-1 TGC is saturated after the first generating iteration, (iii) any level-$k$ context is saturated before the $(k+1)$th generating iteration.

Proof. Let $QS_C = \langle Q_C, R \rangle$ be the quad-system, whose first generating iteration is $i$.

(i) For any level-0 context $c$, any BR $r \in R$, and any quad-pattern of the form $c : (s, p, o)$, if $c : (s, p, o) \in head(r)$, then any $c'$ s.t. $c' : (s', p', o')$ occurs in $body(r)$ is a level-0 context, and $r$ is a non-generating BR. Also, since $c'$ is a level-0 context, the same applies to $c'$. Hence, only non-generating BRs can bring triples to any level-0 context. At the end of iteration $i - 1$, $SdChase_{i-1}(QS_C)$ is closed w.r.t. the set of non-generating BRs (otherwise, by construction of SdChase, $i$ would not be a generating iteration). This implies that $c$ is saturated before the first generating iteration $i$.

(ii) For any level-1 TGC $c$, any BR $r \in R$, and any quad-pattern $c : (s, p, o)$, if $c : (s, p, o) \in head(r)$, then any $c'$ s.t. $c' : (s', p', o')$ occurs in $body(r)$ is a level-0 context (otherwise the level of $c$ would be greater than 1). This means that the only contexts from which triples get propagated to $c$ are level-0 contexts. From (i) we know that all the level-0 contexts are saturated before the $i$th iteration, and since during the $i$th iteration $R_F$ is applied followed by the lclosure() operation ($R_I$ need not be applied, since $SdChase_{i-1}(QS_C)$ is closed w.r.t. $R_I$), $c$ is saturated after iteration $i$, the first generating iteration.

(iii) can be obtained by generalizing (i) and (ii), and from the fact that any level-$k$ context can only have incoming paths from contexts whose levels are less than or equal to $k$.

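[Figure 5.3 (image not reproduced), panels (a) and (b): the same dependency graph over the contexts $c_1$ ∗, $c_2$, $c_3$ ∗, $c_4$, where '..' marks parts of the graph that are left out.]

Figure 5.3: Saturation of contexts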

Example 6. Consider the dependency graph in Fig. 5.3a, where '..' indicates a part of the graph that is not under the scope of our discussion. The TGC nodes $c_1$ and $c_3$ are marked with a ∗. It can be seen that both $c_2$ and $c_4$ are level-0 contexts, since they do not have any incoming paths from TGCs. Since the only incoming paths to context $c_1$ are from $c_2$ and $c_4$, which are not TGCs, $c_1$ is a level-1 TGC. Context $c_3$ is a level-2 TGC, since it has an incoming path from the level-1 TGC $c_1$, and has no incoming path from a TGC whose level is greater than 1. Since the level-0 contexts only have incoming paths from level-0 contexts and only appear in the head part of non-generating BRs, all the level-0 contexts become saturated before the first generating iteration, as the set of non-generating BRs $R_I$ has been exhaustively applied. This situation is reflected in Fig. 5.3b, where the saturated nodes are shaded gray. Note that after the first and second generating iterations $c_1$ and $c_3$ also become saturated, respectively.

The following lemma shows that for context acyclic quad-systems, there exists a finite bound on the size and computation time of its SdChase.

Lemma 7. For any context acyclic quad-system $QS_C = \langle Q_C, R \rangle$, the following holds: (i) the number of SdChase iterations is finite, (ii) the size of the SdChase is $\|SdChase(QS_C)\| = O(2^{2^{\|QS_C\|}})$, (iii) computing $SdChase(QS_C)$ is in 2EXPTIME, (iv) if $R$ and the set of schema triples in $Q_C$ are fixed, then $\|SdChase(QS_C)\|$ is a polynomial in $\|QS_C\|$, and computing $SdChase(QS_C)$ is in PTIME.

Proof. (i) Since $QS_C$ is context acyclic, all the contexts can be partitioned according to their levels. Also, the number of levels $k$ is s.t. $k \leq |C|$. Hence, applying Lemma 5, before the $(k+1)$th generating iteration all the contexts become saturated, and the $(k+1)$th generating iteration does not produce any new quads, terminating the SdChase computation process.

(ii) In the SdChase computation process, since by Lemma 2 any generating iteration and a sequence of non-generating iterations can only increase the SdChase size exponentially in $\|R\|$, the size of the SdChase before the $(k+1)$th generating iteration is $O(\|dChase_0(QS_C)\|^{\|R\|^k})$, which can be written as $O(\|QS_C\|^{\|R\|^k})$ (†). As seen in (i), there can only be $|C|$ generating iterations, and a sequence of non-generating iterations. Hence, applying $k = |C|$ to (†), and taking into account the fact that $|C| \leq \|QS_C\|$, the size of the SdChase is $\|SdChase(QS_C)\| = O(2^{2^{\|QS_C\|}})$.

(iii) Since in any SdChase iteration except the final one at least one new quad should be produced, and the final SdChase can have at most $O(2^{2^{\|QS_C\|}})$ quads (by (ii)), the total number of iterations is bounded by $O(2^{2^{\|QS_C\|}})$ (†). From Lemma 2, we know that for any iteration $i$, computing $SdChase_i(QS_C)$ is of the order $O(\|SdChase_{i-1}(QS_C)\|^{\|R\|})$. Since $\|SdChase_{i-1}(QS_C)\|$ can at most be $O(2^{2^{\|QS_C\|}})$, computing $SdChase_i(QS_C)$ is of the order $O(2^{\|R\| * 2^{\|QS_C\|}})$. Also, since $\|R\| \leq \|QS_C\|$, any iteration requires $O(2^{2^{\|QS_C\|}})$ time (‡). From (†) and (‡), we can conclude that the time required for computing SdChase is in 2EXPTIME.

(iv) In (ii) we saw that the size of the SdChase before the $(k+1)$th generating iteration is given by $O(\|QS_C\|^{\|R\|^k})$ (⋄). By hypothesis, $\|R\|$ is a constant, and so are the size of the dependency graph and the number of levels in it. Hence, the expression $\|R\|^k$ in (⋄) amounts to a constant $z$, and $\|SdChase(QS_C)\| = O(\|QS_C\|^z)$. Hence, the size of $SdChase(QS_C)$ is a polynomial in $\|QS_C\|$. Also, since in any SdChase iteration except the final one at least one quad should be produced, and the final SdChase can have at most $O(\|QS_C\|^z)$ quads, the total number of iterations is bounded by $O(\|QS_C\|^z)$ (†). Also from Lemma 2, we know that in any SdChase iteration $i$, computing $SdChase_i(QS_C)$ involves two steps: (a) computing $R(SdChase_{i-1}(QS_C))$, and (b) computing lclosure(), which can be done in PTIME in the size of its input. Since computing $R(SdChase_{i-1}(QS_C))$ is of the order $O(\|SdChase_{i-1}(QS_C)\|^{\|R\|})$, where $\|R\|$ is a constant and $\|SdChase_{i-1}(QS_C)\|$ is a polynomial in $\|QS_C\|$, each iteration can be done in time polynomial in $\|QS_C\|$ (‡). From (†) and (‡), it can be concluded that SdChase can be computed in PTIME.

Lemma 8. For any context acyclic quad-system, the following holds: (i) the data complexity of CCQ entailment is in PTIME, (ii) the combined complexity of CCQ entailment is in 2EXPTIME.

Proof. For a context acyclic quad-system $QS_C = \langle Q_C, R \rangle$, since $SdChase(QS_C)$ is finite, a boolean CCQ $CQ()$ can naively be evaluated by binding the variables in $CQ()$ to the constants in the chase, and then checking if any of these groundings is contained in $SdChase(QS_C)$. The number of such groundings can at most be $\|SdChase(QS_C)\|^{\|CQ()\|}$ (†).

(i) For data complexity, the size of the BRs $\|R\|$, the set of schema triples, and $\|CQ()\|$ are fixed to constants. From Lemma 7 (iv), we know that under these settings the SdChase can be computed in PTIME and is polynomial in the size of $QS_C$. Since $\|CQ()\|$ is fixed to a constant, from (†), binding the constants in $SdChase(QS_C)$ to the variables of $CQ()$ still gives a number of bindings that is worst case polynomial in the size of $QS_C$. Since membership of these bindings can be checked in the polynomially sized SdChase in PTIME, the time required for CCQ evaluation is in PTIME.

(ii) Since in this case $\|SdChase(QS_C)\| = O(2^{2^{\|QS_C\|}})$ (‡), from (†) and (‡), binding the constants in $SdChase(QS_C)$ to the variables in $CQ()$ amounts to $O(2^{\|CQ()\| * 2^{\|QS_C\|}})$ bindings. Since the size of SdChase is double exponential in $\|QS_C\|$, checking the membership of each of these bindings can be done in 2EXPTIME. Hence, the combined complexity is in 2EXPTIME.

Theorem 9. For any context acyclic quad-system, the following holds: (i) the data complexity of CCQ entailment is PTIME-complete, (ii) the combined complexity of CCQ entailment is 2EXPTIME-complete.

Proof. (i) (Membership) See Lemma 8 for the membership in PTIME. (Hardness) Follows from the PTIME-hardness of the data complexity of CCQ entailment for Range Restricted quad-systems (Theorem 3 of Chapter 7), which are contained in context acyclic quad-systems. (ii) (Membership) See Lemma 8. (Hardness) See the following heading.

2EXPTIME-Hardness of CCQ Entailment

In this subsection, we show that the combined complexity of the CCQ EP for context acyclic quad-systems is 2EXPTIME-hard. We show this by reduction of the word problem of a 2EXPTIME deterministic Turing machine (DTM) to the CCQ EP. A DTM $M$ is a tuple $M = \langle Q, \Sigma, \Delta, q_0, q_A \rangle$, where

• $Q$ is a set of states,
• $\Sigma$ is a finite alphabet that includes the blank symbol $\sqcup$,
• $\Delta : (Q \times \Sigma) \rightarrow (Q \times \Sigma \times \{+1, -1\})$ is the transition function,
• $q_0 \in Q$ is the initial state,
• $q_A \in Q$ is the accepting state.

W.l.o.g. we assume that there exists exactly one accepting state, which is also a halting state. A configuration is a word $\vec{\alpha} \in \Sigma^* Q \Sigma^*$. A configuration $\vec{\alpha}_2$ is a successor of the configuration $\vec{\alpha}_1$, iff one of the following holds:

1. $\vec{\alpha}_1 = \vec{w}_l\, q\, \sigma \sigma_r\, \vec{w}_r$ and $\vec{\alpha}_2 = \vec{w}_l\, \sigma' q' \sigma_r\, \vec{w}_r$, if $\Delta(q, \sigma) = (q', \sigma', +1)$, or
2. $\vec{\alpha}_1 = \vec{w}_l\, q\, \sigma$ and $\vec{\alpha}_2 = \vec{w}_l\, \sigma' q'$, if $\Delta(q, \sigma) = (q', \sigma', +1)$, or
3. $\vec{\alpha}_1 = \vec{w}_l\, \sigma_l\, q\, \sigma\, \vec{w}_r$ and $\vec{\alpha}_2 = \vec{w}_l\, q' \sigma_l \sigma'\, \vec{w}_r$, if $\Delta(q, \sigma) = (q', \sigma', -1)$,

where $q, q' \in Q$, $\sigma, \sigma', \sigma_l, \sigma_r \in \Sigma$, and $\vec{w}_l, \vec{w}_r \in \Sigma^*$. Since the number of configurations can at most be doubly exponential in the size of the input string, the number of tape cells traversed by the DTM tape head is also bounded double exponentially. A configuration $\vec{c} = \vec{w}_l\, q\, \vec{w}_r$ is an accepting configuration iff $q = q_A$. A language $L \subseteq \Sigma^*$ is accepted by a 2EXPTIME bounded DTM $M$, iff for every $\vec{w} \in L$, $M$ accepts $\vec{w}$ in time $O(2^{2^{\|\vec{w}\|}})$.

Simulating DTMs using Context Acyclic Quad-Systems

Consider a DTM $M = \langle Q, \Sigma, \Delta, q_0, q_A \rangle$, and a string $\vec{w}$, with $\|\vec{w}\| = m$. Suppose that $M$ terminates in $2^{2^n}$ time, where $n = m^k$ and $k$ is a constant. In order to simulate $M$, we construct a quad-system $QS_C^M = \langle Q_C^M, R \rangle$, where $C = \{c_0, c_1, \ldots, c_n\}$, whose various elements represent the constructs of $M$. We follow the technique in works such as [23, 60] to iteratively generate a doubly exponential number of objects that represent the cells of the tape of the DTM. Let $Q_C^M$ be initialized with the following quads:

$c_0 : (k_0, \text{rdf:type}, R)$, $c_0 : (k_1, \text{rdf:type}, R)$, $c_0 : (k_0, \text{rdf:type}, min_0)$, $c_0 : (k_1, \text{rdf:type}, max_0)$, $c_0 : (k_0, succ_0, k_1)$

Now for each pair of elements of type $R$ in $c_i$, a Skolem blank-node is generated in $c_{i+1}$, and hence the number of elements follows the recurrence relation $r(m + 1) = [r(m)]^2$, with seed $r(0) = 2$, which after $n$ iterations yields $2^{2^n}$. In this way, a doubly exponentially long chain of elements is created in $c_n$ using the following set of rules:

$c_i : (x_0, \text{rdf:type}, R), c_i : (x_1, \text{rdf:type}, R) \rightarrow \exists y\ c_{i+1} : (x_0, x_1, y), c_{i+1} : (y, \text{rdf:type}, R)$

The combination of the minimal element with the minimal element (elements of type $min_i$) in $c_i$ creates the minimal element in $c_{i+1}$, and similarly the combination of the maximal element with the maximal element (elements of type $max_i$) in $c_i$ creates the maximal element of $c_{i+1}$:

$c_{i+1} : (x_0, x_0, x_1), c_i : (x_0, \text{rdf:type}, min_i) \rightarrow c_{i+1} : (x_1, \text{rdf:type}, min_{i+1})$

$c_{i+1} : (x_0, x_0, x_1), c_i : (x_0, \text{rdf:type}, max_i) \rightarrow c_{i+1} : (x_1, \text{rdf:type}, max_{i+1})$

The successor relation $succ_{i+1}$ is created in $c_{i+1}$ by the following set of rules, using the well-known integer counting technique:

$c_i : (x_1, succ_i, x_2), c_{i+1} : (x_0, x_1, x_3), c_{i+1} : (x_0, x_2, x_4) \rightarrow c_{i+1} : (x_3, succ_{i+1}, x_4)$

$c_i : (x_1, succ_i, x_2), c_{i+1} : (x_1, x_3, x_5), c_{i+1} : (x_2, x_4, x_6), c_i : (x_3, \text{rdf:type}, max_i), c_i : (x_4, \text{rdf:type}, min_i) \rightarrow c_{i+1} : (x_5, succ_{i+1}, x_6)$

Each of the above rules is instantiated for $0 \leq i < n$, and in this way, after $n$ generating SdChase iterations, $c_n$ has a doubly exponential number of elements of type $R$ that are ordered linearly using the relation $succ_n$. By virtue of the first rule below, the objects representing the cells of the DTM are linearly ordered by the relation succ. Also, the transitive closure of succ is defined as the relation succt:

$c_n : (x_0, succ_n, x_1) \rightarrow c_n : (x_0, succ, x_1)$

$c_n : (x_0, succ, x_1) \rightarrow c_n : (x_0, succt, x_1)$

$c_n : (x_0, succt, x_1), c_n : (x_1, succt, x_2) \rightarrow c_n : (x_0, succt, x_2)$

Using a similar construction, we can also create a linearly ordered chain of a doubly exponential number of objects in $c_n$ that represent configurations of $M$, whose minimal element is of type conInit, with conSucc as the linear order relation. The various triple patterns that are used to encode the possible configurations, runs, and their relations in $M$ are:

$(x_0, head, x_1)$ denotes the fact that in configuration $x_0$, the head of the DTM is at cell $x_1$.
$(x_0, state, x_1)$ denotes the fact that in configuration $x_0$, the DTM is in state $x_1$.
$(x_0, \sigma, x_1)$, where $\sigma \in \Sigma$, denotes the fact that in configuration $x_0$, the cell $x_1$ contains $\sigma$.
$(x_0, succ, x_1)$ denotes the linear order between cells of the tape.
$(x_0, succt, x_1)$ denotes the transitive closure of succ.
$(x_0, conSucc, x_1)$ denotes the fact that $x_1$ is a successor configuration of $x_0$.
$(x_0, \text{rdf:type}, Accept)$ denotes the fact that the configuration $x_0$ is an accepting configuration.
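The counting construction for the tape cells can be checked on small values of $n$ with a short Python sketch. It is an illustration only and does not run the chase: each Skolem blank node generated for a pair $(x_0, x_1)$ is modelled by the Python tuple `(x0, x1)`, and the names `build_chain` and `walk` are hypothetical. The two loops that fill `new_succ` mirror the two rules for $succ_{i+1}$ (advance the second component, or carry from the maximal to the minimal element).

```python
def build_chain(n):
    """Return (elements, succ, minimum, maximum) for the elements of type R in c_n."""
    elements, succ = ["k0", "k1"], {"k0": "k1"}
    lo, hi = "k0", "k1"
    for _ in range(n):
        new_succ = {}
        # First succ rule: (x0, x1) is followed by (x0, x2) whenever x1 succ_i x2.
        for a in elements:
            for b, b_next in succ.items():
                new_succ[(a, b)] = (a, b_next)
        # Second succ rule (carry): (x1, max_i) is followed by (x2, min_i) whenever x1 succ_i x2.
        for a, a_next in succ.items():
            new_succ[(a, hi)] = (a_next, lo)
        # One new element per ordered pair of elements of the previous context.
        elements = [(a, b) for a in elements for b in elements]
        succ, lo, hi = new_succ, (lo, lo), (hi, hi)
    return elements, succ, lo, hi


def walk(succ, start):
    """Follow the successor relation from `start` until it ends."""
    order = [start]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order


for n in range(4):
    elements, succ, lo, hi = build_chain(n)
    order = walk(succ, lo)
    # Number of elements, and whether the chain from the minimum covers all of them.
    print(n, len(elements), len(order) == len(elements), order[-1] == hi)
```

For each $n$ the walk starting at the minimal element visits every element exactly once and ends at the maximal element, which is the linear order the reduction relies on.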

Since in our construction each $\sigma \in \Sigma$ is represented as a relation, we can enforce that no two distinct symbols $\sigma \neq \sigma'$ are on the same cell in a given configuration using the following axiom:

$c_n : (z_1, \sigma, z_2), c_n : (z_1, \sigma', z_2) \rightarrow$

for each $\sigma \neq \sigma' \in \Sigma$. Note that the above BR has an empty head, and is equivalent to asserting the negation of its body.

Initialization

Suppose the initial configuration is $q_0 \vec{w}$, where $\vec{w} = \sigma_0 \ldots \sigma_{n-1}$. We encode this in our quad-system $QS_C^M$ using the following BRs:

$c_n : (x_0, \text{rdf:type}, conInit), c_n : (x_1, \text{rdf:type}, min_n) \rightarrow c_n : (x_0, head, x_1), c_n : (x_0, state, q_0)$

$c_n : (x_0, \text{rdf:type}, min_n) \wedge \bigwedge_{i=0}^{n-1} c_n : (x_i, succ, x_{i+1}) \wedge c_n : (x_j, \text{rdf:type}, conInit) \rightarrow \bigwedge_{i=0}^{n-1} c_n : (x_j, \sigma_i, x_i) \wedge c_n : (x_j, \sqcup, x_n)$

$c_n : (x_j, \text{rdf:type}, conInit), c_n : (x_j, \sqcup, x_0), c_n : (x_0, succt, x_1) \rightarrow c_n : (x_j, \sqcup, x_1)$

The last BR copies the blank symbol $\sqcup$ to every succeeding cell in the initial configuration.

Transitions

For every left transition $\Delta(q, \sigma) = (q_j, \sigma', -1)$, we add the following BR:

$c_n : (x_0, head, x_i), c_n : (x_0, \sigma, x_i), c_n : (x_0, state, q), c_n : (x_j, succ, x_i), c_n : (x_0, conSucc, x_1) \rightarrow c_n : (x_1, head, x_j), c_n : (x_1, \sigma', x_i), c_n : (x_1, state, q_j)$

For every right transition $\Delta(q, \sigma) = (q_j, \sigma', +1)$, we add the following BR:

$c_n : (x_0, head, x_i), c_n : (x_0, \sigma, x_i), c_n : (x_0, state, q), c_n : (x_i, succ, x_j), c_n : (x_0, conSucc, x_1) \rightarrow c_n : (x_1, head, x_j), c_n : (x_1, \sigma', x_i), c_n : (x_1, state, q_j)$

Inertia

If in any configuration the head is at cell $i$ of the tape, then in every successor configuration the elements in the cells preceding and following $i$ are retained. The following two BRs ensure this:

$c_n : (x_0, head, x_i), c_n : (x_0, conSucc, x_1), c_n : (x_j, succt, x_i), c_n : (x_0, \sigma, x_j) \rightarrow c_n : (x_1, \sigma, x_j)$

$c_n : (x_0, head, x_i), c_n : (x_0, conSucc, x_1), c_n : (x_i, succt, x_j), c_n : (x_0, \sigma, x_j) \rightarrow c_n : (x_1, \sigma, x_j)$

The rules above are instantiated for every $\sigma \in \Sigma$.

Acceptance

Any configuration whose state is $q_A$ is accepting:

$c_n : (x_0, state, q_A) \rightarrow c_n : (x_0, \text{rdf:type}, Accept)$

If a configuration of accepting type is reached, then acceptance is propagated back to the initial configuration, using the following BR:

$c_n : (x_0, conSucc, x_1), c_n : (x_1, \text{rdf:type}, Accept) \rightarrow c_n : (x_0, \text{rdf:type}, Accept)$

Finally, $M$ accepts $\vec{w}$ iff the initial configuration is an accepting configuration. Let $CQ^M$ be the CCQ: $\exists y\ c_n : (y, \text{rdf:type}, conInit), c_n : (y, \text{rdf:type}, Accept)$. It can easily be verified that $QS_C^M \models CQ^M$ iff the initial configuration is an accepting configuration. In order to prove the soundness and completeness of our simulation, we prove the following claims:

Claim (1) The quad-system $QS_C^M$ in the aforementioned simulation is a context acyclic quad-system.

Since there is no edge from any $c_j$ to $c_i$, for each $1 \leq i < j \leq n$, the context dependency graph for $QS_C^M$ is acyclic, and hence $QS_C^M$ is context acyclic.

Claim (2) $QS_C^M \models CQ^M$ iff $M$ accepts $\vec{w}$.

Suppose that $QS_C^M \models CQ^M$. Then by Theorem 1, there exists an assignment $\mu : V(CQ^M) \rightarrow C$, with $CQ^M[\mu] \subseteq SdChase(QS_C^M)$. This implies that there exists a constant $o$ in $C(SdChase(QS_C^M))$, with $\{c_n : (o, \text{rdf:type}, Accept), c_n : (o, \text{rdf:type}, conInit)\} \subseteq SdChase(QS_C^M)$. Thanks to the acceptance axioms, it follows that there exists a constant $o'$ such that $\{c_n : (o, conSucc, o_1), c_n : (o_1, conSucc, o_2), \ldots, c_n : (o_n, conSucc, o')\} \subseteq SdChase(QS_C^M)$, and $c_n : (o', \text{rdf:type}, Accept) \in SdChase(QS_C^M)$. Also, thanks to the initialization axioms, it can be seen that $o$ represents the initial configuration of $M$, i.e. the configuration in which the state is $q_0$, and the left end of the read-write tape contains $\vec{w}$ followed by trailing $\sqcup$s, with the read-write head positioned at the first cell of the tape. Also, the transition axioms make sure that if $c_n : (o, conSucc, o'') \in SdChase(QS_C^M)$, then $o''$ represents a successor configuration of $o$. That is, if $o$ represents the configuration in which $M$ is at state $q$ with the read-write head at position $pos$ of the tape, which contains a letter $\sigma \in \Sigma$, and if $\Delta(q, \sigma) = (q', \sigma', D)$, then $o''$ represents the configuration in which $M$ is at state $q'$, the read-write head is at position $pos - 1$ or $pos + 1$, depending on whether $D = -1$ or $D = +1$, and $\sigma'$ is at position $pos$ of the tape. As a consequence of the above arguments, it follows that $o'$ represents an accepting configuration of $M$, i.e. a configuration in which the state is $q_A$, the lone accepting, halting state. This means that $M$ accepts the string $\vec{w}$.

For the converse, we briefly show that if $QS_C^M \not\models CQ^M$ then $M$ does not accept $\vec{w}$. Suppose that $QS_C^M \not\models CQ^M$. Then by Theorem 1, for every assignment $\mu : V(CQ^M) \rightarrow C$, it should be the case that $CQ^M[\mu] \not\subseteq SdChase(QS_C^M)$. Thanks to the initialization axioms, we know that there exists a constant $o \in C(SdChase(QS_C^M))$ with $c_n : (o, \text{rdf:type}, conInit) \in SdChase(QS_C^M)$. We know that $o$ represents the initial configuration of $M$. Also, by the initial construction axioms of $QS_C^M$, we know that $o$ is the initial element of a double exponential chain of objects that are linearly ordered by the property symbol conSucc. From the transition axioms we know that if, for any $o''$, $c_n : (o, conSucc, o'') \in SdChase(QS_C^M)$, then $o''$ represents a valid successor configuration of $o$, which in turn holds for the successors of $o''$, and so on. This means that the accepting state $q_A$ holds for none of the succeeding double exponential number of configurations of $M$. Hence $M$ does not reach an accepting configuration with string $\vec{w}$, and rejects it.

Since we polynomially reduced the word problem of a 2EXPTIME DTM, which is a 2EXPTIME-hard problem, to the CCQ EP over context acyclic quad-systems, it immediately follows that the CCQ EP over context acyclic quad-systems is 2EXPTIME-hard.

Reconsider the quad-system in Example 3, which is not context acyclic. Suppose that the contexts are enabled with RDFS inferencing, i.e. lclosure() = rdfsclosure(). During SdChase construction, any application of rule (5.3) can only create a triple in $c_2$ in which the Skolem blank node is in the object position, whereas the application of rule (5.4) does not propagate constants in the object position to $c_1$. At first look, the SdChase might therefore seem to terminate. However, the application of the RDFS inference rule $(s, p, o) \rightarrow (o, \text{rdf:type}, \text{rdfs:Resource})$ in $c_2$ derives a quad of the form $c_2 : (\_{:}b, \text{rdf:type}, \text{rdfs:Resource})$, where $\_{:}b$ is the Skolem blank node created by the application of rule (5.3). The application of rule (5.4) then leads to $c_1 : (\_{:}b, \text{rdf:type}, U_1)$. Rule (5.3) is applicable on $c_1 : (\_{:}b, \text{rdf:type}, U_1)$, which again brings a new Skolem blank node to $c_2$. Since this goes on indefinitely, the SdChase construction does not terminate. Hence, as seen above, the notion of context acyclicity can alert us to such infinite cases.
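The non-terminating behaviour described in the last paragraph can be observed with a small bounded simulation. The Python sketch below is a simplification and not the SdChase itself: it applies all rules in synchronous rounds, it models the Skolem blank node created by rule (5.3) as the nested tuple `("sk", s, p)`, and it uses only the single RDFS rule mentioned above instead of a full rdfsclosure(). The function names and the seed quad `c1 : (a, p, U1)` are assumptions made for the illustration.

```python
def step(quads, rdfs):
    """One round over the three BRs of Example 3, optionally with one RDFS rule in c2."""
    new = set(quads)
    for c, s, p, o in quads:
        if c == "c1" and o == "U1":
            # Rule (5.3): a Skolem term stands in for the fresh blank node y1.
            new.add(("c2", s, p, ("sk", s, p)))
            new.add(("c3", p, "rdf:type", "rdf:Property"))
        if c == "c2":
            # Rule (5.4): the object of the c2 triple is not propagated.
            new.add(("c1", s, p, "U1"))
            if rdfs:
                # RDFS rule in c2: (s, p, o) -> (o, rdf:type, rdfs:Resource).
                new.add(("c2", o, "rdf:type", "rdfs:Resource"))
        if c == "c3":
            # Third BR: copy c3 triples into c1.
            new.add(("c1", s, p, o))
    return new


def run(rounds, rdfs):
    """Apply `rounds` rounds starting from the assumed seed quad c1 : (a, p, U1)."""
    quads = {("c1", "a", "p", "U1")}
    for _ in range(rounds):
        quads = step(quads, rdfs)
    return quads


def blank_nodes(quads):
    """Skolem terms occurring in object position."""
    return {o for _, _, _, o in quads if isinstance(o, tuple)}


# Without RDFS a fixpoint is reached; with RDFS the set of Skolem terms keeps growing.
print(run(6, rdfs=False) == run(7, rdfs=False))
print(len(blank_nodes(run(6, rdfs=True))), len(blank_nodes(run(12, rdfs=True))))
```

Without the RDFS rule the rounds stabilize after a few steps with a single Skolem term, while with it every Skolem term eventually becomes the subject of a $c_1$ triple with object $U_1$ and triggers rule (5.3) again.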


Chapter 6

Csafe, Msafe, and Safe Quad-Systems: Restricting the Descendency Structure of Skolem Blank-nodes

In the preceding chapter, we introduced context acyclic quad-systems, a class for which query answering is decidable. To briefly sum up, the context acyclicity technique uses a context dependency graph to model the propagation path of constants across the various contexts in the rules, and requires that this graph contain no cycle passing through a triple generating context. One of the main drawbacks of context acyclicity is that it only analyzes the BR part of a quad-system and ignores the quad-graph part, producing a large number of false alarms. That is, in a large number of cases the dChase is finite although the context dependency graph is cyclic. To compensate for this drawback, in this chapter we define more expressive classes of quad-systems, namely SAFE, MSAFE and CSAFE, that are FECs and for which query entailment is decidable. Finiteness/decidability is achieved by putting certain restrictions (explained below) on the blank nodes generated in the dChase. Before we give the description of our classes, we first adapt and reformulate the restricted variant of the chase given in Fagin et al. [39] (also called the non-oblivious chase) to the quad-system setting.

For a set of quad-patterns $S$ and a set of terms $T$, we define the relation $T$-connectedness between quad-patterns in $S$ as the least relation with:

• $q_1$ and $q_2$ are $T$-connected, if $CV(q_1) \cap CV(q_2) \cap T \neq \emptyset$, for any two quad-patterns $q_1, q_2 \in S$,
• if $q_1$ and $q_2$ are $T$-connected, and $q_2$ and $q_3$ are $T$-connected, then $q_1$ and $q_3$ are also $T$-connected, for any quad-patterns $q_1, q_2, q_3 \in S$.

It can be noted that $T$-connectedness is an equivalence relation and partitions $S$ into a set of $T$-components (a similar notion is called a piece in Baget et al. [6]). Note that for two distinct $T$-components $P_1, P_2$ of $S$, $CV(P_1) \cap CV(P_2) \cap T = \emptyset$. For any BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$, suppose $P_1, P_2, \ldots, P_k$ are the pairwise distinct $\{\vec{y}\}$-components of $head(r)(\vec{x}, \vec{y})$; then $r$ can be replaced by the semantically equivalent set of BRs $\{body(r)(\vec{x}, \vec{z}) \rightarrow P_1, \ldots, body(r)(\vec{x}, \vec{z}) \rightarrow P_k\}$, whose symbol size is worst case quadratic w.r.t. the symbol size of $r$. Hence, w.l.o.g. we assume that for any BR $r$, the set of quad-patterns $head(r)$ is a single component w.r.t. the set of existentially quantified variables in $r$.

Considering the fact that the local semantics for contexts are fixed a priori (for instance RDFS), both the number of rules in the set of local inference rules LIR and the size of each rule in LIR can be assumed to be a constant. Note that each local inference rule is range restricted and does not contain existentially quantified variables in its head. Any $ir \in LIR$ is of the form:

$\forall \vec{x} \forall \vec{z}\ [t_1(\vec{x}, \vec{z}) \wedge \ldots \wedge t_k(\vec{x}, \vec{z}) \rightarrow t'_1(\vec{x})]$, (6.1)

where $t_i(\vec{x}, \vec{z})$, for $i = 1, \ldots, k$, are triple patterns whose variables are from $\{\vec{x}\}$ or $\{\vec{z}\}$, and $t'_1(\vec{x})$ is a triple pattern whose variables are from $\{\vec{x}\}$. Hence, for any quad-system $QS_C = \langle Q_C, R \rangle$, in order to accomplish the effect of local inferencing in each context $c \in C$, for each $ir \in LIR$ of the form (6.1), we could augment $R$ with a BR $ir_c$ of the form:

$\forall \vec{x} \forall \vec{z}\ [c : t_1(\vec{x}, \vec{z}) \wedge \ldots \wedge c : t_k(\vec{x}, \vec{z}) \rightarrow c : t'_1(\vec{x})]$
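The splitting of a BR head into its $\{\vec{y}\}$-components, introduced earlier in this chapter, can be sketched with a small union-find in Python. This is an illustrative sketch under stated assumptions: quad-patterns are 4-tuples of strings, variables are written with a leading `?`, the names `components` and `split_rule` are hypothetical, and a quad-pattern that contains no term of $T$ is treated as a component of its own.

```python
def components(patterns, terms):
    """Partition quad-patterns into T-components, where T is the set `terms`."""
    parent = list(range(len(patterns)))

    def find(i):
        # Find the representative of i, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # first pattern index seen for each term of T
    for i, pattern in enumerate(patterns):
        for t in pattern:
            if t in terms:
                if t in owner:
                    parent[find(i)] = find(owner[t])  # shared term of T: merge
                else:
                    owner[t] = i
    groups = {}
    for i, pattern in enumerate(patterns):
        groups.setdefault(find(i), []).append(pattern)
    return list(groups.values())


def split_rule(body, head, existentials):
    """Replace body -> head by one BR per component of the head."""
    return [(body, part) for part in components(head, existentials)]


# Hypothetical BR with three existentially quantified variables in its head.
body = [("c1", "?x", "p", "?z")]
head = [
    ("c3", "?x", "p", "?y1"),
    ("c3", "?y1", "q", "?y2"),
    ("c4", "?x", "r", "?y3"),
    ("c4", "?x", "s", "a"),
]
for rule_body, rule_head in split_rule(body, head, {"?y1", "?y2", "?y3"}):
    print(rule_body, "->", rule_head)
```

The universally quantified variable `?x` is shared by all head patterns but does not connect them, since only terms of $T$ (here the existential variables) count for connectedness.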

Since $\|LIR\|$ is a constant and the size of the augmentation is linear in $|C|$, w.l.o.g. we assume that the set $R$ contains a BR $ir_c$, for each $ir \in LIR$, $c \in C$. For any BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$ and an assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \rightarrow C$, the application of $\mu$ on $r$ is defined as:

$apply(r, \mu) = head(r)[\mu^{ext(\vec{y})}]$

where $\mu^{ext(\vec{y})} \supseteq \mu$ s.t. $\mu^{ext(\vec{y})}(y_i) = \_{:}b$ is a fresh blank node from $B_{sk}$, for each $y_i \in \{\vec{y}\}$.

We assume that there exists an order $\prec_l$ (for instance, lexicographic order) on the set of constants. We extend $\prec_l$ to the set of quads s.t. for any two quads $c : (s, p, o)$ and $c' : (s', p', o')$, $c : (s, p, o) \prec_l c' : (s', p', o')$, iff $c \prec_l c'$, or $c = c', s \prec_l s'$, or $c = c', s = s', p \prec_l p'$, or $c = c', s = s', p = p', o \prec_l o'$. It can be noted that $\prec_l$ is a strict linear order over the set of all quads. For any finite quad-graph $Q_C$, the $\prec_l$-greatest quad of $Q_C$, denoted $greatestQuad_{\prec_l}(Q_C)$, is the quad $q \in Q_C$ s.t. $q' \prec_l q$, for every other $q' \in Q_C$. Also, the order $\prec_q$ is defined over the set of finite quad-graphs as follows: for any two finite quad-graphs $Q_C$, $Q'_{C'}$,

$Q_C \prec_q Q'_{C'}$, if (i) $Q_C \subset Q'_{C'}$;
$Q_C \prec_q Q'_{C'}$, if (i) does not hold and (ii) $greatestQuad_{\prec_l}(Q_C \setminus Q'_{C'}) \prec_l greatestQuad_{\prec_l}(Q'_{C'} \setminus Q_C)$;
$Q_C \not\prec_q Q'_{C'}$, if neither (i) nor (ii) is satisfied.

A relation $R$ over a set $A$ is called a strict linear order iff $R$ is irreflexive, transitive, and $R(a, b)$ or $R(b, a)$ holds, for every distinct $a, b \in A$.

Property 1. Let $\mathcal{Q}$ be the set of all finite quad-graphs; $\prec_q$ is a strict linear order over $\mathcal{Q}$.

We now define in parallel the dChase of a quad-system $QS_C = \langle Q_C, R \rangle$ and the level of a quad in the dChase of $QS_C$ as follows: any quad in $Q_C$ is of level 0.

The level of a set of quads is the largest among the levels of the quads in the set. The level of any quad that results from the application of a BR $r$ w.r.t. an assignment $\mu$ is one more than the level of the set $body(r)[\mu]$, if it has not already been assigned a level. Let $\prec$ be an ordering on the quad-graphs s.t. for any two quad-graphs $Q'_{C'}$ and $Q''_{C''}$ of the same level, $Q'_{C'} \prec Q''_{C''}$ iff $Q'_{C'} \prec_q Q''_{C''}$. For $Q'_{C'}$ and $Q''_{C''}$ of different levels, $Q'_{C'} \prec Q''_{C''}$ iff the level of $Q'_{C'}$ is less than the level of $Q''_{C''}$. It can easily be seen that $\prec$ is a strict linear order over the set of quad-graphs. For any BRs $r, r'$ and assignments $\mu, \mu'$ over $V(body(r)), V(body(r'))$, respectively, $(r, \mu) \prec (r', \mu')$ iff $body(r)[\mu] \prec body(r')[\mu']$.

For any quad-graph $Q'_{C'}$, a set of BRs $R$, a BR $r \in R$, and an assignment $\mu : V(body(r)) \rightarrow C$, let $applicable_R(r, \mu, Q'_{C'})$ be the least ternary predicate inductively defined as: $applicable_R(r, \mu, Q'_{C'})$ holds, if (a) $body(r)[\mu] \subseteq Q'_{C'}$ and $head(r)[\mu''] \not\subseteq Q'_{C'}$, $\forall \mu'' \supseteq \mu$, and (b) $\nexists r' \in R$, $\nexists \mu'$ s.t. $r' \neq r$ or $\mu' \neq \mu$, with $(r', \mu') \prec (r, \mu)$ and $applicable_R(r', \mu', Q'_{C'})$.

For any quad-system $QS_C = \langle Q_C, R \rangle$, let

$dChase_0(QS_C) = Q_C$;
$dChase_{i+1}(QS_C) = dChase_i(QS_C) \cup apply(r, \mu)$, if there exists $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in R$ and an assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \rightarrow C$ s.t. $applicable_R(r, \mu, dChase_i(QS_C))$;
$dChase_{i+1}(QS_C) = dChase_i(QS_C)$, otherwise;

for any $i \in \mathbb{N}$. The dChase of $QS_C$, noted $dChase(QS_C)$, is given as:

$dChase(QS_C) = \bigcup_{i \in \mathbb{N}} dChase_i(QS_C)$

Intuitively, $dChase_i(QS_C)$ can be thought of as the state of $dChase(QS_C)$ at the end of iteration $i$. It can be noted that, if there exists $i$ s.t. $dChase_i(QS_C) = dChase_{i+1}(QS_C)$, then $dChase(QS_C) = dChase_i(QS_C)$. A model $I^C$ of a quad-system $QS_C$ is called universal [35], iff the following holds: $I^C$ is a model of $QS_C$, and for any model $I'^C$ of $QS_C$ there exists a homomorphism from $I^C$ to $I'^C$.

Theorem 2. For any consistent quad-system $QS_C$, the following holds: (i) $dChase(QS_C)$ is a universal model of $QS_C$,$^1$ and (ii) for any boolean CCQ $CQ()$, $QS_C \models CQ()$ iff there exists a map $\mu : V(CQ) \to C$ such that $\{CQ()\}[\mu] \subseteq dChase(QS_C)$.

$^1$ Though $dChase(QS_C)$ is not an interpretation in a strict model-theoretic sense, one can easily create the corresponding interpretation $\mathcal{I}^{dChase(QS_C)} = \{I^c = \langle \Delta^c, \cdot^c \rangle\}_{c \in C}$, s.t. for every $c \in C$, $\Delta^c$ is equal to the set of constants in $graph_{dChase(QS_C)}(c)$, and $\cdot^c$ is s.t. $(s, p, o) \in graph_{dChase(QS_C)}(c)$ iff $(s^c, o^c) \in p^c$.

We call the sequence $dChase_0(QS_C), dChase_1(QS_C), \ldots$ the dChase sequence of $QS_C$. The following lemma shows that, in a dChase sequence of a quad-system, any dChase iteration can be performed in time exponential w.r.t. the size of the largest BR.

Lemma 3. For a quad-system $QS_C = \langle Q_C, R \rangle$ and any $i \in \mathbb{N}^+$, the following holds: (i) $dChase_i(QS_C)$ can be computed in time $O(|R| \cdot \|dChase_{i-1}(QS_C)\|^{rs})$, where $rs = \max_{r \in R} \|r\|$, (ii) $\|dChase_i(QS_C)\| = O(\|dChase_{i-1}(QS_C)\| + \|R\|)$.

Proof. (i) We can first determine whether there exist an $r \in R$ and an assignment $\mu$ s.t. $applicable_R(r, \mu, dChase_{i-1}(QS_C))$ holds, in the following naive way:
(1) Bind the set of variables in all rules in $R$ with the set of constants in $dChase_{i-1}(QS_C)$. Let this set be called $S$. Note that $|S| = O(|R| \cdot \|dChase_{i-1}(QS_C)\|^{rs})$, where $rs = \max_{r \in R} \|r\|$. Also, note that each of the bindings in $S$ is of the form $body(r)(\vec{x}, \vec{z})[\mu] \to head(r)(\vec{x}, \vec{y})[\mu']$ (♥), where $r \in R$.
(2) From the set $S$, filter out every binding of the form (♥) in which $\vec{x}[\mu] \neq \vec{x}[\mu']$. Let $S'$ be the resulting set.
(3) From the set $S'$, filter out all the bindings of the form (♥) with $head(r)(\vec{x}, \vec{y})[\mu'] \subseteq dChase_{i-1}(QS_C)$, with resulting set $S''$.
(4) If $S'' = \emptyset$, then there is no $r \in R$ and assignment $\mu$ s.t. $applicable_R(r, \mu, dChase_{i-1}(QS_C))$ is true. Otherwise, if $S'' \neq \emptyset$, then each binding of the form (♥) in $S''$ satisfies condition (a) of $applicable_R(r, \mu, dChase_{i-1}(QS_C))$. Now, we can sort $S''$ w.r.t. $\prec$ and select the least binding $b$ of the form (♥), so that condition (b) of $applicable_R$ is also satisfied for $b$. It can easily be seen that $applicable_R(r, \mu, dChase_{i-1}(QS_C))$ holds for the $r, \mu$ extracted from $b$.
Since the size of each binding is at most $rs$, the operations (1)-(4) can be performed in time $O(|R| \cdot \|dChase_{i-1}(QS_C)\|^{rs})$. Since $dChase_i(QS_C) = dChase_{i-1}(QS_C) \cup head(r)[\mu]$, for $r, \mu$ with $applicable_R(r, \mu, dChase_{i-1}(QS_C))$, $dChase_i(QS_C)$ can be computed in time $O(|R| \cdot \|dChase_{i-1}(QS_C)\|^{rs})$.
(ii) Trivially holds, since in the worst case $dChase_i(QS_C) = dChase_{i-1}(QS_C) \cup head(r)[\mu]$, for $r \in R$.

Lemma 4. For any quad-system $QS_C$, if $\_{:}b$ is a Skolem blank node in $dChase(QS_C)$, generated by the application of assignment $\mu$ on $r = body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y})$, with $\mu^{ext(\vec{y})}(y_j) = \_{:}b$, $y_j \in \{\vec{y}\}$, then $\_{:}b$ is unique for $(r, y_j, \vec{x}[\mu^{ext(\vec{y})}])$.

Proof. By contradiction, suppose $\_{:}b$ is not unique for $(r, y_j, \vec{x}[\mu^{ext(\vec{y})}])$, i.e. there exists $\_{:}b' \neq \_{:}b$ in $dChase(QS_C)$, with $\_{:}b'$ generated by $r$ such that $\_{:}b' = \mu'^{ext(\vec{y})}(y_j)$ and $\vec{x}[\mu^{ext(\vec{y})}] = \vec{x}[\mu'^{ext(\vec{y})}]$. W.l.o.g. suppose $\_{:}b$ was generated in an iteration $l \in \mathbb{N}$ and $\_{:}b'$ in an iteration $m > l$. This means that $head(r)(\vec{x}, \vec{y})[\mu^{ext(\vec{y})}] \subseteq dChase_l(QS_C)$, and hence $head(r)(\vec{x}, \vec{y})[\mu^{ext(\vec{y})}] \subseteq dChase_{m-1}(QS_C)$. Also, since $\mu|_{\vec{x}} = \mu'|_{\vec{x}}$, there exists $\mu'' \supseteq \mu'$ s.t. $head(r)(\vec{x}, \vec{y})[\mu''] \subseteq dChase_{m-1}(QS_C)$. This means that condition (a) of $applicable_R$, which must hold for $applicable_R(r, \mu', dChase_{m-1}(QS_C))$ to be true, is violated, and as a consequence $applicable_R(r, \mu', dChase_{m-1}(QS_C))$ is false. Hence, our assumption that $\_{:}b' = y_j[\mu'^{ext(\vec{y})}]$ is false. Hence, $\_{:}b$ is unique for $(r, y_j, \vec{x}[\mu^{ext(\vec{y})}])$.
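The naive iteration of Lemma 3 and the Skolem naming of Lemma 4 can be illustrated with a minimal sketch. It is not the dChase procedure of this thesis: it assumes quads encoded as `(context, s, p, o)` tuples, variables written with a leading `?`, and it picks the first applicable rule in sorted order instead of the order $\prec$. The names `dchase`, `first_applicable`, `matches` and `RULES` are hypothetical; the input is the quad-system of Example 7 in Section 6.1.

```python
from itertools import count

def is_var(term):
    return term.startswith("?")

def variables(patterns):
    return {t for quad in patterns for t in quad if is_var(t)}

def matches(patterns, quads, binding):
    """Yield every extension of `binding` that maps all patterns into `quads`."""
    if not patterns:
        yield dict(binding)
        return
    for quad in sorted(quads):
        extended = dict(binding)
        if all((extended.setdefault(t, v) == v) if is_var(t) else (t == v)
               for t, v in zip(patterns[0], quad)):
            yield from matches(patterns[1:], quads, extended)

def first_applicable(quads, rules):
    """Return the first (rule, frontier assignment) whose head is not yet satisfied."""
    for rule_id, (body, head) in sorted(rules.items()):
        frontier = sorted(variables(body) & variables(head))
        existentials = sorted(variables(head) - variables(body))
        for mu in matches(body, quads, {}):
            restricted = {x: mu[x] for x in frontier}
            # analogue of condition (a): no extension of mu already satisfies the head
            if next(matches(head, quads, restricted), None) is None:
                return rule_id, head, frontier, existentials, restricted
    return None

def dchase(quads, rules, max_iterations=1000):
    quads, skolem, fresh = set(quads), {}, count(1)
    for _ in range(max_iterations):
        step = first_applicable(quads, rules)
        if step is None:            # fixpoint: dChase_i = dChase_{i+1}
            return quads, skolem
        rule_id, head, frontier, existentials, mu = step
        vector = tuple(mu[x] for x in frontier)
        for y in existentials:      # Lemma 4: one blank node per (rule, y_j, vector)
            key = (rule_id, y, vector)
            if key not in skolem:
                skolem[key] = f"_:b{next(fresh)}"
            mu[y] = skolem[key]
        quads |= {tuple(mu.get(t, t) for t in quad) for quad in head}
    raise RuntimeError("no fixpoint within max_iterations")

# Bridge rules r1..r4 of Example 7, as (body, head) pairs keyed by rule id.
RULES = {
    1: ([("c1", "?x11", "?x12", "?z1")], [("c2", "?x11", "?x12", "?y1")]),
    2: ([("c2", "a", "?z2", "?x22")], [("c3", "a", "?x22", "?y2")]),
    3: ([("c2", "?z3", "b", "?x32")], [("c3", "b", "?x32", "?y3")]),
    4: ([("c3", "a", "?z41", "?x41"), ("c3", "b", "?z42", "?x42")],
        [("c2", "?y4", "?x41", "a"), ("c2", "?y4", "?x42", "b")]),
}
CHASE, SKOLEM = dchase({("c1", "a", "b", "c")}, RULES)
```

Keying the `skolem` dictionary by rule id, existential variable and frontier vector is what makes a repeated application with the same frontier values reuse the same blank node, which is the uniqueness stated in Lemma 4.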

6.1 Csafe, Msafe, and Safe Quad-Systems: Decidable Classes

Recall that, for any quad-system $QS_C$, the set of blank nodes $B(dChase(QS_C))$ in its $dChase(QS_C)$ contains not only the blank nodes present in $QS_C$, i.e. $B(QS_C)$, but also the Skolem blank nodes that are generated during the dChase construction process. Note that the following relation holds: $B_{sk}(dChase(QS_C)) = B(dChase(QS_C)) \setminus B(QS_C)$. We assume w.l.o.g. that for any set of BRs $R$, any BR in $R$ has a unique rule identifier, and we often write $r_i$ for the BR in $R$ whose identifier is $i$.

Definition 5 (Origin RuleId/Vector). For any Skolem blank node $\_{:}b$ generated in the dChase by the application of a BR $r_i = body(r_i)(\vec{x}, \vec{z}) \to head(r_i)(\vec{x}, \vec{y})$ using assignment $\mu : \{\vec{x}\} \cup \{\vec{z}\} \to C$, i.e. $\_{:}b = \mu^{ext(\vec{y})}(y_j)$ for some $y_j \in \{\vec{y}\}$, we say that the origin ruleId of $\_{:}b$ is $i$, denoted $originRuleId(\_{:}b) = i$. Moreover, $\vec{w} = \vec{x}[\mu]$ is said to be the origin vector of $\_{:}b$, denoted $originVector(\_{:}b) = \vec{w}$.

As we saw in Lemma 4, any such Skolem blank node $\_{:}b$ generated in the dChase can be uniquely represented by the expression $(i, j, \vec{w})$, where $i$ is the rule id and $j$ is the identifier of the existentially quantified variable $y_j$ in $r_i$ substituted by $\_{:}b$ during the application of $\mu$ on $r_i$. Also, in the above case, we relate each constant $k = \mu^{ext(\vec{y})}(x_h)$, $x_h \in \{\vec{x}\}$, to $\_{:}b$ by the relation childOf. Moreover, since children of a Skolem blank node can be Skolem blank nodes, which themselves can have children, one can naturally define the relation descendantOf $=$ childOf$^+$ as the transitive closure of childOf. Note that, according to the above definition, 'descendantOf' is not reflexive. In addition, we can keep track of the set of contexts in which a blank node was first generated, using the following notion:

Definition 6 (Origin-contexts). For any quad-system $QS_C$ and for any Skolem blank node $\_{:}b \in B_{sk}(dChase(QS_C))$, the set of origin-contexts of $\_{:}b$ is given by $originContexts(\_{:}b) = \{c \mid \exists i.\ c:(s, p, o) \in dChase_i(QS_C),\ s = \_{:}b$ or $p = \_{:}b$ or $o = \_{:}b$, and $\nexists j < i$ with $c':(s', p', o') \in dChase_j(QS_C),\ s' = \_{:}b$ or $p' = \_{:}b$ or $o' = \_{:}b$, for any $c' \in C\}$. Intuitively, the origin-contexts of a Skolem blank node $\_{:}b$ is the set of contexts in which triples containing $\_{:}b$ are first generated during the dChase construction.
Note that there can be multiple contexts in which $\_{:}b$ is simultaneously generated. By setting $originRuleId(k) = n.d.$ (resp. $originVector(k) = n.d.$, resp. $originContexts(k) = n.d.$), where $n.d.$ is an ad hoc constant, for every $k \notin B_{sk}(dChase(QS_C))$, we extend the definition of origin ruleId (resp. origin vector, resp. origin-contexts) to all the constants in the dChase of a quad-system.

Example 7. Consider the quad-system $\langle Q_C, R \rangle$, where $Q_C = \{c_1 : (a, b, c)\}$. Suppose $R$ is the following set:

$$R = \left\{ \begin{array}{ll}
c_1 : (x_{11}, x_{12}, z_1) \to c_2 : (x_{11}, x_{12}, y_1) & (r_1) \\
c_2 : (a, z_2, x_{22}) \to c_3 : (a, x_{22}, y_2) & (r_2) \\
c_2 : (z_3, b, x_{32}) \to c_3 : (b, x_{32}, y_3) & (r_3) \\
c_3 : (a, z_{41}, x_{41}),\ c_3 : (b, z_{42}, x_{42}) \to c_2 : (y_4, x_{41}, a),\ c_2 : (y_4, x_{42}, b) & (r_4)
\end{array} \right\}$$

For brevity, quantifiers have been omitted, and variables of the form $y_i$ or $y_{ij}$ are implicitly existentially quantified. Iterations during the dChase construction are:

$dChase_0(QS_C) = \{c_1:(a, b, c)\}$
$dChase_1(QS_C) = \{c_1:(a, b, c),\ c_2:(a, b, \_{:}b_1)\}$
$dChase_2(QS_C) = \{c_1:(a, b, c),\ c_2:(a, b, \_{:}b_1),\ c_3:(a, \_{:}b_1, \_{:}b_2)\}$
$dChase_3(QS_C) = \{c_1:(a, b, c),\ c_2:(a, b, \_{:}b_1),\ c_3:(a, \_{:}b_1, \_{:}b_2),\ c_3:(b, \_{:}b_1, \_{:}b_3)\}$
$dChase_4(QS_C) = \{c_1:(a, b, c),\ c_2:(a, b, \_{:}b_1),\ c_3:(a, \_{:}b_1, \_{:}b_2),\ c_3:(b, \_{:}b_1, \_{:}b_3),\ c_2:(\_{:}b_4, \_{:}b_2, a),\ c_2:(\_{:}b_4, \_{:}b_3, b)\}$
$dChase_5(QS_C) = dChase_4(QS_C)$

Also note: $originRuleId(\_{:}b_1) = 1$, $originRuleId(\_{:}b_2) = 2$, $originRuleId(\_{:}b_3) = 3$, $originRuleId(\_{:}b_4) = 4$,

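Figure 6.1: Descendance graph of $\_{:}b_4$ in Example 7. Node labels (origin ruleId, origin vector, origin-contexts): $\_{:}b_4$: $4, \langle \_{:}b_2, \_{:}b_3 \rangle, \{c_2\}$; $\_{:}b_3$: $3, \langle \_{:}b_1 \rangle, \{c_3\}$; $\_{:}b_2$: $2, \langle \_{:}b_1 \rangle, \{c_3\}$; $\_{:}b_1$: $1, \langle a, b \rangle, \{c_2\}$; leaf nodes $a$ and $b$. Note: n.d. labels are not shown.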

$originVector(\_{:}b_1) = \langle a, b \rangle$, $originVector(\_{:}b_2) = originVector(\_{:}b_3) = \langle \_{:}b_1 \rangle$, $originVector(\_{:}b_4) = \langle \_{:}b_2, \_{:}b_3 \rangle$, $originContexts(\_{:}b_1) = \{c_2\}$, $originContexts(\_{:}b_2) = originContexts(\_{:}b_3) = \{c_3\}$, $originContexts(\_{:}b_4) = \{c_2\}$. Also, $\_{:}b_1$ descendantOf $\_{:}b_3$, $\_{:}b_1$ descendantOf $\_{:}b_2$, $\_{:}b_2$ descendantOf $\_{:}b_4$, $\_{:}b_3$ descendantOf $\_{:}b_4$, and $\_{:}b_1$ descendantOf $\_{:}b_4$.

For any Skolem blank node $\_{:}b$ (in dChase), its descendant hierarchy can be analyzed using a descendance graph $\langle V, E, \lambda_r, \lambda_v, \lambda_c \rangle$, which is a labeled graph rooted at $\_{:}b$, whose set of nodes $V$ are constants in the dChase, and whose set of edges $E$ is such that $(k, k') \in E$ iff $k'$ is a descendant of $k$. $\lambda_r, \lambda_v, \lambda_c$ are node labeling functions such that $\lambda_r(k) = originRuleId(k)$, $\lambda_v(k) = originVector(k)$, and $\lambda_c(k) = originContexts(k)$, for any $k \in V$. The descendance graph for $\_{:}b_4$ of Example 7 is shown in Fig. 6.1. For any two vectors of constants $\vec{v}, \vec{w}$, we write $\vec{v} \cong \vec{w}$ iff there exists a bijection $\mu : B(\vec{v}) \to B(\vec{w})$ such that $\vec{w} = \vec{v}[\mu]$.

Definition 8 (safe, msafe, csafe quad-systems). A quad-system $QS_C$ is said to be unsafe (resp. unmsafe, resp. uncsafe) iff there exist Skolem blank nodes $\_{:}b \neq \_{:}b'$ in $dChase(QS_C)$ such that $\_{:}b$ is a descendant of $\_{:}b'$, with $originRuleId(\_{:}b) = originRuleId(\_{:}b')$ and $originVector(\_{:}b) \cong originVector(\_{:}b')$ (resp. $originRuleId(\_{:}b) = originRuleId(\_{:}b')$, resp. $originContexts(\_{:}b) = originContexts(\_{:}b')$). A quad-system is safe (resp. msafe, resp. csafe) iff it is not unsafe (resp. unmsafe, resp. uncsafe).

Intuitively, safe, msafe and csafe quad-systems do not allow repetitive generation of Skolem blank nodes with a certain set of attributes in their dChase. The containment relation between the classes of safe, msafe, and csafe quad-systems is established by the following theorem:

Theorem 9. Let SAFE, MSAFE, and CSAFE denote the classes of safe, msafe, and csafe quad-systems, respectively. Then the following holds:

$$\mathrm{CSAFE} \subset \mathrm{MSAFE} \subset \mathrm{SAFE}$$

Proof. We first show MSAFE $\subseteq$ SAFE, by showing the inverse inclusion of their complements, i.e. UNSAFE $\subseteq$ UNMSAFE. Suppose a given quad-system $QS_C$ is unsafe; then by definition its dChase contains two distinct Skolem blank nodes $\_{:}b, \_{:}b'$ such that $\_{:}b$ is a descendant of $\_{:}b'$, with $originRuleId(\_{:}b) = originRuleId(\_{:}b')$ and $originVector(\_{:}b) \cong originVector(\_{:}b')$. In particular, $originRuleId(\_{:}b) = originRuleId(\_{:}b')$. Hence, by definition, $QS_C$ is unmsafe. Hence UNSAFE $\subseteq$ UNMSAFE (†). Now, we show that CSAFE $\subseteq$ MSAFE by showing UNMSAFE $\subseteq$ UNCSAFE. Suppose a given quad-system $QS_C = \langle Q_C, R \rangle$ is unmsafe; then by definition its dChase contains two distinct Skolem blank nodes $\_{:}b, \_{:}b'$ such that $\_{:}b$ is a descendant of $\_{:}b'$, with $originRuleId(\_{:}b) = originRuleId(\_{:}b')$. This implies that there exist a BR $r_i = body(r_i)(\vec{x}, \vec{z}) \to head(r_i)(\vec{x}, \vec{y})$ and assignments $\mu$ and $\mu'$ s.t. $\_{:}b$ (resp. $\_{:}b'$) was generated in $dChase(QS_C)$ as a result of the application of $\mu$ (resp. $\mu'$) on $r_i$. That is, $\_{:}b = y_j[\mu^{ext(\vec{y})}]$ and $\_{:}b' = y_k[\mu'^{ext(\vec{y})}]$, where $y_j, y_k \in \{\vec{y}\}$. We have the following two subcases: (i) $j = k$, (ii) $j \neq k$.

Suppose (i) $j = k$; then it immediately follows that $originContexts(\_{:}b) = originContexts(\_{:}b')$. Hence, $QS_C$ is uncsafe. Suppose (ii) $j \neq k$; then by construction of dChase, on application of $\mu'$ to $r_i$, along with $\_{:}b'$, a Skolem blank node $\_{:}b'' = y_j[\mu'^{ext(\vec{y})}]$, with $y_j \in \{\vec{y}\}$, is also generated. Since $\_{:}b$ and $\_{:}b''$ are generated by substitutions of the same variable $y_j \in \{\vec{y}\}$ of BR $r_i$, $originContexts(\_{:}b) = originContexts(\_{:}b'')$. Also, since $childOf(\_{:}b') = childOf(\_{:}b'') = \{\vec{x}[\mu'^{ext(\vec{y})}]\}$, $\_{:}b$ is a descendant of $\_{:}b''$. Hence, by definition, $QS_C$ is uncsafe. Hence UNMSAFE $\subseteq$ UNCSAFE (‡). From (†) and (‡), it follows that CSAFE $\subseteq$ MSAFE $\subseteq$ SAFE. To show that the containments are strict, consider the quad-system $QS_C$ in Example 7. By definition, $QS_C$ is msafe but uncsafe, as the Skolem blank nodes $\_{:}b_1, \_{:}b_4$, which have the same origin contexts, are s.t. $\_{:}b_1$ is a descendant of $\_{:}b_4$. Hence, CSAFE $\subset$ MSAFE. For MSAFE $\subset$ SAFE, the following example shows an instance of a quad-system that is unmsafe, yet safe.

Example 10. Consider the quad-system $QS_C = \langle Q_C, R \rangle$, where $Q_C = \{c_1 : (a, b, c),\ c_2 : (c, d, e)\}$, and $R$ is given by:

$c_1 : (x_{11}, x_{12}, x_{13}),\ c_2 : (x_{13}, x_{14}, z_1) \to c_3 : (y_1, x_{11}, x_{12}),\ c_4 : (x_{12}, x_{13}, x_{14})$   $(r_1)$
$c_3 : (x_{21}, a, x_{22}),\ c_4 : (x_{22}, x_{23}, x_{24}) \to c_1 : (x_{21}, a, x_{22}),\ c_2 : (x_{22}, x_{23}, x_{24})$   $(r_2)$
$c_3 : (x_{21}, x_{22}, a),\ c_4 : (a, x_{23}, x_{24}) \to c_1 : (x_{21}, x_{22}, a),\ c_2 : (a, x_{23}, x_{24})$   $(r_3)$
$c_3 : (x_{21}, x_{22}, x_{23}),\ c_4 : (x_{23}, a, x_{24}) \to c_1 : (x_{21}, x_{22}, x_{23}),\ c_2 : (x_{23}, a, x_{24})$   $(r_4)$
$c_3 : (x_{21}, x_{22}, x_{23}),\ c_4 : (x_{23}, x_{24}, a) \to c_1 : (x_{21}, x_{22}, x_{23}),\ c_2 : (x_{23}, x_{24}, a)$   $(r_5)$

Note that for brevity quantifiers have been omitted, and variables of the form $y_i$ or $y_{ij}$ are implicitly existentially quantified. Iterations during dChase construction are:

$dChase_0(QS_C) = \{c_1:(a, b, c),\ c_2:(c, d, e)\}$
$dChase_1(QS_C) = dChase_0(QS_C) \cup \{c_3:(\_{:}b_1, a, b),\ c_4:(b, c, d)\}$
$dChase_2(QS_C) = dChase_1(QS_C) \cup \{c_1:(\_{:}b_1, a, b),\ c_2:(b, c, d)\}$
$dChase_3(QS_C) = dChase_2(QS_C) \cup \{c_3:(\_{:}b_2, \_{:}b_1, a),\ c_4:(a, b, c)\}$
$dChase_4(QS_C) = dChase_3(QS_C) \cup \{c_1:(\_{:}b_2, \_{:}b_1, a),\ c_2:(a, b, c)\}$
$dChase_5(QS_C) = dChase_4(QS_C) \cup \{c_3:(\_{:}b_3, \_{:}b_2, \_{:}b_1),\ c_4:(\_{:}b_1, a, b)\}$
$dChase_6(QS_C) = dChase_5(QS_C) \cup \{c_1:(\_{:}b_3, \_{:}b_2, \_{:}b_1),\ c_2:(\_{:}b_1, a, b)\}$
$dChase_7(QS_C) = dChase_6(QS_C) \cup \{c_3:(\_{:}b_4, \_{:}b_3, \_{:}b_2),\ c_4:(\_{:}b_2, \_{:}b_1, a)\}$
$dChase_8(QS_C) = dChase_7(QS_C) \cup \{c_1:(\_{:}b_4, \_{:}b_3, \_{:}b_2),\ c_2:(\_{:}b_2, \_{:}b_1, a)\}$
$dChase_9(QS_C) = dChase_8(QS_C) \cup \{c_3:(\_{:}b_5, \_{:}b_4, \_{:}b_3),\ c_4:(\_{:}b_3, \_{:}b_2, \_{:}b_1)\}$
$dChase(QS_C) = dChase_9(QS_C)$

It can be seen that $\_{:}b_1, \_{:}b_2, \_{:}b_3, \_{:}b_4, \_{:}b_5$ form a descendant chain, since $\_{:}b_i$ descendantOf $\_{:}b_{i+1}$, for each $i = 1, \ldots, 4$. Also, $originRuleId(\_{:}b_i) = originRuleId(\_{:}b_{i+1})$, for each $i = 1, \ldots, 4$. Hence, $QS_C$ is unmsafe. However, it can be seen that $originVector(\_{:}b_1) = \langle a, b, c, d \rangle$, $originVector(\_{:}b_2) = \langle \_{:}b_1, a, b, c \rangle$, $originVector(\_{:}b_3) = \langle \_{:}b_2, \_{:}b_1, a, b \rangle$, $originVector(\_{:}b_4) = \langle \_{:}b_3, \_{:}b_2, \_{:}b_1, a \rangle$, $originVector(\_{:}b_5) = \langle \_{:}b_4, \_{:}b_3, \_{:}b_2, \_{:}b_1 \rangle$, and $originVector(\_{:}b_i) \not\cong originVector(\_{:}b_j)$, for $1 \leq i \neq j \leq 5$, and hence, by definition, $QS_C$ is safe with a terminating dChase. It can be noticed that during each distinct application of $r_1$, the vectors of constants bound to the vector of variables $\langle x_{11}, \ldots, x_{14} \rangle$ are different w.r.t. $\cong$.

In this way, by also keeping track of the origin vectors of the Skolem blank nodes in the dChase, safe quad-systems are capable of recognizing such positive cases of finite dChases, which are classified as negative cases by msafe quad-systems.
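The three conditions of Definition 8 can be checked mechanically once the origin labels of a finished dChase are known. The following is a minimal sketch under stated assumptions: blank nodes are strings starting with `_:`, the labels are given as plain dictionaries, and the names `isomorphic`, `descendants` and `classify` are hypothetical. The label tables are copied from Examples 7 and 10; the origin-contexts of Example 10 (all `c3`) are read off from its dChase iterations and are not stated explicitly in the text.

```python
def isomorphic(v, w):
    """v ≅ w: w equals v up to a bijective renaming of blank nodes."""
    if len(v) != len(w):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(v, w):
        a_blank, b_blank = a.startswith("_:"), b.startswith("_:")
        if a_blank != b_blank:
            return False
        if not a_blank:
            if a != b:                  # non-blank constants must coincide
                return False
        elif fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False                # renaming must be a bijection
    return True

def descendants(node, vector):
    """Transitive closure of childOf, where the children of a Skolem blank
    node are the members of its origin vector."""
    seen, stack = set(), list(vector.get(node, ()))
    while stack:
        k = stack.pop()
        if k not in seen:
            seen.add(k)
            stack.extend(vector.get(k, ()))
    return seen

def classify(rule_id, vector, contexts):
    unsafe = unmsafe = uncsafe = False
    for b_prime in vector:                      # every Skolem blank node b'
        for b in descendants(b_prime, vector):
            if b not in vector or b == b_prime:
                continue                        # only distinct Skolem blank nodes
            if rule_id[b] == rule_id[b_prime]:
                unmsafe = True
                if isomorphic(vector[b], vector[b_prime]):
                    unsafe = True
            if contexts[b] == contexts[b_prime]:
                uncsafe = True
    return {"safe": not unsafe, "msafe": not unmsafe, "csafe": not uncsafe}

EXAMPLE_7 = (
    {"_:b1": 1, "_:b2": 2, "_:b3": 3, "_:b4": 4},
    {"_:b1": ("a", "b"), "_:b2": ("_:b1",), "_:b3": ("_:b1",),
     "_:b4": ("_:b2", "_:b3")},
    {"_:b1": {"c2"}, "_:b2": {"c3"}, "_:b3": {"c3"}, "_:b4": {"c2"}},
)
EXAMPLE_10 = (
    {f"_:b{i}": 1 for i in range(1, 6)},
    {"_:b1": ("a", "b", "c", "d"), "_:b2": ("_:b1", "a", "b", "c"),
     "_:b3": ("_:b2", "_:b1", "a", "b"), "_:b4": ("_:b3", "_:b2", "_:b1", "a"),
     "_:b5": ("_:b4", "_:b3", "_:b2", "_:b1")},
    {f"_:b{i}": {"c3"} for i in range(1, 6)},   # assumption derived from the iterations
)
```

On these tables the sketch reports Example 7 as safe and msafe but not csafe, and Example 10 as safe but neither msafe nor csafe, in line with the strictness argument of Theorem 9.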

6.2 Csafe, Msafe, and Safe Quad-Systems: Computational Properties

In this section, we establish some of the essential computational properties of the quad-system classes which we defined in the previous section. The following property shows that for a safe (csafe, msafe) quad-system, the descendance graph of any Skolem blank node in its dChase is a directed acyclic graph (DAG):

Property 11 (DAG property). For a safe (csafe, msafe) quad-system $QS_C$, and for any blank node $b \in B_{sk}(dChase(QS_C))$, its descendance graph is a DAG.

Proof. By construction, as there exists no descendant for any constant $k \in C(QS_C)$, there cannot be any outgoing edge from any such $k$. Hence, no member of $C(QS_C)$ can be involved in cycles. Therefore, the only members that can be involved in cycles are the members of $C(dChase(QS_C)) \setminus C(QS_C) = B_{sk}(dChase(QS_C))$. But if there exists $\_{:}b \in B_{sk}(dChase(QS_C))$ such that there exists a cycle through $\_{:}b$, then this implies that $\_{:}b$ is a descendant of $\_{:}b$. This would violate the prerequisites of being safe (resp. csafe, resp. msafe), and imply that $QS_C$ is unsafe (resp. uncsafe, resp. unmsafe), which is a contradiction.

Since the descendance graph $G$ of any Skolem blank node $\_{:}b \in B_{sk}(dChase(QS_C))$ is such that $G$ is rooted at $\_{:}b$ and is acyclic, any directed path from $\_{:}b$ terminates at some node. Hence, one can use a tree traversal technique, such as preorder (visit a node first and then its children), to sequentially traverse the nodes in $G$. Algorithm 1 takes a descendance graph $G$ and unravels it into a tree. The algorithm first removes all the transitive edges from $G$, i.e. if there are $v, v' \in V$

Algorithm 1: UnRavel (Descendance Graph $G$)
/* procedure to unravel a descendance graph into a tree */
Input : descendance graph $G = \langle V, E, \lambda_r, \lambda_v, \lambda_c \rangle$
Output: a labeled tree $G$
begin
    $G = \langle V, E, \lambda_r, \lambda_v, \lambda_c \rangle$ := RemoveTransitiveEdges($G$);
    foreach node $v_o \in$ preOrder($G$) do
        if ($k$ = indegree($v_o$)) > 1 then
            $\{v_1, \ldots, v_k\}$ := getFreshNodes(); /* each $v_i \notin V$ is fresh */
            /* replace old node $v_o$ by the fresh nodes in $V$ */
            removeNodeFrom($v_o$, $V$); addNodesTo($\{v_1, \ldots, v_k\}$, $V$);
            foreach $(v_o, v') \in E$ do
                /* replace each outgoing edge from $v_o$ with a fresh outgoing edge from each fresh node $v_i$ */
                removeEdgeFrom($(v_o, v')$, $E$); addEdgesTo($\{(v_1, v'), \ldots, (v_k, v')\}$, $E$);
            $i$ := 1;
            foreach $(v', v_o) \in E$ do
                /* replace each incoming edge of $v_o$ with an incoming edge for a unique $v_i$ */
                removeEdgeFrom($(v', v_o)$, $E$); addEdgeTo($(v', v_i)$, $E$); $i$++;
    /* restrict node labels to the updated set of nodes in $V$ */
    $\lambda_r := \lambda_r|_V$, $\lambda_v := \lambda_v|_V$, $\lambda_c := \lambda_c|_V$;
    return $G$;

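Figure 6.2: Descendance graph of Fig. 6.1 unraveled into a tree. The root $\_{:}b_4$ ($4, \langle \_{:}b_2, \_{:}b_3 \rangle, \{c_2\}$) has children $\_{:}b_3$ ($3, \langle \_{:}b_1 \rangle, \{c_3\}$) and $\_{:}b_2$ ($2, \langle \_{:}b_1 \rangle, \{c_3\}$); each of them has its own copy of $\_{:}b_1$ ($1, \langle a, b \rangle, \{c_2\}$) with leaf children $a$ and $b$. Note: n.d. labels are not shown.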

with $(v, v') \in E$ and $G$ contains a path of length greater than 1 from $v$ to $v'$, then it removes $(v, v')$. Note that, in the resulting graph, the presence of a path from $v$ to $v''$ still gives us the information that $v''$ is a descendant of $v$. The algorithm then traverses the graph in preorder fashion. When it encounters a node $v$ with an indegree $k$ greater than one, it replaces $v$ with $k$ fresh nodes $v_1, \ldots, v_k$, and distributes the set of edges incoming to $v$ across $v_1, \ldots, v_k$, such that (i) each $v_i$ has at most one incoming edge, and (ii) every edge incoming to $v$ is incoming to some $v_i$, $i \in \{1, \ldots, k\}$. Outgoing edges of $v$ are copied for each $v_i$. Hence, after the above operation, each $v_i$ has indegree 1, whereas the outdegree of $v_i$ is the same as the outdegree of $v$, $i \in \{1, \ldots, k\}$. Hence, after all the nodes are visited, every node except the root in the new graph $G$ has indegree 1. $G$ is still rooted, connected, and acyclic, and is hence a tree. The algorithm terminates as there are no cycles in the graph, so that every traversal path eventually reaches a node with no children. For instance, the unraveling of the descendance graph of $\_{:}b_4$ in Fig. 6.1 is shown in Fig. 6.2. The following property holds for any Skolem blank node of a safe quad-system.

Property 12. For a safe quad-system $QS_C = \langle Q_C, R \rangle$, and any Skolem blank node in $dChase(QS_C)$, the unraveling (Algorithm 1) of its descendance graph results in a tree $t = \langle V, E, \lambda_r, \lambda_v, \lambda_c \rangle$ s.t.:
1. any leaf node of $t$ is from the set $C(QS_C)$,
2. any non-leaf node of $t$ is from the set $B_{sk}(dChase(QS_C))$,
3. $order(t) \leq w$, where $w = \max_{r \in R} |fr(r)|$,
4. there cannot be a path between $b \neq b' \in V$, with $\lambda_r(b) = \lambda_r(b')$ and $\lambda_v(b) \cong \lambda_v(b')$,
5. there cannot be a path between $b \neq b' \in V$, with $\lambda_r(b) = \lambda_r(b')$, if $QS_C$ is also msafe,
6. there cannot be a path between $b \neq b' \in V$, with $\lambda_c(b) = \lambda_c(b')$, if $QS_C$ is also csafe.

Proof.
1. Any node $n$ in the descendance graph is such that $n \in C(dChase(QS_C))$, and $C(dChase(QS_C)) = C(QS_C) \uplus B_{sk}(dChase(QS_C))$. Since any member $m \in B_{sk}(dChase(QS_C))$ is generated from an application of a BR with an assignment $\mu$ such that its frontier variables are assigned by $\mu$ with a set of constants, $m$ has at least one child. But, since $n$ is a leaf node, $n \in C(QS_C)$.
2. Since no member $m \in C(QS_C)$ can have descendants and any non-leaf node has children, $m$ cannot be a non-leaf node. Hence, non-leaf nodes must be from $B_{sk}(dChase(QS_C))$.
3. The order of $t$ is the maximal outdegree among the nodes of $t$, and the outdegree of a node is the number of children it has. Since any node in $t$ with non-zero outdegree is a Skolem blank node $\_{:}b$ generated by application of an assignment $\mu$ to $r = body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y}) \in R$, the number of children $\_{:}b$ has equals $\|\vec{x}\|$. Hence the order of $t$ is bounded by $w$.
4. Since any path from $b$ to $b'$ implies that $b'$ is a descendant of $b$, it must be the case that $\lambda_r(b) \neq \lambda_r(b')$ or $\lambda_v(b) \not\cong \lambda_v(b')$; otherwise the safety condition would be violated.
5. Similar to the above; immediate by definition.
6. Similar to the above; immediate by definition.
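The unraveling whose output Property 12 characterizes can be sketched in a few lines. This sketch does not split nodes in place as Algorithm 1 does; it removes the transitive edges and then unfolds the remaining DAG recursively from the root, which yields one copy of a node per incoming path and hence the same tree shape. The graph encoding (a dictionary from a node to the list of its descendants) and the names `reach`, `remove_transitive_edges`, `unravel`, `leaves` and `count_nodes` are assumptions made for illustration; the input is the descendance graph of Fig. 6.1.

```python
def reach(edges, start):
    """Nodes reachable from `start` by a path of length >= 1."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def remove_transitive_edges(edges):
    """Drop (v, w) whenever w is also reachable through another child of v."""
    return {v: [w for w in ws
                if not any(w in reach(edges, u) for u in ws if u != w)]
            for v, ws in edges.items()}

def unravel(edges, root):
    """Unfold the reduced DAG into a tree of (node, [subtrees]) pairs."""
    reduced = remove_transitive_edges(edges)
    def build(node):
        return (node, [build(child) for child in reduced.get(node, [])])
    return build(root)

def leaves(tree):
    node, children = tree
    return [node] if not children else [l for c in children for l in leaves(c)]

def count_nodes(tree):
    return 1 + sum(count_nodes(c) for c in tree[1])

# Descendance graph of _:b4 (Fig. 6.1): an edge (k, k') for every descendant k' of k.
FIG_6_1 = {
    "_:b4": ["_:b3", "_:b2", "_:b1", "a", "b"],
    "_:b3": ["_:b1", "a", "b"],
    "_:b2": ["_:b1", "a", "b"],
    "_:b1": ["a", "b"],
}
TREE = unravel(FIG_6_1, "_:b4")
```

On this input the resulting tree contains two copies of `_:b1`, one below `_:b3` and one below `_:b2`, and all of its leaves are constants of the original quad-system, as items 1 and 2 of Property 12 require.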

The property above is exploited to show that there exists a finite bound on the dChase size and its computation time.

Lemma 13. For any safe/msafe/csafe quad-system $QS_C = \langle Q_C, R \rangle$, the following holds: (i) the dChase size $\|dChase(QS_C)\| = O(2^{2^{\|QS_C\|}})$, (ii) $dChase(QS_C)$ can be computed in 2EXPTIME, (iii) if $\|R\|$ and the set of schema triples in $Q_C$ are fixed to a constant, then $\|dChase(QS_C)\|$ is a polynomial in $\|QS_C\|$ and can be computed in PTIME.

Proof. The proofs are provided for safe quad-systems, but since CSAFE $\subset$ MSAFE $\subset$ SAFE and since we are giving upper bounds, they also propagate trivially to msafe and csafe quad-systems.

(i) For any blank node in $dChase(QS_C)$, the size of its origin vector is upper bounded by $w = \max_{r \in R} |fr(r)|$. If $S$ is the set of all origin vectors of blank nodes in $dChase(QS_C)$, then the cardinality of the set $S' = S \backslash {\cong}$ of $\cong$-equivalence classes of $S$ is upper bounded by $(|U(QS_C)| + |L(QS_C)| + w)^w$, which means that $|S'| = O(2^{\|QS_C\|})$. Also, since the number of origin ruleId labels, $Rids$, can at most be $|R|$, the cardinality $|Rids \times S'| = O(2^{\|QS_C\|})$. For the descendance tree $t$ of any Skolem blank node of $dChase(QS_C)$, since there cannot be paths in $t$ between distinct $b$ and $b'$ such that $originRuleId(b) = originRuleId(b')$ and $originVector(b) \cong originVector(b')$, the length of any such path is upper bounded by $|Rids \times S'| = O(2^{\|QS_C\|})$. However, the above upper bound is loose, as additional filter BRs are needed to transform/back-propagate the vectors of constants associated with Skolem blank nodes generated by repetitive application of the same BR. For instance, consider the set of BRs in Example 10. The BR $r_1$ transforms the origin vector to a new vector each time it is applied. BRs $r_2$-$r_5$ deal with the back-propagation of these vectors to the input origin vectors of BR $r_1$. Such filter BRs rule out the case of a BR being applied to a quad that contains a Skolem blank node that was generated using the same BR on an isomorphic origin vector, ensuring that the safety criterion for the generated Skolem blank nodes is not violated. The number of such filter BRs required is polynomial w.r.t. the number of descendants with the same rule id, for a node in $t$. Hence, the depth of $t$ is polynomially bounded by $\|R\|$. (Note that the depth of $t$ is bounded by $|R|$ for msafe quad-systems. Also, since the set of origin context labels is bounded by the set of existential variables in $R$, the depth of $t$ is bounded by $\|R\|$ for csafe quad-systems.) Also, the order of the tree is bounded by $w$. Hence, any such tree can have at most $O(2^{\|QS_C\|})$ leaf nodes, $O(2^{\|QS_C\|})$ inner nodes, and $O(2^{\|QS_C\|})$ nodes. Since each of the leaf nodes can only be from $C(QS_C)$ and each of the inner nodes corresponds to an existential variable in $R$, the number of such possible trees is bounded double exponentially in $\|QS_C\|$, which in turn bounds the number of Skolem blank nodes generated in the dChase.

(ii) From (i), $\|dChase(QS_C)\|$ is double exponential in $\|QS_C\|$, and since each iteration adds at least one quad to the dChase, the number of iterations is bounded double exponentially in $\|QS_C\|$. Also, by Lemma 3, any iteration $i$ can be done in time $O(\|dChase_{i-1}(QS_C)\|^{\|R\|})$. By using (i), we get $\|dChase_{i-1}(QS_C)\| = O(2^{2^{\|QS_C\|}})$. Hence, we can infer that each iteration $i$ can be done in time $O(2^{\|R\| \cdot 2^{\|QS_C\|}})$. Also, since the number of iterations is at most double exponential, computing $dChase(QS_C)$ is in 2EXPTIME.

(iii) Since $\|R\|$ is fixed to a constant, the set of existential variables is also a constant. In this case, since the size of the frontier of any $r \in R$ is also a constant, the order and depth of any descendant tree $t$ of a Skolem blank node is a constant. Hence, the number of (leaf) nodes of $t$ is bounded by a constant. Also in this setting, the number of labels of inner nodes of $t$, which correspond to existential variables, is a constant, and the leaf nodes of $t$ can only be constants in $C(QS_C)$. Hence, the number of descendant trees and, consequently, the number of Skolem blank nodes generated is bounded by $O(|C(QS_C)|^z)$, where $z$ is a constant. Hence, the set of constants generated in $dChase(QS_C)$ is a polynomial in $\|QS_C\|$, and so is $\|dChase(QS_C)\|$. Since in any dChase iteration except the final one at least one quad is added, and also since the final dChase can have at most $O(\|QS_C\|^z)$ triples, the total number of iterations is bounded by $O(\|QS_C\|^z)$ (†). By Lemma 3, since any iteration $i$ can be computed in $O(\|dChase_{i-1}(QS_C)\|^{\|R\|})$ time, and since $\|R\|$ is a constant, the time required for each iteration is a polynomial in $\|dChase_{i-1}(QS_C)\|$, which is at most a polynomial in $\|QS_C\|$. Hence, any dChase iteration can be performed in time polynomial in the size of $QS_C$ (‡). From (†) and (‡), it can be concluded that the dChase can be computed in PTIME.

Lemma 14. For any safe/msafe/csafe quad-system, the following holds: (i) data complexity of CCQ entailment is in PTIME, (ii) combined complexity of CCQ entailment is in 2EXPTIME.

Proof. Note that the proofs are provided for safe quad-systems, but since CSAFE $\subset$ MSAFE $\subset$ SAFE and since we are giving upper bounds, they also propagate trivially to msafe and csafe quad-systems. Given a safe quad-system $QS_C = \langle Q_C, R \rangle$, since $dChase(QS_C)$ is finite, a boolean CCQ $CQ()$ can naively be evaluated by binding the set of constants in the dChase to the variables in $CQ()$, and then checking if any of these bindings is contained in $dChase(QS_C)$. The number of such bindings can at most be $\|dChase(QS_C)\|^{\|CQ()\|}$ (†).

(i) For data complexity, the size of the BRs $\|R\|$, the set of schema triples, and $\|CQ()\|$ are fixed to a constant. From Lemma 13 (iii), we know that under the above-mentioned settings the dChase can be computed in PTIME and is polynomial in the size of $QS_C$. Since $\|CQ()\|$ is fixed to a constant, and from (†), binding the set of constants in $dChase(QS_C)$ on $CQ()$ still gives a number of bindings that is worst-case polynomial in $\|QS_C\|$. Since membership of these bindings can be checked in the polynomially sized dChase in PTIME, the time required for CCQ entailment is in PTIME.

(ii) Since in this case $\|dChase(QS_C)\| = O(2^{2^{\|QS_C\|}})$ (‡), from (†) and (‡), binding the set of constants in $dChase(QS_C)$ to $CQ()$ amounts to $O(2^{\|CQ()\| \cdot 2^{\|QS_C\|}})$ bindings. Since the dChase is double exponential in $\|QS_C\|$, checking the membership of each of these bindings can be done in 2EXPTIME. Hence, the combined complexity is in 2EXPTIME.

Theorem 15. For any safe/msafe/csafe quad-system, the following holds: (i) the data complexity of CCQ entailment is PTIME-complete, (ii) the combined complexity of CCQ entailment is 2EXPTIME-complete.

Proof. (i) (Membership) See Lemma 14 for the membership in PTIME. (Hardness) Follows from the PTIME-hardness of data complexity of CCQ entailment for Range-Restricted quad-systems (Theorem 3 of Chapter 7), which are contained in safe/msafe/csafe quad-systems.
(ii) (Membership) See Lemma 14. (Hardness) Theorem 16 below shows that the class of context acyclic quad-systems is contained in the class of csafe quad-systems. Since we already showed that CCQ EP for context acyclic quad-systems is 2EXPTIME-hard, it follows that CCQ EP is 2EXPTIME-hard for csafe/msafe/safe quad-systems.
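The naive evaluation used in the proof of Lemma 14 can be sketched as follows. This is only an illustration under stated assumptions: quads are `(context, s, p, o)` tuples, query variables carry a leading `?` and occur only in the s, p, o positions, and the function name `entails` is hypothetical. The finite dChase used as input is $dChase_4$ of Example 7.

```python
from itertools import product

def entails(chase, query):
    """Boolean CCQ check: try every binding of chase constants to query variables."""
    variables = sorted({t for quad in query for t in quad[1:] if t.startswith("?")})
    constants = sorted({t for quad in chase for t in quad[1:]})
    # len(constants) ** len(variables) candidate bindings, as in bound (†)
    for values in product(constants, repeat=len(variables)):
        mu = dict(zip(variables, values))
        if all((quad[0],) + tuple(mu.get(t, t) for t in quad[1:]) in chase
               for quad in query):
            return True
    return False

# dChase_4 of Example 7
CHASE_7 = {
    ("c1", "a", "b", "c"),
    ("c2", "a", "b", "_:b1"),
    ("c3", "a", "_:b1", "_:b2"),
    ("c3", "b", "_:b1", "_:b3"),
    ("c2", "_:b4", "_:b2", "a"),
    ("c2", "_:b4", "_:b3", "b"),
}
JOIN_QUERY = [("c2", "?y", "?x", "a"), ("c3", "a", "?z", "?x")]
NO_MATCH_QUERY = [("c1", "?x", "?x", "c")]
```

The first query is entailed through the binding `?y` to `_:b4`, `?x` to `_:b2` and `?z` to `_:b1`; the second is not, since no quad in `c1` has equal subject and predicate. The exhaustive loop makes the exponential dependence on the number of query variables explicit, which is why the query size has to be fixed for the PTIME data-complexity bound.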

The theorem below establishes the fact that the class of csafe quad-systems contains the class of context acyclic quad-systems defined in the previous section.

Theorem 16. For any quad-system $QS_C = \langle Q_C, R \rangle$, if $QS_C$ is context acyclic, then $QS_C$ is csafe.

Proof. We prove the contrapositive, i.e. if a quad-system $QS_C$ is uncsafe, then it is not context acyclic. In order to prove the theorem, we give a few supporting claims:

Claim 1. If $b \in C(dChase(QS_C))$ is a Skolem blank node, then any $c \in originContexts(b)$ is a TGC. Since $b$ is a Skolem blank node, there exists a BR $r = body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y})$ s.t. $b = y[\mu^{ext(\vec{y})}]$, for some $y \in \{\vec{y}\}$. Hence, any $c \in originContexts(b)$ is s.t. $c : (s, p, o) \in head(r)(\vec{x}, \vec{y})$, and $s$ or $p$ or $o$ is an existentially quantified variable. This means that any $c \in originContexts(b)$ is a TGC.

Claim 2. For any quad-system $QS_C$, for any Skolem blank node $b$, and for any $c \notin originContexts(b)$, suppose there exists a quad $c : (s, p, o) \in dChase(QS_C)$, with $s = b \vee p = b \vee o = b$; then there exists a path from some $c_i \in originContexts(b)$ to $c$ in the context dependency graph. At the iteration of the dChase construction in which the Skolem blank node $b$ is introduced in $dChase(QS_C)$, $originContexts(b)$ are the only contexts that contain a triple in which $b$ occurs. The only immediate way by which $b$ can propagate to any other context $c' \notin originContexts(b)$ in a subsequent iteration is by the application of a BR $r \in R$ of the form (4.1), in which some $c_i \in originContexts(b)$ occurs in $body(r)$ and $c'$ occurs in $head(r)$. Since for any such BR $r$ there exists an edge from each $c_i$ to each $c'_j$, for $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, m\}$, in the context dependency graph, there is a path from some $c \in originContexts(b)$ to $c'$. The claim follows straightforwardly from the generalization of the above arguments.

For the claim below, we introduce the concept of sub-distance.

Definition 17. For any two blank nodes $b, b'$, sub-distance$(b, b')$ is defined inductively as:
• sub-distance$(b, b') = 0$, if $b' = b$;
• sub-distance$(b, b') = \infty$, if $b \neq b'$ and $b$ is not a descendant of $b'$;
• sub-distance$(b, b') = \min_{t \in \{\vec{x}[\mu]\}} \{$sub-distance$(b, t)\} + 1$, if $b'$ was generated by application of $\mu$ on $r = body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y})$, i.e. $b' = y_j[\mu^{ext(\vec{y})}]$, for some $y_j \in \{\vec{y}\}$, and $b$ is a descendant of $b'$.

Claim 3. For any two distinct blank nodes $b, b' \in C(dChase(QS_C))$, if $b$ is a descendant of $b'$, then there exists a path from $c$ to $c'$ in the context dependency graph, for some $c \in originContexts(b)$, for every $c' \in originContexts(b')$. If $b$ is a descendant of $b'$, then it should be the case that sub-distance$(b, b') \in \mathbb{N}^+$. We prove the claim by induction on the value of sub-distance$(b, b')$.

Base case. Suppose sub-distance$(b, b') = 1$, that is, there exist $r = body(r)(\vec{x}, \vec{z}) \to head(r)(\vec{x}, \vec{y}) \in R$ and an assignment $\mu$, with $applicable_R(r, \mu, dChase_k(QS_C))$, $b \in \{\vec{x}[\mu]\}$, and $b'$ is the result of the application of $\mu$ on $r$. This means that $b$ occurs in $body(r)(\vec{x}, \vec{z})[\mu] \subseteq dChase_k(QS_C)$, and consequently there exists a context $c$ with $c : (s, p, o) \in body(r)(\vec{x}, \vec{z})[\mu]$, with $s = b$ or $p = b$ or $o = b$, and $c : (s, p, o) \in dChase_k(QS_C)$. Suppose $c \in originContexts(b)$; then, since by construction there exists an edge from $c$ to every context identifier $c'$ occurring in $head(r)$, the base case follows. Otherwise, if $c \notin originContexts(b)$, then by Claim 2 it follows that there exists a path in the context dependency graph from some $c'' \in originContexts(b)$ to $c$. Also, since there exists an edge from $c$ to every context identifier $c'$ occurring in $head(r)$, the base case follows.

Hypothesis. Suppose $1 \leq$ sub-distance$(b, b') \leq k$; then there exists a path in the context dependency graph from $c$ to $c'$, for some $c \in originContexts(b)$, for every $c' \in originContexts(b')$.

Inductive step. Suppose sub-distance$(b, b') = k + 1$; then this implies that there exists a Skolem blank node $b''$ s.t. sub-distance$(b, b'') = k$ and sub-distance$(b'', b') = 1$. From the hypothesis it follows that there exists a path in the context dependency graph from some $c \in originContexts(b)$ to every $c'' \in originContexts(b'')$, and there exists a path in the context dependency graph from some $c'' \in originContexts(b'')$ to every $c' \in originContexts(b')$. This implies that there exists a path from some $c \in originContexts(b)$ to every $c' \in originContexts(b')$. Hence, the claim follows.

Suppose $QS_C$ is uncsafe; then by definition, there exist Skolem blank nodes $b, b'$ in $C(dChase(QS_C))$ s.t. $b$ is a descendant of $b'$ and $originContexts(b) = originContexts(b')$. By Claim 3, there exists a path in the context dependency graph from some $c \in originContexts(b)$ to every $c' \in originContexts(b')$. Since $originContexts(b) = originContexts(b')$, there exists a $c \in originContexts(b)$ s.t. there exists a cycle from $c$ to $c$ itself. Since by Claim 1 every context in $originContexts(b) = originContexts(b')$ is a TGC, $QS_C$ is, by definition, not context acyclic.
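The sub-distance of Definition 17, on which the induction above runs, can be computed directly from the origin vectors. The following is a small sketch assuming the origin vectors are available as a dictionary (constants that are not Skolem blank nodes simply have no entry); the function name `sub_distance` and the table `VECTORS_7`, which copies the origin vectors of Example 7, are hypothetical.

```python
import math

def sub_distance(b, b_prime, origin_vector):
    """Length of a shortest childOf chain leading from b up to b_prime."""
    if b == b_prime:
        return 0
    children = origin_vector.get(b_prime, ())
    if not children:                 # b_prime has no children, so b is no descendant
        return math.inf
    # if b is not a descendant of b_prime every branch is infinite, and so is the result
    return 1 + min(sub_distance(b, t, origin_vector) for t in children)

VECTORS_7 = {
    "_:b1": ("a", "b"),
    "_:b2": ("_:b1",),
    "_:b3": ("_:b1",),
    "_:b4": ("_:b2", "_:b3"),
}
```

For Example 7 this gives sub-distance 1 between `_:b1` and `_:b2`, 2 between `_:b1` and `_:b4`, and infinity between `_:b2` and `_:b3`, since neither is a descendant of the other.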

6.3 Procedure for Detecting Safe/Msafe/Csafe Quad-Systems

In this subsection, we present a procedure for deciding whether a given quad-system is safe (resp. msafe, resp. csafe) or not. If the quad-system is safe (resp. msafe, resp. csafe), the result of the procedure is a safe dChase (resp. msafe dChase, resp. csafe dChase) that contains the standard dChase and can be used for query answering. Since the safety (resp. msafety, resp. csafety) property of a quad-system is attributed to the dChase of the quad-system, the procedure performs the standard operations for computing the dChase, but also generates quads that indicate the origin ruleIds and origin vectors (resp. origin ruleIds, resp. origin-contexts) of each Skolem blank node generated. In each iteration, a test for safety is performed by checking for the presence of Skolem blank nodes that violate the safety (resp. msafety, resp. csafety) condition. If a violation is detected, a distinguished quad is generated and the safe (resp. msafe, resp. csafe) dChase construction is aborted prematurely. Otherwise, if there exists an iteration in which no new quad is generated, the safe (resp. msafe, resp. csafe) dChase computation stops with a completed safe (resp. msafe, resp. csafe) dChase that contains the standard dChase. Since all the additional quads produced for accounting information use a distinguished context identifier $c_c \notin C$, the computed safe (resp. msafe, resp. csafe) dChase itself can be used for standard query answering. Before getting to the details of the procedure, we give a few necessary definitions.

Definition 18 (Context Scope). The context scope of a term $t$ in a set of quad-patterns $Q$, denoted by $cScope(t, Q)$, is given as: $cScope(t, Q) = \{c \mid c : (s, p, o) \in Q,\ s = t \vee p = t \vee o = t\}$.

For any quad-system $QS_C = \langle Q_C, R \rangle$, let $c_c$ be an ad hoc context identifier such that $c_c \notin C$. Then, for $r_i = body(r_i)(\vec{x}, \vec{z}) \rightarrow head(r_i)(\vec{x}, \vec{y}) \in R$, we define the transformations $augS(r_i)$, $augM(r_i)$, $augC(r_i)$ as follows:

$$augS(r_i) = body(r_i)(\vec{x}, \vec{z}) \rightarrow head(r_i)(\vec{x}, \vec{y}) \wedge \forall y_j \in \{\vec{y}\} \Big[ \bigwedge_{x_k \in \{\vec{x}\}} c_c : (x_k, descendantOf, y_j) \wedge c_c : (y_j, descendantOf, y_j) \wedge c_c : (y_j, originRuleId, i) \wedge c_c : (y_j, originVector, \vec{x}) \Big]$$

It should be noted that $c_c : (y_j, originVector, \vec{x})$ is not a valid quad pattern, and is only used for notational brevity. In the actual implementation, vectors can be stored using an RDF container data structure such as rdf:List, rdf:Seq or

by typecasting it as a string.

$$augM(r_i) = body(r_i)(\vec{x}, \vec{z}) \rightarrow head(r_i)(\vec{x}, \vec{y}) \wedge \forall y_j \in \{\vec{y}\} \Big[ \bigwedge_{x_k \in \{\vec{x}\}} c_c : (x_k, descendantOf, y_j) \wedge c_c : (y_j, descendantOf, y_j) \wedge c_c : (y_j, originRuleId, i) \Big]$$

$$augC(r_i) = body(r_i)(\vec{x}, \vec{z}) \rightarrow head(r_i)(\vec{x}, \vec{y}) \wedge \forall y_j \in \{\vec{y}\} \Big[ \bigwedge_{x_k \in \{\vec{x}\}} c_c : (x_k, descendantOf, y_j) \wedge c_c : (y_j, descendantOf, y_j) \wedge \bigwedge_{c \in cScope(y_j, head(r_i))} c_c : (y_j, originContext, c) \Big]$$

Intuitively, the transformation augS/augM/augC on a BR $r_i$ augments the head part of $r_i$ with additional types of quad patterns, which are the following:
1. $c_c : (x_k, descendantOf, y_j)$, for every existentially quantified variable $y_j$ in $\vec{y}$ and universally quantified variable $x_k \in \{\vec{x}\}$. This is done because, during dChase computation, for any application of an assignment $\mu$ to $r_i$ such that $\vec{x}[\mu] = \vec{a}$, resulting in the generation of a Skolem blank node $\_{:}b = \mu^{ext(\vec{y})}(y_j)$, any $a_i \in \{\vec{a}\}$ is a descendant of $\_{:}b$. Hence, due to these additional quad-patterns, quads of the form $c_c : (a_i, descendantOf, \_{:}b)$ are also produced, which keeps track of the descendants of any Skolem blank node produced.
2. $c_c : (y_j, descendantOf, y_j)$, in order to also maintain the reflexivity of the 'descendantOf' relation.
3. $c_c : (y_j, originContext, c)$, for every existentially quantified variable $y_j$ in $\{\vec{y}\}$ and every $c \in cScope(y_j, head(r_i))$. This is done because, during dChase computation, for any application of an assignment $\mu$ on $r_i$ such that $\vec{x}[\mu] = \vec{a}$, resulting in the generation of a Skolem blank node $\_{:}b = \mu^{ext(\vec{y})}(y_j)$, $c$ is an origin context of $\_{:}b$. Hence, due to these additional quad-patterns,

quads of the form $c_c : (\_{:}b, originContext, c)$ are also produced. In this way, we keep track of the origin-contexts of any Skolem blank node produced.
4. $c_c : (y_j, originVector, \vec{x})$. This is done because, during the dChase computation, for any application of an assignment $\mu$ on $r_i$ such that $\vec{x}[\mu] = \vec{a}$, resulting in the generation of a Skolem blank node $\_{:}b = \mu^{ext(\vec{y})}(y_j)$, $\vec{a}$ is the origin vector of $\_{:}b$. Hence, due to these additional quad-patterns, quads of the form $c_c : (\_{:}b, originVector, \vec{a})$ are also produced. In this way, we keep track of the origin vector of any Skolem blank node produced.
5. $c_c : (y_j, originRuleId, i)$, for every existentially quantified variable $y_j$ in $\{\vec{y}\}$, in order to keep track of the ruleId of the BR used to create any Skolem blank node.

It can be noticed that for any BR $r_i$ without existentially quantified variables, the transformations augS/augM/augC leave $r_i$ unchanged. For any set of BRs $R$, let

$$augS(R) = \bigcup_{r_i \in R} augS(r_i) \cup \{c_c : (x_1, descendantOf, z_1) \wedge c_c : (z_1, descendantOf, x_2) \rightarrow c_c : (x_1, descendantOf, x_2)\}$$

and let $augM(R)$ and $augC(R)$ be defined in the same way, with $augM(r_i)$ and $augC(r_i)$, respectively, in place of $augS(r_i)$.

The function unSafeTest (resp. unMSafeTest, resp. unCSafeTest) defined below, given a BR $r_i = body(r_i)(\vec{x}, \vec{z}) \rightarrow head(r_i)(\vec{x}, \vec{y})$, an assignment $\mu$, and a quad-graph $Q$, checks whether the application of $\mu$ on $r_i$ violates the safety (resp. msafety, resp. csafety) condition on $Q$.

$unSafeTest(r_i, \mu, Q) = True$ iff $\exists\, \_{:}b, \_{:}b' \in B$, with all the following conditions being satisfied:
• $\_{:}b \in \{\vec{x}[\mu]\}$, and
• $c_c : (\_{:}b', descendantOf, \_{:}b) \in Q$, and

• $c_c : (\_{:}b', originRuleId, i) \in Q$, and
• $c_c : (\_{:}b', originVector, \vec{a}) \in Q$, and $\vec{a} \cong \vec{x}[\mu]$.

Intuitively, unSafeTest returns True if $\mu$ applied to $r_i$ will produce a fresh Skolem blank node $\_{:}b''$, whose child $\_{:}b \in \{\vec{x}[\mu]\}$, and according to the knowledge in $Q$, $\_{:}b'$ is a descendant of $\_{:}b$ such that the origin ruleId of $\_{:}b'$ is $i$ (which is also the origin ruleId of $\_{:}b''$) and the origin vector of $\_{:}b'$ is isomorphic to $\vec{x}[\mu]$ (which is also the origin vector of $\_{:}b''$). The functions unMSafeTest and unCSafeTest are similarly defined as follows:

$unMSafeTest(r_i, \mu, Q) = True$ iff $\exists\, \_{:}b, \_{:}b' \in B$, with all the following conditions being satisfied:
• $\_{:}b \in \{\vec{x}[\mu]\}$, and
• $c_c : (\_{:}b', descendantOf, \_{:}b) \in Q$, and
• $c_c : (\_{:}b', originRuleId, i) \in Q$.

$unCSafeTest(r_i, \mu, Q) = True$ iff $\exists\, \_{:}b, \_{:}b' \in B$, $\exists y_j \in \{\vec{y}\}$, with all the following being satisfied:
• $\_{:}b \in \{\vec{x}[\mu]\}$, and
• $c_c : (\_{:}b', descendantOf, \_{:}b) \in Q$, and
• $\{c \mid c_c : (\_{:}b', originContext, c) \in Q\} = cScope(y_j, head(r_i)(\vec{x}, \vec{y})) \setminus \{c_c\}$.

For any BR $r_i$ and an assignment $\mu$, the safe/msafe/csafe application of $\mu$ on $r_i$ w.r.t. a quad-graph $Q_C$ is defined as follows:

$$apply^{safe}(r_i, \mu, Q_C) = \begin{cases} unSafe, & \text{if } unSafeTest(r_i, \mu, Q_C) = True; \\ apply(r_i, \mu), & \text{otherwise;} \end{cases}$$

$$apply^{msafe}(r_i, \mu, Q_C) = \begin{cases} unMSafe, & \text{if } unMSafeTest(r_i, \mu, Q_C) = True; \\ apply(r_i, \mu), & \text{otherwise;} \end{cases}$$

$$apply^{csafe}(r_i, \mu, Q_C) = \begin{cases} unCSafe, & \text{if } unCSafeTest(r_i, \mu, Q_C) = True; \\ apply(r_i, \mu), & \text{otherwise;} \end{cases}$$

where $unSafe = c_c : (unsafe, unsafe, unsafe)$ (resp. $unMSafe = c_c : (unmsafe, unmsafe, unmsafe)$, resp. $unCSafe = c_c : (uncsafe, uncsafe, uncsafe)$) is a distinguished quad that is generated if the prerequisites of safety (resp. msafety, resp. csafety) are violated.

For any quad-system $QS_C = \langle Q_C, R \rangle$, we define its safe dChase $dChase^{safe}(QS_C)$ as follows:

$dChase^{safe}_0(QS_C) = Q_C$;
$dChase^{safe}_{m+1}(QS_C) = dChase^{safe}_m(QS_C) \cup apply^{safe}(r_i, \mu, dChase^{safe}_m(QS_C))$, if there exists $r_i \in augS(R)$, assignment $\mu$ such that $applicable_{augS(R)}(r_i, \mu, dChase^{safe}_m(QS_C))$;
$dChase^{safe}_{m+1}(QS_C) = dChase^{safe}_m(QS_C)$, otherwise; for any $m \in \mathbb{N}$.
$dChase^{safe}(QS_C) = \bigcup_{m \in \mathbb{N}} dChase^{safe}_m(QS_C)$
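As an illustration of the bookkeeping defined above, the following is a minimal sketch of cScope, of the head of augM, and of unMSafeTest. The representation is an assumption made only for this example: quads and quad patterns are tuples (c, s, p, o), blank nodes are strings prefixed with "_:", the bookkeeping context is the string "cc", and the given set of quads is assumed to be already closed under the transitivity rule for descendantOf. The function names are hypothetical.

```python
CC = "cc"  # bookkeeping context c_c, assumed not to occur in C


def is_blank(term):
    """Blank nodes are modelled as strings starting with '_:' (an assumption)."""
    return isinstance(term, str) and term.startswith("_:")


def c_scope(term, quad_patterns):
    """cScope(t, Q): contexts of the quad patterns in which `term` occurs."""
    return {c for (c, s, p, o) in quad_patterns if term in (s, p, o)}


def aug_m_head(rule_id, head, universal_vars, existential_vars):
    """Head of augM(r_i): the original head plus the accounting quad patterns."""
    extra = []
    for y in existential_vars:
        extra += [(CC, x, "descendantOf", y) for x in universal_vars]
        extra.append((CC, y, "descendantOf", y))        # reflexivity
        extra.append((CC, y, "originRuleId", rule_id))  # rule that creates y
    return list(head) + extra


def un_msafe_test(rule_id, mu, universal_vars, quads):
    """True iff some blank node b in x[mu] has a descendant b' created by rule_id."""
    for x in universal_vars:
        b = mu[x]
        if not is_blank(b):
            continue
        for (c, s, p, o) in quads:
            if (c == CC and p == "descendantOf" and o == b and is_blank(s)
                    and (CC, s, "originRuleId", rule_id) in quads):
                return True
    return False
```

In this sketch, the reflexive descendantOf quad is what lets the test catch a rule being applied directly to a blank node that the same rule created.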

The termination condition for safe dChase computation can be implemented using the following conditional: if there exists $m$ such that $dChase^{safe}_m(QS_C) = dChase^{safe}_{m+1}(QS_C)$, then $dChase^{safe}(QS_C) = dChase^{safe}_m(QS_C)$. Similarly, the dChases $dChase^{msafe}(QS_C)$ and $dChase^{csafe}(QS_C)$ are defined for msafe and csafe quad-systems, respectively.

We bring to the notice of the reader that although the application of any $augS(r)$ (resp. $augM(r)$, resp. $augC(r)$) produces quads of the form $c_c : (\_{:}b, descendantOf, \_{:}b)$, for any Skolem blank node $\_{:}b$ generated, no false alarm is raised in the unSafeTest (resp. unMSafeTest, resp. unCSafeTest). This is because unSafeTest (resp. unMSafeTest, resp. unCSafeTest) on a BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$ and assignment $\mu$ checks if the application of $\mu$ on $r$, with the fresh $\_{:}b''$ assigned to a $y_i \in \{\vec{y}\}$ by $\mu^{ext(\vec{y})}$, would have a child $\_{:}b \neq \_{:}b''$ assigned to some $x_i \in \{\vec{x}\}$ by $\mu$, such that there exists a quad of the form $c_c : (\_{:}b', descendantOf, \_{:}b)$ in the safe (resp. msafe, resp. csafe) dChase constructed so far, and $\_{:}b''$ and $\_{:}b'$ have the same origin ruleId and originVector (resp. originRuleId, resp.

originContexts). Note that in the above, $\_{:}b'$ should also be distinct from $\_{:}b''$, which rules out the case in which unSafeTest (resp. unMSafeTest, resp. unCSafeTest) returns True because of the detection of a blank node as a descendant of itself. The following theorem shows that the procedure described above for detecting unsafe quad-systems is sound and complete:

Theorem 19. For any quad-system $QS_C = \langle Q_C, R \rangle$, the quad unSafe (resp. unMSafe, resp. unCSafe) $\in dChase^{safe}(QS_C)$ (resp. $dChase^{msafe}(QS_C)$, resp. $dChase^{csafe}(QS_C)$), iff $QS_C$ is unsafe (resp. unmsafe, resp. uncsafe).

It should be noted that for any quad-system $QS_C = \langle Q_C, R \rangle$, $dChase^{safe}(QS_C)$ (resp. $dChase^{msafe}(QS_C)$, resp. $dChase^{csafe}(QS_C)$) is a finite set, and hence the iterative procedure which we described earlier terminates, regardless of whether $QS_C$ is safe (resp. msafe, resp. csafe) or not. This is because if $QS_C$ is safe (resp. msafe, resp. csafe), then, as we have seen before, there exists a double exponential bound on the number of quads in its dChase. Hence, there is an iteration in which no new quad is generated, which stops the computation. Otherwise, if $QS_C$ is unsafe (resp. unmsafe, resp. uncsafe), then from Theorem 19 we know that the quad unSafe (resp. unMSafe, resp. unCSafe) gets generated in $dChase^{safe}(QS_C)$ (resp. $dChase^{msafe}(QS_C)$, resp. $dChase^{csafe}(QS_C)$) in not more than $O(2^{2^{\|QS_C\|}})$ iterations. This implies that there exists an iteration $m$ such that the quad unSafe (resp. unMSafe, resp. unCSafe) is in $dChase^{safe}_m(QS_C)$ (resp. $dChase^{msafe}_m(QS_C)$, resp. $dChase^{csafe}_m(QS_C)$). W.l.o.g., let $m$ be the first such iteration. This means that there exists a BR $r_i \in R$ with head $head(r_i)(\vec{x}, \vec{y})$ and an assignment $\mu$ such that $applicable_{augS(R)}(r_i, \mu, dChase^{safe}_{m-1}(QS_C))$ (resp. $applicable_{augM(R)}(r_i, \mu, dChase^{msafe}_{m-1}(QS_C))$, resp. $applicable_{augC(R)}(r_i, \mu, dChase^{csafe}_{m-1}(QS_C))$) holds. By construction, since $head(r_i)[\mu^{ext(\vec{y})}]$ is not generated, and instead the quad unSafe (resp. unMSafe, resp. unCSafe) is generated, $applicable_{augS(R)}(r_i, \mu, dChase^{safe}_m(QS_C))$ (resp. $applicable_{augM(R)}(r_i, \mu, dChase^{msafe}_m(QS_C))$, resp. $applicable_{augC(R)}(r_i, \mu, dChase^{csafe}_m(QS_C))$) holds yet again. This means that the termination condition is satisfied at iteration $m + 1$, and hence the computation stops. Note that regardless of whether a given quad-system is safe (resp. msafe, resp. csafe) or not, the number of safe (resp. msafe, resp. csafe) dChase iterations is double exponentially bounded in the size of the quad-system. Consequently, we derive the following theorem.

Theorem 20. Recognizing whether a quad-system is safe/msafe/csafe is in 2EXPTIME.

Also notice that after running the procedure described above, if the quad unSafe (resp. unMSafe, resp. unCSafe) is not generated, then the safe (resp. msafe, resp. csafe) dChase itself can be used for CCQ answering, since in such a case the standard dChase is contained in the safe (resp. msafe, resp. csafe) dChase, and all the quads generated for accounting information have the context identifier $c_c$. Hence, for any safe (resp. msafe, resp. csafe) quad-system and any boolean CCQ that does not contain quad patterns of the form $c_c : (s, p, o)$, the dChase entails the CCQ iff the safe (resp. msafe, resp. csafe) dChase entails the CCQ.

A set of BRs $R$ is said to be universally safe (resp. msafe, resp. csafe) iff, for any quad-graph $Q_C$, the quad-system $\langle Q_C, R \rangle$ is safe (resp. msafe, resp. csafe). For any set of BRs $R$ whose set of context identifiers is $C$, let $U_R$ be the set of URIs that occur in the triple patterns of $R$ plus an additional ad hoc blank node $\_{:}b_{crit}$. The critical quad-graph of $R$ is defined as the set $\{c : (s, p, o) \mid c \in C, \{s, p, o\} \subseteq U_R\}$. The following property illustrates how the critical quad-graph of a set of BRs $R$ can be used to determine whether or not $R$ is universally safe/msafe/csafe.

Property 21. A set of BRs $R$ is universally safe (resp. msafe, resp. csafe) iff $\langle Q^{crit}_C, R \rangle$ is safe (resp. msafe, resp. csafe), where $Q^{crit}_C$ is the critical quad-graph of $R$.
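The construction of the critical quad-graph can be sketched as follows. This is only an illustration under assumed conventions: quads are tuples (c, s, p, o), the contexts and the URIs of the BRs are passed in as plain collections of strings, and the function name and the "_:bcrit" label are hypothetical.

```python
from itertools import product


def critical_quad_graph(contexts, rule_uris, crit="_:bcrit"):
    """All quads c:(s, p, o) with c in C and s, p, o drawn from U_R.

    U_R is taken to be the URIs of the BRs plus one ad hoc blank node.
    """
    terms = set(rule_uris) | {crit}
    return {(c, s, p, o)
            for c in contexts
            for s, p, o in product(terms, repeat=3)}
```

The resulting graph has $|C| \cdot |U_R|^3$ quads, so it is polynomial in the size of the set of BRs.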


Chapter 7 Range Restricted Quad-Systems

In this chapter, we investigate the complexity of CCQ entailment over quad-systems whose BRs do not have existentially quantified variables.

7.1

Restricting to Range Restricted BRs

If we prohibit the occurrence of existentially quantified variables in BRs of the form (4.1), then the resulting BRs must be of the form:

$$c_1 : t_1(\vec{x}, \vec{z}) \wedge \ldots \wedge c_n : t_n(\vec{x}, \vec{z}) \rightarrow c'_1 : t'_1(\vec{x}) \wedge \ldots \wedge c'_m : t'_m(\vec{x})$$

Note that any set of BRs $R$ of the form above can be replaced by a semantically equivalent set $R'$, such that each $r \in R'$ is of the form:

$$c_1 : t_1(\vec{x}, \vec{z}), \ldots, c_n : t_n(\vec{x}, \vec{z}) \rightarrow c'_1 : t'_1(\vec{x}) \qquad (7.1)$$
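A minimal sketch of this rewriting, assuming a hypothetical representation in which a BR is a pair (body, head) of lists of quad patterns: each head quad pattern gets its own copy of the body.

```python
def split_heads(rule):
    """Replace body -> h1 ∧ ... ∧ hm by the m rules body -> hj of form (7.1)."""
    body, head = rule
    return [(list(body), [h]) for h in head]


def normalize(rules):
    """Apply the splitting to every BR of a set R, yielding R'."""
    return [r for rule in rules for r in split_heads(rule)]
```

Since the body is copied once per head quad pattern, a rule with $n$ body patterns and $m$ head patterns yields $m$ rules of size $n + 1$ each.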

Also, $\|R'\|$ is at most quadratic in $\|R\|$, and hence, w.l.o.g., we assume that each $r \in R$ is of the form (7.1). Borrowing the parlance from the ∀∃ rules setting, where rules whose variables in the head part are contained in the variables in the body part are called range restricted rules [6], we call such BRs range restricted (RR) BRs. We call a quad-system whose BRs are all of RR-type an RR quad-system. Since there exist no existentially quantified variables in the BRs of an RR quad-system, no Skolem blank nodes are produced during dChase computation.

Hence, there can be no violation of the context acyclicity condition in Chapter 5 or of the safety/msafety/csafety conditions in Chapter 6. Consequently, the class of RR quad-systems is contained in the class of safe/msafe/csafe quad-systems, and is also a FEC. Of course, this containment is strict, as any quad-system that contains a BR with an existential variable is not RR. Since one can determine whether a given quad-system is RR by simply iterating through its set of BRs and checking their syntax, the following holds:

Theorem 1. Recognizing whether a quad-system is RR can be done in linear time.

In the following, we see that when restricting to RR BRs, the size of the dChase becomes polynomial w.r.t. the size of the input quad-system, and the complexity of CCQ entailment is further reduced compared to safe/msafe/csafe quad-systems.

Lemma 2. For any RR quad-system $QS_C = \langle Q_C, R \rangle$, the following holds: (i) $\|dChase(QS_C)\| = O(\|QS_C\|^4)$, (ii) $dChase(QS_C)$ can be computed in EXPTIME, (iii) if $\|R\|$ is fixed to be a constant, $dChase(QS_C)$ can be computed in PTIME.

Proof. (i) Note that the number of constants in $QS_C$ is roughly equal to $\|QS_C\|$. As no existential variable occurs in any BR in an RR quad-system $QS_C$, the set of constants $C(dChase(QS_C))$ is contained in $C(QS_C)$. Since each $c : (s, p, o) \in dChase(QS_C)$ is such that $c, s, p, o \in C(QS_C)$, $|dChase(QS_C)| = O(|C(QS_C)|^4)$. Hence $\|dChase(QS_C)\| = O(|C(QS_C)|^4) = O(\|QS_C\|^4)$.

(ii) Since from (i) $|dChase(QS_C)| = O(\|QS_C\|^4)$, and in each iteration of the dChase at least one new quad is added, the number of iterations cannot exceed $O(\|QS_C\|^4)$. Since by Lemma 3, each iteration $i$ of dChase computation requires $O(|R| * \|dChase_{i-1}(QS_C)\|^{rs})$ time, where $rs = \max_{r \in R} \|r\|$ and $rs \leq \|QS_C\|$, the time required for each iteration is of the order $O(2^{\|QS_C\|})$. Although the number of iterations is a polynomial, each iteration requires an

exponential amount of time w.r.t. $\|QS_C\|$. Hence the time complexity of dChase computation is in EXPTIME.

(iii) We know that the time taken for the application of any BR in $R$ is $O(\|dChase_{i-1}(QS_C)\|^{\|R\|})$. Since $\|R\|$ is fixed to a constant, any BR in $R$ can be applied in PTIME. Hence, each dChase iteration can be computed in PTIME. Also, since the number of iterations is a polynomial in $\|QS_C\|$, computing the dChase is in PTIME.

Theorem 3. Data complexity of CCQ entailment over RR quad-systems is PTIME-complete.

Proof. (Membership) Follows from the membership in P of the data complexity of CCQ entailment for safe quad-systems, whose expressivity subsumes the expressivity of RR quad-systems (Theorem 15 of Chapter 6).

(Hardness) In order to prove P-hardness, we reduce from a well known P-complete problem, 3HornSat, i.e. the satisfiability of propositional Horn formulas with at most 3 literals. Note that a (propositional) Horn formula is a propositional formula of the form:

$$P_1 \wedge \ldots \wedge P_n \rightarrow P_{n+1} \qquad (7.2)$$

where $P_i$, for $1 \leq i \leq n + 1$, are either propositional variables or the constants $t$, $f$, which represent true and false, respectively. Note that for any propositional variable $P$, the fact that "$P$ holds" is represented by the formula $t \rightarrow P$, and "$P$ does not hold" is represented by the formula $P \rightarrow f$. A 3Horn formula is a formula of the form (7.2), where $1 \leq n \leq 2$. Note that any (set of) Horn formula(s) $\Phi$ can be transformed in polynomial time to a polynomially sized set $\Phi'$ of 3Horn formulas, by introducing auxiliary propositional variables, such that $\Phi$ is satisfiable iff $\Phi'$ is satisfiable. A pure 3Horn formula is a 3Horn formula of the form (7.2), where $n = 2$. Any 3Horn formula $\phi$ that is not pure can be trivially converted to an equivalent pure form by appending a $\wedge\, t$ on the body part

of $\phi$. For instance, $P \rightarrow Q$ can be converted to $P \wedge t \rightarrow Q$. Hence, w.l.o.g. we assume that any set of 3Horn formulas is pure, and that each of its formulas is of the form:

$$P_1 \wedge P_2 \rightarrow P_3 \qquad (7.3)$$

In the following, we reduce the satisfiability problem of pure 3Horn formulas to the CCQ entailment problem over a quad-system whose set of schema triples, set of BRs, and CCQ $CQ$ are all fixed. For any set of pure 3Horn formulas $\Phi$, we construct the quad-system $QS_C = \langle Q_C, R \rangle$, where $C = \{c_t, c_f\}$. For any formula $\phi \in \Phi$ of the form (7.3), $Q_C$ contains a quad $c_f : (P_1, P_2, P_3)$. In addition, $Q_C$ contains a quad $c_t : (t, \text{rdf:type}, T)$. $R$ is the singleton that contains only the following fixed BR:

$$c_t : (x_1, \text{rdf:type}, T),\ c_t : (x_2, \text{rdf:type}, T),\ c_f : (x_1, x_2, x_3) \rightarrow c_t : (x_3, \text{rdf:type}, T)$$

Let $CQ$ be the fixed query $c_t : (f, \text{rdf:type}, T)$. Now, it is easy to see that $QS_C \models CQ$ iff $\Phi$ is not satisfiable.

Theorem 4. Combined complexity of CCQ entailment over RR quad-systems is in EXPTIME.

Proof. (Membership) By Lemma 2, for any RR quad-system $QS_C$, its dChase $dChase(QS_C)$ can be computed in EXPTIME. Also by Lemma 2, its dChase size $\|dChase(QS_C)\|$ is a polynomial w.r.t. $\|QS_C\|$. A boolean CCQ $CQ()$ can naively be evaluated by assigning the constants in the dChase to the variables in $CQ()$ in all possible ways, and then checking if any of these groundings is contained in $dChase(QS_C)$. The number of such groundings can at most be $\|dChase(QS_C)\|^{\|CQ()\|}$ (†). Since $\|dChase(QS_C)\|$ is a polynomial in $\|QS_C\|$, there are an exponential number of groundings w.r.t. $\|CQ()\|$. Since containment of each of these groundings can be checked in time polynomial w.r.t. the size of $dChase(QS_C)$, and since $\|dChase(QS_C)\|$ is a polynomial w.r.t. $\|QS_C\|$, the time complexity of CCQ entailment is in EXPTIME.
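The reduction used in the hardness part of Theorem 3 can be sketched as follows. This is a minimal illustration, not the construction's formal definition: pure 3Horn formulas are assumed to be given as triples (P1, P2, P3) of strings, with "t" and "f" standing for the constants, the contexts are named "ct" and "cf", and the dChase of the single fixed BR is computed by a naive fixpoint loop. All function names are hypothetical.

```python
RDF_TYPE = "rdf:type"


def horn_to_quads(formulas):
    """Q_C: one quad cf:(P1, P2, P3) per formula, plus ct:(t, rdf:type, T)."""
    quads = {("ct", "t", RDF_TYPE, "T")}
    quads |= {("cf", p1, p2, p3) for (p1, p2, p3) in formulas}
    return quads


def d_chase(quads):
    """Fixpoint of the fixed BR
    ct:(x1,type,T), ct:(x2,type,T), cf:(x1,x2,x3) -> ct:(x3,type,T)."""
    quads = set(quads)
    changed = True
    while changed:
        changed = False
        true_terms = {s for (c, s, p, o) in quads
                      if c == "ct" and p == RDF_TYPE and o == "T"}
        for (c, x1, x2, x3) in list(quads):
            if c == "cf" and x1 in true_terms and x2 in true_terms:
                derived = ("ct", x3, RDF_TYPE, "T")
                if derived not in quads:
                    quads.add(derived)
                    changed = True
    return quads


def unsatisfiable(formulas):
    """Phi is unsatisfiable iff the fixed query ct:(f, rdf:type, T) is entailed."""
    return ("ct", "f", RDF_TYPE, "T") in d_chase(horn_to_quads(formulas))
```

For example, the formulas $t \wedge t \rightarrow P$, $P \wedge t \rightarrow Q$, $Q \wedge P \rightarrow f$ would be passed as `[("t", "t", "P"), ("P", "t", "Q"), ("Q", "P", "f")]`.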

Concerning the combined complexity of CCQ entailment of RR quad-systems, we leave the lower bounds open.

7.2

Restricted RR Quad-Systems

We call those quad-systems with BRs of form (7.1) with a fixed bound on $n$ restricted RR quad-systems. They can be further classified as linear, quadratic, cubic, ... quad-systems, when $n = 1, 2, 3, \ldots$, respectively.

Theorem 5. Data complexity of CCQ entailment over restricted RR quad-systems is P-complete.

Proof. The proof is the same as in Theorem 3, since the size of the BRs is fixed to a constant.

Theorem 6. Combined complexity of CCQ entailment over restricted RR quad-systems is NP-complete.

Proof. Let the problem of deciding if $QS_C \models CQ()$ be called DP'.

(Membership) For any $QS_C$ whose rules are of restricted RR-type, the size of any $r \in R$ is a constant. Hence, by Lemma 3, any dChase iteration can be computed in PTIME. Since the number of iterations is also polynomial in $\|QS_C\|$, $dChase(QS_C)$ can be computed in PTIME in the size of $QS_C$, and $dChase(QS_C)$ has a polynomial number of constants. Hence, we can guess an assignment $\mu$ for all the existential variables in the CCQ $CQ()$ to the set of constants in $dChase(QS_C)$. Then, one can evaluate the CCQ by checking if $c : (s, p, o) \in dChase(QS_C)$, for each $c : (s, p, o) \in CQ()[\mu]$, which can be done in time $O(\|CQ\| * \|dChase(QS_C)\|)$. The problem is hence in non-deterministic PTIME, which implies that DP' is in NP.

(Hardness) We show that DP' is NP-hard by reducing the well known NP-hard problem of 3-colorability to DP'. Given a graph $G = \langle V, E \rangle$, where $V = \{v_1, \ldots, v_n\}$ is the set of nodes and $E \subseteq V \times V$ is the set of edges, the 3-colorability

problem is to decide if there exists a labeling function $l : V \rightarrow \{r, b, g\}$ that assigns each $v \in V$ to an element in $\{r, b, g\}$ such that the condition $(v, v') \in E \rightarrow l(v) \neq l(v')$, for each $(v, v') \in E$, is satisfied. One can construct a quad-system $QS_c = \langle Q_c, \emptyset \rangle$, where $graph_{Q_c}(c)$ has the following triples:

$$\{(r, edge, b), (r, edge, g), (b, edge, g), (b, edge, r), (g, edge, r), (g, edge, b)\}$$

Let $CQ$ be the boolean CCQ:

$$\exists v_1, \ldots, v_n \bigwedge_{(v, v') \in E} [\, c : (v, edge, v') \wedge c : (v', edge, v) \,]$$

Then, it can be seen that $G$ is 3-colorable iff $QS_c \models CQ$.
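The 3-colorability reduction can be sketched as follows. The sketch replaces the non-deterministic guess of an assignment by an exhaustive search over all assignments, so it illustrates the encoding rather than the NP procedure itself. Quads are assumed to be tuples (c, s, p, o), vertices are strings that serve as the query variables, and the function names are hypothetical.

```python
from itertools import product

# graph_{Q_c}(c): an 'edge' triple between every pair of distinct colours
COLOR_QUADS = {("c", a, "edge", b) for a in "rbg" for b in "rbg" if a != b}


def ccq_for_graph(edges):
    """Quad patterns c:(v, edge, v') and c:(v', edge, v) for every edge (v, v')."""
    patterns = []
    for (v, w) in edges:
        patterns += [("c", v, "edge", w), ("c", w, "edge", v)]
    return patterns


def entails(quads, patterns):
    """Naive boolean CCQ evaluation: try every assignment of constants to variables."""
    variables = sorted({t for (_, s, _, o) in patterns for t in (s, o)})
    constants = sorted({t for (_, s, _, o) in quads for t in (s, o)})
    for values in product(constants, repeat=len(variables)):
        mu = dict(zip(variables, values))
        if all((c, mu[s], p, mu[o]) in quads for (c, s, p, o) in patterns):
            return True
    return False


def three_colorable(edges):
    return entails(COLOR_QUADS, ccq_for_graph(edges))
```

An assignment that satisfies the query is exactly a labeling function $l$, since an 'edge' quad exists only between distinct colours.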


Chapter 8 Quad-Systems vs Forall-Existential rules

In this chapter, we formally compare the formalism of quad-systems with forall-existential (∀∃) rules. In the realm of ∀∃ rule sets, a conjunctive query (CQ) is an expression of the form:

$$\exists \vec{y}\ p_1(\vec{x}, \vec{y}) \wedge \ldots \wedge p_r(\vec{x}, \vec{y}) \qquad (8.1)$$

where $p_i(\vec{x}, \vec{y})$, for $1 \leq i \leq r$, are predicate atoms over the vectors $\vec{x}$ or $\vec{y}$. A boolean CQ is defined as usual. The decision problem of whether $P \models_{fol} Q$, for a ∀∃ rule set $P$ and a CQ $Q$, is called the CQ EP, where $\models_{fol}$ is the standard first order logic entailment relation. For any quad-graph $Q_C = \{c_1 : (s_1, p_1, o_1), \ldots, c_r : (s_r, p_r, o_r)\}$, let $r_{Q_C}$ be the BR

$$\rightarrow \exists y_{b_1}, \ldots, y_{b_q}\ c_1 : (s_1, p_1, o_1)[\mu_B] \wedge \ldots \wedge c_r : (s_r, p_r, o_r)[\mu_B],$$

where $\{\_{:}b_1, \ldots, \_{:}b_q\}$ is the set of blank nodes in $Q_C$, and $\mu_B$ is the substitution function $\{\_{:}b_i \rightarrow y_{b_i}\}_{i = 1, \ldots, q}$ that assigns each blank node to a fresh existentially quantified variable. It can be noted that the quad-systems $\langle Q_C, R \rangle$ and $\langle \emptyset, R \cup \{r_{Q_C}\} \rangle$ are semantically equivalent. The following definition gives the translation functions that will be necessary to establish the relation between quad-systems and ∀∃ rule sets.

Definition 1 (Translations $\tau_q$, $\tau_{br}$, $\tau_{ccq}$, $\tau$). The translation function $\tau_q$ from the set of quad patterns to the set of ternary atoms is defined as: for any quad-pattern $c : (s, p, o)$, $\tau_q(c : (s, p, o)) = c(s, p, o)$. The translation function $\tau_{br}$ from the set of BRs to the set of ∀∃ rules is defined as: for any BR $r$ of the form (4.1),

$$\tau_{br}(r) = \forall \vec{x} \forall \vec{z}\ [\tau_q(c_1 : t_1(\vec{x}, \vec{z})) \wedge \ldots \wedge \tau_q(c_n : t_n(\vec{x}, \vec{z})) \rightarrow \exists \vec{y}\ \tau_q(c'_1 : t'_1(\vec{x}, \vec{y})) \wedge \ldots \wedge \tau_q(c'_m : t'_m(\vec{x}, \vec{y}))].$$

The translation function $\tau$ from the set of quad-systems to forall-existential rule sets is defined as: for any quad-system $QS_C = \langle Q_C, R \rangle$, $\tau(QS_C) = \tau_{br}(R) \cup \{\tau_{br}(r_{Q_C})\}$, where $\tau_{br}(R) = \bigcup_{r \in R} \tau_{br}(r)$. The translation function $\tau_{ccq}$ from the set of boolean CCQs to the set of boolean CQs is defined as: for any boolean CCQ $CQ = \exists \vec{y}\ c_1 : t_1(\vec{a}, \vec{y}) \wedge \ldots \wedge c_r : t_r(\vec{a}, \vec{y})$, $\tau_{ccq}(CQ)$ is:

$$\exists \vec{y}\ \tau_q(c_1 : t_1(\vec{a}, \vec{y})) \wedge \ldots \wedge \tau_q(c_r : t_r(\vec{a}, \vec{y})).$$

The following property gives the relation between CCQ entailment of unrestricted quad-systems and standard first order CQ entailment of ∀∃ rule sets.

Property 2. For any quad-system $QS_C$ and CCQ $CQ$, $QS_C \models CQ$ iff $\tau(QS_C) \models_{fol} \tau_{ccq}(CQ)$.

Proof. Notice that every context $c \in C$ becomes a ternary predicate symbol in the resulting translation. Also, $\tau(QS_C)$ is a ∀∃ rule set, and for any CCQ $CQ$, $\tau_{ccq}(CQ)$ is a CQ. In order to construct the restricted chase for $\tau(QS_C)$, suppose that $\prec_q$ is also extended to sets of instances such that for any two quad-graphs $Q_C$, $Q'_{C'}$, $Q_C \prec_q Q'_{C'}$ iff $\tau_q(Q_C) \prec_q \tau_q(Q'_{C'})$. Suppose $\prec$ is extended similarly to sets of instances. Also assume that during the construction of the standard chase $chase(\tau(QS_C))$ of $\tau(QS_C)$, for any application of a $\tau_{br}(r)$ with existentially quantified variables,

with $r \in R$, the Skolem blank nodes generated in $chase(\tau(QS_C))$ follow the same order as they are generated in $dChase(QS_C)$. Also let us extend the rule applicability function to the ∀∃ rules setting such that for any set of BRs $R$, any $r \in R$, quad-graph $Q'_{C'}$, and assignment $\mu$, $applicable_R(r, \mu, Q'_{C'})$ iff $applicable_{\tau_{br}(R)}(\tau_{br}(r), \mu, \tau_q(Q'_{C'}))$. Now it can be seen that $dChase_0(\langle \emptyset, R \cup \{r_{Q_C}\} \rangle) = \emptyset$, $chase_0(\tau(QS_C)) = \emptyset$, $dChase_1(QS_C) = apply(r_{Q_C}, \mu_\emptyset)$, where $\mu_\emptyset$ is the empty function, and $chase_1(\tau(QS_C)) = apply(\tau_{br}(r_{Q_C}), \mu_\emptyset)$, and so on. It is straightforward to see that, for any $m \in \mathbb{N}$, $\tau_q(dChase_m(\langle \emptyset, R \cup \{r_{Q_C}\} \rangle)) = chase_m(\tau(QS_C))$. Consequently, $\tau_q(dChase(QS_C)) = chase(\tau(QS_C))$, and $\{CQ\}[\sigma] \subseteq dChase(QS_C)$ iff $\{\tau_{ccq}(CQ)\}[\sigma] \subseteq chase(\tau(QS_C))$. Hence, applying Theorem 2 of Chapter 6 and the analogous theorem for ∀∃ rule sets from Deutch et al. [35], it follows that for any quad-system $QS_C = \langle Q_C, R \rangle$ and a boolean CCQ $CQ$, $QS_C \models CQ$ iff $\tau(QS_C) \models_{fol} \tau_{ccq}(CQ)$.

Theorem 3. There exists a polynomial time translation function $\tau$ (resp. $\tau_{ccq}$) from the set of unrestricted quad-systems (resp. CCQs) to the set of ∀∃ rule sets (resp. CQs), such that for any unrestricted quad-system $QS_C$ and a CCQ $CQ$, $QS_C \models CQ$ iff $\tau(QS_C) \models_{fol} \tau_{ccq}(CQ)$.

Proof. It is easy to see that $\tau_q$, $\tau_{br}$, $\tau$, and $\tau_{ccq}$ in Definition 1 can be implemented using simple syntax transformations, by iterating through the respective components of a quad-system/CCQ, and the time complexity of these functions is linear w.r.t. their inputs.

Notice that for any CCQ $CQ$ (resp. CQ $Q$), $\rightarrow CQ$ (resp. $\rightarrow Q$) is a bridge (resp. ∀∃) rule with an empty body. Also, since for any quad-graph $Q_C$ the translation function $\tau_{br}$ defined above can directly be applied on $r_{Q_C}$ to obtain a ∀∃ rule, the following theorem immediately follows:

Theorem 4. For quad-systems, the EPs: (i) quad EP, (ii) quad-graph EP, (iii) BR EP, (iv) BRs EP, (v) Quad-System EP, and (vi) CCQ EP are polynomially

reducible to entailment of ∀∃ rule sets.

A ∀∃ rule set $P$ is said to be a ternary ∀∃ rule set iff all the predicate symbols in the vocabulary of $P$ are of arity less than or equal to three. $P$ is a purely ternary rule set iff all the predicate symbols in the vocabulary of $P$ are of arity three. Similarly, a (purely) ternary CQ is defined. The following theorem gives the relation between the CQ entailment problem of ∀∃ rule sets and the CCQ EP of unrestricted quad-systems.

Theorem 5. There exists a polynomial time translation function $\nu$ (resp. $\nu_{cq}$) from ternary ∀∃ rule sets (resp. ternary CQs) to unrestricted quad-systems (resp. CCQs) such that for any ternary ∀∃ rule set $P$ and a ternary CQ $Q$, $P \models_{fol} Q$ iff $\nu(P) \models \nu_{cq}(Q)$.

Proof. Note that the CQ EP of any ternary ∀∃ rule set $P$, whose set of predicate symbols is $\mathcal{P}$, and CQ $Q$ over $\mathcal{P}$, can be polynomially reduced to the CQ EP of a purely ternary rule set $P'$ and purely ternary CQ $Q'$, by the following transformation function $\chi$. Let $\square$ be an ad hoc fresh URI; $\chi$ is such that for any ternary atom $c(s, p, o)$, $\chi(c(s, p, o)) = c(s, p, o)$. For any binary atom $c(s, p)$, $\chi(c(s, p)) = c(s, p, \square)$, and for any unary atom $c(s)$, $\chi(c(s)) = c(s, \square, \square)$. For any ∀∃ rule $r$ of the form (2.2),

$$\chi(r) = \forall \vec{x} \forall \vec{z}\ [\chi(p_1(\vec{x}, \vec{z})) \wedge \ldots \wedge \chi(p_n(\vec{x}, \vec{z})) \rightarrow \exists \vec{y}\ \chi(p'_1(\vec{x}, \vec{y})) \wedge \ldots \wedge \chi(p'_m(\vec{x}, \vec{y}))]$$

And, for any ∀∃ rule set $P$, $\chi(P) = \bigcup_{r \in P} \chi(r)$. For any CQ $Q$, $\chi(Q)$ is similarly defined. Note that for any ternary ∀∃ rule set $P$ and ternary CQ $Q$, $\chi(P)$ (resp. $\chi(Q)$) is purely ternary, and $P \models_{fol} Q$ iff $\chi(P) \models_{fol} \chi(Q)$. Also, it can straightforwardly be seen that $\tau_{br}^{-1}(\chi(P))$ (resp. $\tau_{ccq}^{-1}(\chi(Q))$) is a set of BRs (resp. a CCQ). Suppose $\nu(P)$ is such that $\nu(P) = QS_C = \langle \emptyset, \tau_{br}^{-1}(\chi(P)) \rangle$. Intuitively, $C$ contains a context identifier $c$ for each predicate symbol $c \in \mathcal{P}$. Also suppose $\nu_{cq}(Q) = \tau_{ccq}^{-1}(\chi(Q))$. Notice that $\nu_{cq}(Q)$ is a CCQ. It can

straightforwardly be seen that $\nu$ and $\nu_{cq}$ can be computed in polynomial time, and $P \models_{fol} Q$ iff $\nu(P) \models \nu_{cq}(Q)$.

Thanks to Theorem 3 and Theorem 5, the following theorem immediately holds:

Theorem 6. The CCQ EP over quad-systems is polynomially equivalent to the CQ EP over ternary ∀∃ rule sets.

By virtue of the theorem above, we derive the following property:

Property 7. For quad-systems, the Quad EP, Quad-graph EP, BR(s) EP, and Quad-system EP are polynomially reducible to the CCQ EP.

Proof. The following claim is folklore in the realm of ∀∃ rules.

Claim (1). The ∀∃ rule set EP is polynomially reducible to the CQ EP.

For a formal proof of the reducibility of the ∀∃ rule EP to the CQ EP, we refer the reader to Baget et al. [6], where it is shown that the ∀∃ rule EP is polynomially reducible to the fact (a set of instances) EP, and the fact EP is equivalent to the CQ EP. Also, Cali et al. [21] show that the CQ containment problem, which is equivalent to the ∀∃ rule EP, is reducible to the CQ EP. Since a ∀∃ rule set is a set of ∀∃ rules, by using a series of oracle calls to a function that solves the ∀∃ rule EP, we can define a function for deciding ∀∃ rule set entailment. Hence, the claim holds.

(a) Thanks to the translation functions $\tau$, $\tau_{br}$ defined earlier, such that for any quad-system $QS_C$ and quad-graph $Q'_{C'}$, $QS_C \models Q'_{C'}$ iff $\tau(QS_C) \models_{fol} \tau_{br}(r_{Q'_{C'}})$, we can infer that the quad-graph EP is polynomially reducible to the ∀∃ rule set EP. Applying Claim 1, it follows that the quad-graph EP over quad-systems is polynomially reducible to the CQ EP over ∀∃ rule sets. By Theorem 5, we can deduce that the quad-graph EP is polynomially reducible to the CCQ EP.

(b) By the translation functions $\tau$ and $\tau_{br}$ defined earlier, such that for any quad-system $QS_C$ and set of BRs $R$, $QS_C \models R$ iff $\tau(QS_C) \models_{fol} \tau_{br}(R)$, we can

infer that BRs EP is polynomially reducible to ∀∃ rule set EP. Similar to (a) above, we deduce that BRs EP is polynomially reducible to CCQ EP. From (a) and (b), it follows that Quad-system EP is reducible to CCQ EP.
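The syntactic translations used in this chapter can be sketched as follows, as an illustration of why they are computable in linear time. The representation is an assumption made only for this example: a quad pattern is a tuple (c, s, p, o), an atom is a pair (predicate, argument tuple), a rule is a pair (body, head) whose existential variables are left implicit as the head variables absent from the body, and the padding URI "urn:example:pad" is a hypothetical stand-in for the ad hoc fresh URI used by $\chi$.

```python
PAD = "urn:example:pad"  # hypothetical fresh URI used to pad short atoms


def tau_q(quad_pattern):
    """tau_q(c:(s, p, o)) = c(s, p, o): the context becomes a ternary predicate."""
    c, s, p, o = quad_pattern
    return (c, (s, p, o))


def tau_br(rule):
    """Translate a BR to a forall-existential rule, atom by atom."""
    body, head = rule
    return ([tau_q(q) for q in body], [tau_q(q) for q in head])


def chi(atom, pad=PAD):
    """Pad a unary or binary atom to arity three; ternary atoms are unchanged."""
    pred, args = atom
    return (pred, tuple(args) + (pad,) * (3 - len(args)))


def tau_q_inverse(atom):
    """Inverse of tau_q on purely ternary atoms: the predicate becomes a context."""
    c, (s, p, o) = atom
    return (c, s, p, o)
```

Composing `chi` with `tau_q_inverse` on every atom of a rule or query mirrors the way $\nu$ and $\nu_{cq}$ are obtained in the proof of Theorem 5.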

Having seen that the CCQ EP over quad-systems is polynomially equivalent to the CQ EP over ternary ∀∃ rule sets, we now compare some of the well known techniques used to ensure decidability of CQ entailment in the ∀∃ rules setting to the decidability techniques for quad-systems that we saw in the previous chapters. Note that since all the quad-system classes we proposed in this thesis are FECs, for a judicious comparison, the ∀∃ rule classes to which we compare are classes which have the finite chase property. We compare to the following three well known classes: (i) Weakly Acyclic rule sets (WA), (ii) Jointly Acyclic rule sets (JA), and (iii) Model Faithful Acyclic ∀∃ rule sets (MFA). The following property is well known in the realm of ∀∃ rules:

Property 8. For any ∀∃ rule set $P$, the following holds:
1. If $P \in$ WA, then $P \in$ JA (from [60]),
2. If $P \in$ JA, then $P \in$ MFA (from [32]),
3. WA ⊂ JA ⊂ MFA (from [60] and [32]).

Note that a description of a few other ∀∃ rule classes that do not have the finite chase property, but still enjoy decidability of CQ entailment, is given in the related work.

8.1

Weak Acyclicity

Weak acyclicity [39, 34] is a popular technique used to detect whether a ∀∃ rule set has a finite chase, thus ensuring decidability of query answering. The set WA represents the class of ternary ∀∃ rule sets that have the weak acyclicity property.

For any predicate atom $p(t_1, \ldots, t_n)$, an expression $\langle p, i \rangle$, for $i = 1, \ldots, n$, is called a position of $p$. In the above case, $t_1$ is said to occur at position $\langle p, 1 \rangle$, $t_2$ at $\langle p, 2 \rangle$, and so on. For a set of ∀∃ rules $P$, its dependency graph is a graph whose nodes are positions of predicate atoms in $P$; for each $r \in P$ of the form (2.2), and for any variable $x$ occurring in position $\langle p, i \rangle$ in the head of $r$:
1. if $x$ is universally quantified and $x$ occurs in the body of $r$ at position $\langle p', j \rangle$, then there exists an edge from $\langle p', j \rangle$ to $\langle p, i \rangle$;
2. if $x$ is existentially quantified, then for any universally quantified variable $x'$ occurring in the head of $r$, with $x'$ also occurring in the body of $r$ at position $\langle p', j \rangle$, there exists a special edge from $\langle p', j \rangle$ to $\langle p, i \rangle$.

$P$ is called weakly acyclic iff its dependency graph does not contain cycles going through a special edge. For any ∀∃ rule set $P$, if $P$ is WA, then its chase is finite, and hence the CQ EP is decidable. Note that the nodes in the dependency graph that have incoming special edges correspond to the positions of predicates where new values are created due to existential variables, and the normal edges capture the propagation of constants from one predicate position to another predicate position. In this way, the absence of cycles involving special edges ensures that newly created Skolem blank nodes are not recursively used to create other new Skolem blank nodes in the same position, leading to termination of the chase computation.

Theorem 9. Let $\tau$ be the translation function from the set of unrestricted quad-systems to the set of ternary ∀∃ rule sets, as defined in property 2. Then, for any quad-system $QS_C = \langle Q_C, R \rangle$, the following holds: (i) if $QS_C$ is context acyclic, then $\tau(QS_C)$ is weakly acyclic; the converse may not hold, in general, (ii) if the local semantics of contexts is OWL-Horst or its derivative, i.e. OWL-Horst IRs ⊆ LIR, then $QS_C$ is context acyclic iff $\tau(QS_C)$ is weakly acyclic.

Proof.
(i) We prove the contrapositive, i.e. if τ(QSC) is not weakly acyclic, then QSC is not context acyclic. By construction of the dependency graph, any edge e = (⟨c, i⟩, ⟨c′, j⟩) in the dependency graph induces an edge of the form (c, c′) in the context dependency graph. Moreover, if e is a special edge, then c′ is marked with a ∗ in the context dependency graph, i.e. c′ is a TGC. Suppose the given quad-system QSC is s.t. τ(QSC) is not weakly acyclic. This means that there exists a cycle involving a special edge in the dependency graph of τ(QSC). Then, from the above arguments, it follows that there exists a cycle involving a TGC in the context dependency graph of QSC, which implies that QSC is not context acyclic. In order to show that the converse need not hold, consider the quad-system QSC = ⟨QC, R⟩ mentioned in Example 7 of Chapter 6, whose context dependency graph is shown in Fig. 8.2. Note that QSC is not context acyclic, since its context dependency graph contains a cycle (c2, c3), with c2 and c3 being TGCs. However, it can be seen from Fig. 8.1 that the dependency graph of τ(QSC) does not contain any directed cycle involving special edges. Hence, τ(QSC) is weakly acyclic.

(ii) Suppose that the local semantics of contexts is OWL-Horst or its derivative, i.e. OWL-Horst IRs ⊆ LIR. Now, in the dependency graph, we also need to take into account the edges induced by OWL-Horst inference rules. Table 8.1 lists a few OWL-Horst inference patterns in the first column, and in the second column the corresponding edge/path induced on the dependency graph by the inference pattern. For instance, the inference pattern in the third row of the table, applied to the atom c(s, p, o), derives an additional atom c(p, rdf:type, rdf:Property), in which the constant p at position ⟨c, 2⟩ in the former gets propagated to position ⟨c, 1⟩ in the latter. As indicated by the second column, due to OWL-Horst inferencing, a constant at a position ⟨c, i⟩ of predicate c, i ∈ {1, 2, 3}, can potentially spread to every other position of predicate c in the derived atoms. This means that the dependency graph contains a clique (⟨c, 1⟩, ⟨c, 2⟩, ⟨c, 3⟩), for every c ∈ C. Consequently, the presence of a special edge in the dependency graph from ⟨c, i⟩ to ⟨c′, j⟩ induces a path involving a special edge from ⟨c, k⟩ to ⟨c′, k′⟩, for every k, k′ ∈ {1, 2, 3}. Hence, if a given quad-system QSC is not context acyclic, then by definition its context dependency graph contains a cycle through a TGC. By the above arguments, the dependency graph of τ(QSC) then contains a cycle involving a special edge. This implies that τ(QSC) is not weakly acyclic. The converse follows from (i).

OWL-Horst inference pattern | Induced edge
c(x1, owl:equivalentProperty, z), c(x2, z, x3) → c(x2, x1, x3) | ⟨c, 1⟩ → ⟨c, 2⟩
c(x, rdf:type, rdfs:Class) → c(x, rdfs:subClassOf, x) | ⟨c, 1⟩ → ⟨c, 3⟩
c(z1, x, z2) → c(x, rdf:type, rdf:Property) | ⟨c, 2⟩ → ⟨c, 1⟩
c(z1, x, z2) → c(x, rdf:type, rdf:Property) → c(x, rdfs:subPropertyOf, x) | ⟨c, 2⟩ → ⟨c, 3⟩
c(z1, z2, x) → c(x, rdf:type, rdfs:Resource) | ⟨c, 3⟩ → ⟨c, 1⟩
c(z, rdfs:subPropertyOf, x1), c(x2, z, x3) → c(x2, x1, x3) | ⟨c, 3⟩ → ⟨c, 2⟩

Table 8.1: Edges induced in the dependency graph due to OWL-Horst inferencing

Example 10. Let us revisit the quad-system QSC = ⟨QC, R⟩ mentioned in Example 7 of Chapter 6, whose dependency graph is shown in Fig. 8.1. Note that QSC is uncsafe, since its dChase contains a Skolem blank node _:b4, which has as descendant another Skolem blank node _:b1, with the same origin context c2 (see Fig. 6.1). However, it can be seen from Fig. 8.1 that the dependency graph of τ(QSC) does not contain any directed cycle involving special edges. Hence τ(QSC) is weakly acyclic.
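The weak acyclicity test defined in this section can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: rules are encoded as hypothetical (body, head, existential variables) triples over ternary atoms, all terms are variables, and the encoded rule set is the one given in Example 11 of Section 8.2, which the text there states is not weakly acyclic. The function names are ours and do not come from any existing tool.

```python
# Bridge rules of Example 11 as ternary rules: (body atoms, head atoms, existential variables).
# An atom is (predicate, (t1, t2, t3)); all terms are variables (an assumption of this sketch).
RULES = [
    ([("c1", ("x11", "x12", "z1"))], [("c2", ("x11", "x12", "y1"))], {"y1"}),
    ([("c1", ("x21", "x22", "z2")), ("c2", ("x22", "x21", "x23"))],
     [("c3", ("x21", "x22", "x23"))], set()),
    ([("c3", ("x31", "x32", "x33"))], [("c1", ("x33", "x31", "x32"))], set()),
]


def positions(atoms, var):
    """Positions (p, i), 1-based, at which var occurs in atoms."""
    return {(p, i + 1) for p, args in atoms for i, t in enumerate(args) if t == var}


def dependency_graph(rules):
    """Return (normal edges, special edges) between predicate positions."""
    normal, special = set(), set()
    for body, head, exvars in rules:
        head_vars = {t for _, args in head for t in args}
        for x in head_vars - exvars:            # universally quantified head variables
            for src in positions(body, x):
                normal |= {(src, dst) for dst in positions(head, x)}
                for y in exvars:                # special edges into existential positions
                    special |= {(src, dst) for dst in positions(head, y)}
    return normal, special


def reachable(start, edges):
    """Nodes reachable from start via at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_weakly_acyclic(rules):
    normal, special = dependency_graph(rules)
    edges = normal | special
    # A cycle goes through special edge (u, v) iff u is reachable from v (or u == v).
    return not any(u == v or u in reachable(v, edges) for u, v in special)


print(is_weakly_acyclic(RULES))      # full rule set of Example 11
print(is_weakly_acyclic(RULES[:1]))  # rule r1 alone
```

For the full rule set, the special edge from ⟨c1, 1⟩ to ⟨c2, 3⟩ lies on a cycle that returns through ⟨c3, 3⟩, so the test fails; with r1 alone no position has an outgoing path back to its source, and the test succeeds.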

It turns out that there exists no inclusion relationship between the classes WA and CSAFE in either direction, i.e. WA ⊈ CSAFE (from Example 10), and CSAFE ⊈ WA (from the fact that WA ⊂ JA, and Example 11 below). However, WA ⊂ MSAFE, since WA ⊂ MFA and MFA ≡ MSAFE (Theorem 12).


Figure 8.1: Dependency graph of the quad-system in Example 7 of Chapter 6. Nodes: ⟨c1, 1⟩, ⟨c1, 2⟩, ⟨c2, 1⟩, ⟨c2, 2⟩, ⟨c2, 3⟩, ⟨c3, 2⟩, ⟨c3, 3⟩.

Figure 8.2: Context dependency graph of the quad-system in Example 7 of Chapter 6. Nodes: c1, c2, c3.

8.2 Joint Acyclicity

Joint acyclicity [60] extends weak acyclicity by also taking into consideration the joins between variables in the bodies of ∀∃ rules while analyzing the rules for acyclicity. The set JA represents the class of all ternary ∀∃ rule sets that have the joint acyclicity property. A ∀∃ rule set P is said to be renamed apart if, for any r ≠ r′ ∈ P, V(r) ∩ V(r′) = ∅. Since any set of rules can be converted to an equivalent renamed apart one by simple variable renaming, we assume that any rule set P is renamed apart. Also, for any r ∈ P and for a variable y, let Pos_H^r(y) (resp. Pos_B^r(y)) be the set of positions in which y occurs in the head (resp. body) of r. For any ∀∃ rule set P and an existentially quantified variable y occurring in a rule in P, we define Mov_P(y) as the least set with:
• Pos_H^r(y) ⊆ Mov_P(y), if y occurs in r;
• Pos_H^r(x) ⊆ Mov_P(y), if x is a universally quantified variable and Pos_B^r(x) ⊆ Mov_P(y);

for any r ∈ P. The existential dependency graph of a (renamed apart) set of rules P is a graph whose nodes are the existentially quantified variables in P. There exists an edge from a variable y to y′ if there is a rule r ∈ P in which y′ occurs and there exists a universally quantified variable x in the head (and body) of r such that Pos_B^r(x) ⊆ Mov_P(y). A ∀∃ rule set P is jointly acyclic iff its existential dependency graph is acyclic. Analyzing the containment relationships, we have JA ⊈ CSAFE (since WA ⊂ JA, and Example 10). Also, Example 11 shows that CSAFE ⊈ JA. However, JA ⊂ MSAFE, since JA ⊂ MFA and MFA ≡ MSAFE (Theorem 12).

Example 11. Consider the quad-system QSC = ⟨QC, R⟩, where QC = {c1: (a, b, c)}. Suppose R is the following set:

\[
R = \left\{
\begin{array}{lr}
c_1 : (x_{11}, x_{12}, z_1) \rightarrow c_2 : (x_{11}, x_{12}, y_1) & (r_1) \\
c_1 : (x_{21}, x_{22}, z_2),\ c_2 : (x_{22}, x_{21}, x_{23}) \rightarrow c_3 : (x_{21}, x_{22}, x_{23}) & (r_2) \\
c_3 : (x_{31}, x_{32}, x_{33}) \rightarrow c_1 : (x_{33}, x_{31}, x_{32}) & (r_3)
\end{array}
\right\}
\]

Iterations during the dChase construction are:
dChase_0(QSC) = {c1: (a, b, c)}
dChase_1(QSC) = {c1: (a, b, c), c2: (a, b, _:b1)}
dChase(QSC) = dChase_1(QSC)
Note that the lone Skolem blank node generated is _:b1, which does not have any descendants. Hence, by definition, QSC is csafe (msafe/safe). Now, analyzing the BRs for joint acyclicity, we note that for the only existentially quantified variable y1, Mov_R(y1) = {⟨c2, 3⟩, ⟨c3, 3⟩, ⟨c1, 1⟩, ⟨c2, 1⟩}: the head position ⟨c2, 3⟩ of y1 is propagated by r2 to ⟨c3, 3⟩, by r3 to ⟨c1, 1⟩, and by r1 to ⟨c2, 1⟩. Since the BR r1, in which y1 occurs, contains the universally quantified variable x11 in its head such that Pos_B^{r1}(x11) ⊆ Mov_R(y1), there exists a cycle from y1 to y1 itself in the existential dependency graph of τ(QSC). Hence, by definition, τ(QSC) is not jointly acyclic. Also, since the class of weakly acyclic rule sets is contained in the class of jointly acyclic rule sets, it follows that τ(QSC) is also not weakly acyclic.
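The computation in Example 11 can be replayed mechanically. The following Python sketch, written under the same hypothetical rule encoding as a (body, head, existential variables) triple over ternary atoms, computes Mov_R(y) as a least fixpoint, builds the existential dependency graph, and tests it for cycles. It is an illustration of the definitions above, not an implementation taken from [60].

```python
# Bridge rules r1, r2, r3 of Example 11: (body atoms, head atoms, existential variables).
RULES = [
    ([("c1", ("x11", "x12", "z1"))], [("c2", ("x11", "x12", "y1"))], {"y1"}),
    ([("c1", ("x21", "x22", "z2")), ("c2", ("x22", "x21", "x23"))],
     [("c3", ("x21", "x22", "x23"))], set()),
    ([("c3", ("x31", "x32", "x33"))], [("c1", ("x33", "x31", "x32"))], set()),
]


def positions(atoms, var):
    """Positions (p, i), 1-based, at which var occurs in atoms."""
    return {(p, i + 1) for p, args in atoms for i, t in enumerate(args) if t == var}


def variables(atoms):
    return {t for _, args in atoms for t in args}


def mov(rules, y):
    """Least set Mov(y) closed under the two conditions of the definition."""
    m = set()
    for _, head, exvars in rules:
        if y in exvars:
            m |= positions(head, y)
    changed = True
    while changed:
        changed = False
        for body, head, _ in rules:
            for x in variables(body):           # universally quantified variables
                if positions(body, x) <= m and not positions(head, x) <= m:
                    m |= positions(head, x)
                    changed = True
    return m


def existential_dependency_edges(rules):
    all_exvars = set().union(*(ex for _, _, ex in rules))
    movs = {y: mov(rules, y) for y in all_exvars}
    edges = set()
    for body, head, exvars in rules:
        for x in variables(body) & variables(head):
            for y in all_exvars:
                if positions(body, x) <= movs[y]:
                    edges |= {(y, y2) for y2 in exvars}   # edge y -> y'
    return edges


def reachable(start, edges):
    """Nodes reachable from start via at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_jointly_acyclic(rules):
    edges = existential_dependency_edges(rules)
    return not any(n in reachable(n, edges) for n, _ in edges)


print(sorted(mov(RULES, "y1")))
print(is_jointly_acyclic(RULES))
```

On this rule set the graph has the single node y1 with a self-loop, which is exactly the cycle identified in the example.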

8.3 Model Faithful Acyclicity (MFA)

MFA, proposed in Cuenca Grau et al. [32], is an acyclicity technique that guarantees finiteness of the chase and decidability of query answering in the realm of ∀∃ rules. The set MFA denotes the class of all ternary ∀∃ rule sets that are model faithfully acyclic. As far as we know, the MFA technique subsumes almost all other known techniques that guarantee a finite chase in the ∀∃ rules setting. Obviously, WA ⊂ JA ⊂ MFA. For any ∀∃ rule r = φ(r)(x⃗, z⃗) → ψ(r)(x⃗, y⃗), for each yj ∈ {y⃗}, let Y_r^j be a fresh unary predicate unique for yj and r; furthermore, let S be a fresh binary predicate. The transformation mfa of r is defined as:

\[
\mathrm{mfa}(r) = \phi(r)(\vec{x}, \vec{z}) \rightarrow \psi(r)(\vec{x}, \vec{y}) \wedge \bigwedge_{y_j \in \{\vec{y}\}} \Big[ Y_r^j(y_j) \wedge \bigwedge_{x_k \in \{\vec{x}\}} S(x_k, y_j) \Big]
\]

Also let r1 and r2 be two additional rules defined as:

\[
S(x_1, z) \wedge S(z, x_2) \rightarrow S(x_1, x_2) \qquad (r_1)
\]

\[
Y_r^j(x_1) \wedge S(x_1, x_2) \wedge Y_r^j(x_2) \rightarrow C \qquad (r_2)
\]

where C is a fresh nullary predicate. For any set of ∀∃ rules P, let ad(P) be the union of r1 with the set of rules obtained by instantiating r2, for each r ∈ P, for each existential variable yj in r. For a set of ∀∃ rules P, mfa(P) = ⋃_{r∈P} mfa(r) ∪ ad(P). A ∀∃ rule set P is said to be MFA iff mfa(P) ⊭_fol C. It was shown in Cuenca Grau et al. [32] that if P is MFA, then P has a finite chase, thus ensuring decidability of query answering. The following theorem establishes the fact that the notion of msafety is equivalent to MFA, thanks to the polynomial time translations between quad-systems and ternary ∀∃ rule sets.

Theorem 12. Let τ be the translation function from the set of unrestricted quad-systems to the set of ternary ∀∃ rule sets, as defined in Definition 1, then, for any quad-system QSC = ⟨QC, R⟩, QSC is msafe iff τ(QSC) is MFA.

Proof. (outline) Recall that τ = ⟨τq, τbr⟩, where τq is the quad translation function and τbr is the translation function from BRs to ∀∃ rules. Also, τ(QSC) = τbr({r_QC} ∪ R). Also, recall that for every blank node b in QC, the BR r_QC contains a corresponding existentially quantified variable y_b. We already saw that, for such a transformation, the following property holds: for any m ∈ N, τq(dChase_m(QSC)) = chase_m(τ(QSC)), and for any BR r ∈ R ∪ {r_QC} and assignment µ, applicable_{R∪{r_QC}}(r, µ, dChase_m(QSC)) iff applicable_{τ(QSC)}(τbr(r), µ, chase_m(τ(QSC))). Also notice that for any two blank nodes _:b1, _:b2, S(_:b1, _:b2) ∈ chase(τ(QSC)) iff _:b1 is a descendant of _:b2 with respect to dChase(QSC). Hence, the relations S and descendantOf are identical. Intuitively, MFA looks for cyclic creation of a Skolem blank node whose descendant is another Skolem blank node that is generated by the same rule r = body(r)(x⃗, z⃗) → head(r)(x⃗, y⃗), by the same existential variable yj ∈ {y⃗} of r. Msafety, in contrast, looks only for the generation of a Skolem blank node _:b′ whose descendant is another Skolem blank node _:b generated using the same rule r. Hence, if τ(QSC) is not MFA, then QSC is not msafe, and consequently the only-if part of the theorem trivially holds.

(If part) Suppose QSC is unmsafe, and µ and µ′ are the assignments applied on r ∈ R to create Skolem blank nodes _:b and _:b′, respectively, and suppose _:b is a descendant of _:b′ in dChase(QSC). That is, _:b = µ(yj) and _:b′ = µ′(yk), for yj, yk ∈ {y⃗} of r. If j = k, then the prerequisite of non-MFA is trivially satisfied. If j ≠ k, then there exists _:b″ in dChase(QSC) such that _:b″ = µ′(yj), since µ′ is applied on r and yj ∈ {y⃗}. This means that also in this case, the prerequisite of non-MFA is satisfied. As a consequence, τ(QSC) is not MFA. Hence it follows that QSC is msafe iff τ(QSC) is MFA.
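To make the MFA test of this section concrete, the following Python sketch chases a rule set after applying the mfa transformation and reports whether the nullary atom C is derived. It is a minimal illustration under several assumptions of ours: variables are strings starting with "?", Skolem terms are nested tuples, the initial quads are passed as plain facts rather than through the rule r_QC, the initial quads contain no blank nodes, user predicates do not start with "Y_", and a round bound guards the loop. The rule set named CYCLIC is hypothetical; the other one encodes Example 11.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(atom, fact, subst):
    """Extend subst so that atom equals fact; return None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    s = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if s.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return s


def body_matches(body, facts):
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for f in facts
                  if (s2 := match(atom, f, s)) is not None]
    return substs


def mfa_transform(name, body, head, exvars):
    """mfa(r): add Y_r^j(y_j) and S(x_k, y_j) for each frontier variable x_k."""
    body_vars = {t for a in body for t in a[1:] if is_var(t)}
    head_vars = {t for a in head for t in a[1:] if is_var(t)}
    frontier = sorted(body_vars & head_vars)
    extra = []
    for y in exvars:
        extra.append((f"Y_{name}_{y}", y))
        extra += [("S", x, y) for x in frontier]
    return name, body, head + extra, exvars, frontier


def derives_c(facts):
    """Rule (r2): Y(t1), S(t1, t2), Y(t2) for the same Y predicate yields C."""
    ys = [(f[0], f[1]) for f in facts if f[0].startswith("Y_")]
    return any(p1 == p2 and ("S", t1, t2) in facts for p1, t1 in ys for p2, t2 in ys)


def is_mfa(facts, rules, max_rounds=50):
    facts = set(facts)
    trules = [mfa_transform(*r) for r in rules]
    for _ in range(max_rounds):
        new = set()
        for name, body, head, exvars, frontier in trules:
            for s in body_matches(body, facts):
                for y in exvars:   # Skolem term determined by rule, variable, frontier values
                    s[y] = ("sk", name, y) + tuple(s[x] for x in frontier)
                new |= {(a[0],) + tuple(s.get(t, t) for t in a[1:]) for a in head}
        pairs = {(f[1], f[2]) for f in facts | new if f[0] == "S"}
        new |= {("S", a, d) for a, b in pairs for c, d in pairs if b == c}  # rule (r1)
        if new <= facts:
            return True            # fixpoint reached without deriving C
        facts |= new
        if derives_c(facts):
            return False
    raise RuntimeError("round bound reached before a fixpoint or C")


EX11_FACTS = {("c1", "a", "b", "c")}
EX11_RULES = [
    ("r1", [("c1", "?x11", "?x12", "?z1")], [("c2", "?x11", "?x12", "?y1")], ["?y1"]),
    ("r2", [("c1", "?x21", "?x22", "?z2"), ("c2", "?x22", "?x21", "?x23")],
     [("c3", "?x21", "?x22", "?x23")], []),
    ("r3", [("c3", "?x31", "?x32", "?x33")], [("c1", "?x33", "?x31", "?x32")], []),
]
# Hypothetical rule that keeps feeding its own Skolem term back into its body.
CYCLIC = [("r", [("c1", "?x", "?y", "?z")], [("c1", "?y", "?z", "?w")], ["?w"])]

print(is_mfa(EX11_FACTS, EX11_RULES))
print(is_mfa(EX11_FACTS, CYCLIC))
```

The first call agrees with Theorem 12: the quad-system of Example 11 is msafe, and its translation passes the MFA test even though it is not jointly acyclic. In the second call, the Skolem term created in the second round contains the one created in the first, both by the same existential variable of the same rule, so C is derived.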

Let us revisit the quad-system QSC in Example 10 of Chapter 6; it can be easily seen that τ(QSC) is not MFA. Recall that we have seen that QSC is safe but not msafe. We consider Theorem 12 to be of importance, as it not only establishes the equivalence of MFA and msafety, but, thanks to it and the translation τ, it can be deduced that the technique of safety, which we presented earlier, (strictly) extends the MFA technique. As far as we know, the MFA class of ∀∃ rule sets is one of the most expressive classes in the realm of ∀∃ rule sets that allows a finite chase. Hence, the notion of safety that we propose can straightforwardly be ported to ∀∃ settings. The main difference between MFA and safety is that MFA only looks for cyclic creation of two distinct Skolem blank nodes _:b, _:b′ that are generated by the same rule r, by the same existential variable in r, whereas safety also takes into account the origin vectors a⃗ and a⃗′ used during rule application to create _:b and _:b′, respectively, and only raises an alarm if a⃗ ≅ a⃗′. Although equivalence holds only between quad-systems and ternary ∀∃ rule sets, it can easily be noticed that the technique of safety can be applied to ∀∃ rule sets of arbitrary arity, and can be used to extend currently established tools and systems that work on existing notions of acyclicity such as WA, JA, or MFA.


Chapter 9
Related work

9.1 Contexts and Distributed Logics

Work on contexts gained attention as early as the 80s, when McCarthy [55] proposed context as a solution to the generality problem in AI. McCarthy, in works such as [55, 73], lists a few problems in his past efforts on formalizing contexts, and proposes a general solution: represent contexts as first class objects. After this, various studies on logics of contexts, mainly in the field of KR, were carried out by Guha [84], along with Distributed First Order Logics by Ghidini et al. [42] and Local Model Semantics by Giunchiglia et al. [41]. In his thesis [84], Guha implemented several of the existing ideas of McCarthy, and exemplified, using several realistic examples, how context can be used to solve real life problems. Ghidini’s and Giunchiglia’s ideas were primarily grounded on the “Context as a box” paradigm elaborated in Benerecetti et al. [71]. The “Context as a box” approach proposes the formalization of a context as a theory plus a set of dimension-value pairs for a fixed set of contextual dimensions [36]. Bao et al. [8] extended the theory of McCarthy [74, 55] by providing a more concrete formalization using the built-in predicate isin. A number of constructs were introduced for combining contexts (c1 ∧ c2, c1 ∨ c2 and ¬c) and for relating contexts (c1 ⇒ c2, and c1 → c2). Primarily in the above works, contexts were formalized as a first order/propositional theory, and bridge rules were provided to inter-operate the various theories of contexts.

Some of the initial works on contexts relevant to SW were Distributed Description Logics (DDL) [14] by Borgida et al., Context-OWL [15] by Bouquet et al., and the recent work on CKR [86, 75, 16] by Serafini and Bozzato et al. These were mainly logics based on DLs, which formalized contexts as OWL KBs, whose semantics is given using a distributed interpretation structure with additional semantic conditions that suit varying requirements. DDL and Context-OWL provide a language for extending ontologies in DL/OWL with contexts. Rather than a global/shared approach in which ontologies can externally refer to other ontologies via import statements, the contextualized/localized approach in DDL/Context-OWL is to allow the co-existence of multiple localized OWL/DL theories, called contexts. A limited or controlled form of globalization is possible by virtue of mappings via bridge rules. Mappings are projections of a local domain onto an external domain, and vice versa. The semantics using domain relations r_ij makes it possible to have directional mappings between a pair of contexts ci, cj, i.e. to have mappings from ci to cj that differ from the mappings from cj to ci. Also, Context-OWL/DDL defines the mechanism of a hole, an interpretation in which every concept (resp. role) is mapped to the universal set (resp. relation), in order to prohibit propagation of inconsistencies from one context to another via mappings. Different from DDL and Context-OWL, CKR allows one to formalize the relation of coverage between contexts, which establishes the inclusion relationship of their corresponding domains. Also, in order to refer to concept/role symbols in a foreign context, a concept/role symbol can be qualified with a context identifier. The CKR semantics specifies how the extension of a qualified concept/role in a context inherits the objects of the same from its covered/covering contexts.
Euzenat in [38] describes the Tropes taxonomy building framework, which enables the conceptualization of objects in multiple viewpoints (similar to our contexts); its support for conjunctive bridges allows one to map a set of concepts in viewpoints to a concept in another viewpoint. Harth et al. in [49] describe a detailed architecture of Yars, a semantic repository with a search/query engine that stores RDF data in the form ((s, p, o), c), where (s, p, o) is an RDF triple in context c. The architecture contains modules that include the crawler, index manager and indexer, query processing, and a query distribution module. The indexing module contains a keyword-based indexer and a quad index that is distributed over multiple servers and that also includes a context identifier as an index key. Compared to these works, the bridge rules we consider are much more expressive, with conjunctions and existential variables that support value/blank-node creation. Also, none of the above works focuses on the query answering problem, which is the main focus of this thesis.

9.2 Temporal/Annotated RDF

Studies on extending standard RDF with dimensions such as time and annotations have already been carried out. Gutierrez et al. in [46] add a temporal extension to RDF and define the notion of a ‘temporal RDF graph’, in which a triple is augmented to a quadruple of the form t : (s, p, o), where t is a time point. Also, the authors extend the standard conjunctive graph queries with a temporal query language that supports temporal variables. A semantics is provided for interpreting temporal RDF graphs, from which the notion of temporal entailment of graphs and queries follows. The authors also provide a sound and complete set of inference rules, and show that entailment of temporal graphs does not incur more computational complexity than standard RDF graph entailment. Annotated extensions to RDF and querying annotated graphs have been studied in Udrea et al. [92], Straccia [89], and Zimmermann et al. [93]. Unlike the case of time, here the quadruple has the form a : (s, p, o), where a is an annotation. In Udrea et al. [92], the authors assume that the annotation a in the triple a : (s, p, o) is a member of a strict partial order, whereas Straccia [89] and Zimmermann et al. [93] assume that a is taken from an annotation domain that is an idempotent, commutative semi-ring, with the addition operation, +, being ⊤-annihilating, i.e. x + ⊤ = ⊤, for all x in the annotation domain. The use of such a structure for the annotation domain allows the authors to infer [2000−2002] : (a, rdfs:subClassOf, c) from [1999−2002] : (a, rdfs:subClassOf, b) and [2000−2005] : (b, rdfs:subClassOf, c), whereas Udrea et al. [92] do not support this type of inferencing. The authors provide semantics, inference rules/algorithms and a query language that allows for expressing temporal/annotated queries. While Udrea et al. [92] only support conjunctive queries, Zimmermann et al. [93] support full SPARQL 1.0 and many features of SPARQL 1.1 such as grouping, ordering, nested queries, variable assignments, and built-in predicates on the annotation domain. The authors call their extended query language AnQL. Also, the authors in [93] show how their framework is suited for concrete real world cases, by illustrating how concrete dimensions such as time, fuzziness, and provenance satisfy the properties of an annotation domain, and exemplify the applications of their framework on these domains. The authors also demonstrate the suitability of their framework for RDF statements that have annotations from different domains (for instance, time and fuzziness), and also show how AnQL querying can be done on the combination of annotated and non-annotated data. Although these approaches, in a way, address contexts by means of time and annotations, the main difference in our work is that we provide the means to specify expressive bridge rules for inter-operating the reasoning between the various contexts.
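The interval inference mentioned above can be illustrated with a small Python sketch. It assumes, as our own simplification, that temporal annotations are closed year intervals and that the annotations of the two premises are combined by interval intersection, which reproduces the example in the text; it is not the inference procedure of [89] or [93], and the function names are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"


def meet(i1, i2):
    """Intersection of two closed intervals (lo, hi); None if they do not overlap."""
    lo, hi = max(i1[0], i2[0]), min(i1[1], i2[1])
    return (lo, hi) if lo <= hi else None


def infer_transitive(q1, q2):
    """From i1:(a, sc, b) and i2:(b, sc, c) derive (i1 meet i2):(a, sc, c), if possible."""
    (i1, (a, p1, b1)), (i2, (b2, p2, c)) = q1, q2
    if p1 != SUBCLASS or p2 != SUBCLASS or b1 != b2:
        return None
    interval = meet(i1, i2)
    return None if interval is None else (interval, (a, SUBCLASS, c))


q1 = ((1999, 2002), ("a", SUBCLASS, "b"))
q2 = ((2000, 2005), ("b", SUBCLASS, "c"))
print(infer_transitive(q1, q2))
```

The derived quadruple is annotated with [2000−2002], the period during which both premises hold.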

9.3 Description Logic Rules

Extending DL KBs with Datalog-like rules was studied by Grosof et al. [13]. The authors in [13] propose a fragment of DL, called description horn logic (DHL), contained within the intersection of DLs and logic programs [33]. The authors define a translation mechanism for translating an arbitrary DHL ontology to a function-free positive Horn logic program, and illustrate how common DL reasoning problems, such as instance checking over classes/roles and subsumption checking/satisfiability of classes/roles, can be reduced to atom entailment in logic programming. Related initiatives gave rise to SWRL [54], a formalism with which one can mix a DL ontology with the Unary/Binary Datalog RuleML sublanguages of the Rule Markup Language, and which hence enables Horn-like rules to be combined with an OWL KB. Since SWRL is undecidable in general, studies on computable sub-fragments gave rise to works like Description Logic Rules [62] and its extensions [29], where the authors deal with rules that can be totally internalized by a DL knowledge base; hence, if the DL considered is decidable, then so is a DL+rules KB. The authors give various fragments of the rule bases, like SROIQ rules, EL++ rules etc., and show that certain new constructs that are not expressible in plain DL can be expressed using rules, although they are finally internalized into DL KBs. Unlike in our scenario, these works consider only Horn rules without existential variables.

9.4 ∀∃ rules, Tuple Generating Dependencies, Datalog+/- rules

Query answering over rules with universal-existential quantifiers in the context of databases, where these rules are called Datalog+/- rules or tuple generating dependencies (TGDs), was studied by Beeri and Vardi [12] as early as the 80s; the authors show that the query entailment problem is, in general, undecidable. However, many classes of such rules have recently been identified for which query answering is decidable. These classes (according to [6]) can broadly be divided into the following three categories: (i) bounded treewidth sets (BTS), (ii) finite unification sets (FUS), and (iii) finite extension sets (FES).

BTS contains the classes of ∀∃ rule sets whose models have bounded treewidth. Some of the important classes of these sets are the linear ∀∃ rules [56], (weakly) guarded rules [21], (weakly) frontier guarded rules [6], and jointly frontier guarded rules [60]. BTS classes in general need not have a finite chase, and query answering is done by exploiting the fact that the chase is tree shaped, whose nodes (which are sets of instances) start replicating (up to isomorphism) after a while. Hence, one can stop the computation of the chase once it is certain that any future iterations of the chase can only produce nodes that are isomorphic to existing nodes. A deterministic algorithm for deciding query entailment for this class is provided in Thomazo et al. [91].

FUS classes include the class of ‘sticky’ rules [23, 22], atomic hypothesis rules, in which the body of each rule contains only a single atom, and also the class of linear ∀∃ rules. The approach used for query answering in FUS classes is to rewrite the input query w.r.t. the ∀∃ rule set into another query that can be evaluated directly on the set of instances, s.t. the answers to the former and the latter query coincide. This is called the query rewriting approach. Compared to the approaches proposed in this dissertation, these approaches do not enjoy the finite chase property, and are hence not conducive to materialization/forward chaining based query answering.

Unlike BTS and FUS, the FES classes are characterized by the finite chase property, and hence are most related to the techniques proposed in our work. Some of the classes in this set employ termination-guaranteeing checks called ‘acyclicity tests’ that analyze the information flow between rules to check whether cyclic dependencies exist that can lead to an infinite chase. Weak acyclicity [39, 34] was one of the first such notions, and was extended to joint acyclicity [60] and super weak acyclicity [69]. The main approach used in these techniques is to exploit the structure of the rules, using a dependency graph that models the propagation path of constants across various predicates in the rules, and to restrict the dependency graph to be acyclic. The main drawback of these approaches is that they only analyze the schema/Tbox part of the rule sets and ignore the instance part, and hence produce a large number of false alarms, i.e. it is often the case that, although the dependency graph is cyclic, the chase is finite. Recently, a more dynamic approach, called the MFA technique, which also takes into account the instance part of the rule sets, was proposed in Cuenca Grau et al. [32], where the existence of cyclic Skolem blank-node/constant generation in the chase is detected by augmenting the rules with extra information that keeps track of the Skolem function used to generate each Skolem blank node. As shown in Chapter 8, our technique of safety subsumes the MFA technique, and supports much more expressive rule sets, by also keeping track of the vectors used by rule bodies while Skolem blank nodes are generated.

9.5 Data integration

Query answering on integrated heterogeneous databases with expressive integration rules is, in the realm of data integration, primarily studied in the following two settings: (i) Data exchange [39], in which there is typically a source database and a target database that are connected with existential rules, and (ii) Peer-to-peer data management systems (PDMS) [47], where there is an arbitrary number of peers that are interconnected using existential rules. It can be noted that the peer-to-peer extension of (i) given in works such as [40, 2] has an architecture similar to (ii). The variant of the data exchange problem in the realm of SW, called the P2P RDF Data Exchange setting, as presented in Barceló et al. [10], is a system of RDF graphs interconnected using ∀∃ rules. A user query is typically a conjunctive query on any of the peers. The answer to the query is computed taking into account not only the knowledge in the peer, but also the mappings to the other peers. The approach based on dependency graphs, for instance, is used by Halevy et al. in the context of peer-to-peer data management systems [47], and decidability is attained by not allowing any kind of cycles in the peer topology. In the context of data exchange, WA is used in [39, 34] to assure decidability, and the recent work by Marnette [69] employs super weak acyclicity (SWA) to ensure decidability. It was shown in Cuenca Grau et al. [32] that their MFA technique strictly subsumes both the WA and SWA techniques in expressivity. Since we saw in Chapter 8 that our technique of safety subsumes the MFA technique and allows the representation of much more expressive rule sets, the safety technique can straightforwardly be employed in the above mentioned systems with decidability guarantees for query answering.

9.6 Distributed/Federated SPARQL Querying

Support for SPARQL queries that span multiple graphs/datasets was already provided in SPARQL 1.0 [83] via the FROM, FROM NAMED, and GRAPH keywords. This has been extended to SPARQL querying over federated data sources/graphs by Buil-Aranda et al. [19] and SPARQL 1.1, where the authors introduce multiple constructs for SPARQL queries that span multiple endpoints, and give an extension of the SPARQL algebra for the federated extension. Chekol in his PhD thesis [31] studied the containment of SPARQL 1.1 queries in the presence of constraints expressed in RDFS and OWL-ALCH. Chekol reduces the containment problem of SPARQL queries to the validity problem in µ-calculus by translating both queries and constraints to formulas in µ-calculus. Though these query languages are similar in a way to the CCQs that span multiple contexts, the main difference is the presence of expressive forall-existential BRs in our work, which can potentially cause non-termination. Also, different from the above works, in this thesis we derive novel classes for which CCQ answering is decidable.


Chapter 10
Summary and Conclusion

In this thesis, we study the problem of query answering over contextualized RDF knowledge in the presence of forall-existential bridge rules. We show that the problem, in general, is undecidable, and present a number of decidable subclasses of quad-systems. Table 10.1 displays the complexity results of chase computation and query entailment for the various classes of quad-systems we have derived. Fig. 10.1 graphically portrays the landscape of decidable classes that we derived in this thesis work, along with the already existing classes in the ∀∃ rules paradigm, namely the classes of model faithful acyclic rule sets (MFA), jointly acyclic rule sets (JA), and weakly acyclic rule sets (WA). There is a bidirectional edge between two nodes in the graph if there is an equivalence relation between the classes represented by these nodes. Hence, there is a bidirectional edge between the UNRESTRICTED QUAD-SYSTEMS and TERNARY ∀∃ RULES classes, and also a bidirectional edge between MSAFE and MFA. Also, note that a unidirectional edge/path exists from one node to another if the class represented by the former is contained in the class represented by the latter; if the latter is at a higher altitude, this signifies strict containment.

The class of context acyclic quad-systems does not allow cyclic dependencies involving triple generating contexts. The classes csafe, msafe, and safe ensure decidability by restricting the structure of Skolem blank nodes generated in the dChase. Briefly, the csafe, msafe, and safe classes do not allow an infinite descendant chain for the Skolem blank nodes generated, by constraining each Skolem blank node in a descendant chain to have a different value for certain attributes, whose value sets are finite. RR and restricted RR quad-systems do not allow the generation of Skolem blank nodes, thus constraining the dChase to have only constants from the initial quad-system. The above classes, which suit varying situations, can be used to extend the currently established tools for contextual reasoning to give support for expressive bridge rules with conjunctions and existential quantifiers, with decidability guarantees. From an expressivity point of view, the class of safe quad-systems subsumes all the above classes, and other well known classes in the realm of ∀∃ rules with finite chases. We view the results obtained in this thesis as a general foundation for contextual reasoning and query answering over contextualized RDF knowledge formats such as quads; they can straightforwardly be used to extend existing quad stores.

Quad-System Fragment | Chase size w.r.t. input quad-system | Data complexity of CCQ entailment | Combined complexity of CCQ entailment | Complexity of recognition
Unrestricted Quad-Systems | Infinite | Undecidable | Undecidable | PTIME
Safe Quad-Systems | Double exponential | PTIME-complete | 2EXPTIME-complete | 2EXPTIME
MSafe Quad-Systems | Double exponential | PTIME-complete | 2EXPTIME-complete | 2EXPTIME
CSafe Quad-Systems | Double exponential | PTIME-complete | 2EXPTIME-complete | 2EXPTIME
Context Acyclic Quad-Systems | Double exponential | PTIME-complete | 2EXPTIME-complete | PTIME
RR Quad-Systems | Polynomial | PTIME-complete | EXPTIME | PTIME
Restricted RR Quad-Systems | Polynomial | PTIME-complete | NP-complete | PTIME

Table 10.1: Complexity info for various quad-system fragments

Figure 10.1: Landscape of classes for quad-systems and ternary ∀∃ rules. Axes: combined complexity of CCQ entailment, and dChase size. Classes shown: UNRESTRICTED, TERNARY ∀∃ RULES, SAFE, MSAFE, CSAFE, CACYCLIC, RR, RESTRICTED RR, MFA (Cuenca Grau et al. [32]), JA (Krötzsch et al. [60]), WA (Fagin et al. [39]).

Bibliography [1] Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. AddisonWesley (1995) [2] Arenas, M., Barcel´o, P., Libkin, L., Murlak, F.: Relational and XML Data Exchange. Synthesis Lectures on Data Management, Morgan & Claypool Publishers (2010), http://dx.doi.org/10.2200/ S00297ED1V01Y201008DTM008 [3] Arora, S., Barak, B.: Computational Complexity: A Modern Approach. Cambridge University Press, New York, NY, USA, 1st edn. (2009) [4] Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence IJCAI05. Edinburgh, UK (2005) [5] Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press (2003) [6] Baget, J.F., Lecl`ere, M., Mugnier, M.L., Salvat, E.: On rules with existential variables: Walking the decidability line. Artificial Intelligence 175(910), 1620–1654 (2011) [7] Baget, J.F., Mugnier, M.L., Rudolph, S., Thomazo, M.: Walking the Complexity Lines for Generalized Guarded Existential Rules. In: IJCAI. pp. 712–717 (2011) 153

[8] Bao, J., Tao, J., McGuinness, D.: Context Representation for the Semantic Web. In: Proceedings of the Web Science Conference 2010. Online at http://www.websci10.org/ (2010)
[9] Bao, J., Voutsadakis, G., Slutzki, G., Honavar, V.: Package-based description logics. In: Modular Ontologies, pp. 349–371 (2009)
[10] Barceló, P., Pérez, J., Reutter, J.: Schema Mappings and Data Exchange for Graph Databases. In: Proceedings of the 16th International Conference on Database Theory. pp. 189–200. ICDT ’13, ACM, New York, NY, USA (2013), http://doi.acm.org/10.1145/2448496.2448520
[11] Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., Stein, L.A.: OWL Web Ontology Language Reference. Tech. rep., W3C, http://www.w3.org/TR/owl-ref/ (February 2004)
[12] Beeri, C., Vardi, M.Y.: The Implication Problem for Data Dependencies. In: ICALP. pp. 73–85 (1981)
[13] Grosof, B.N., Horrocks, I., Volz, R., Decker, S.: Description logic programs: Combining logic programs with description logic. In: Hencsey, G., White, B. (eds.) Proceedings of the Twelfth International World Wide Web Conference (WWW). pp. 48–57. ACM (2003)
[14] Borgida, A., Serafini, L.: Distributed Description Logics: Assimilating Information from Peer Sources. J. Data Semantics 1, 153–184 (2003)
[15] Bouquet, P., Giunchiglia, F., van Harmelen, F., Serafini, L., Stuckenschmidt, H.: C-OWL: Contextualizing ontologies. In: ISWC. pp. 164–179 (2003)

[16] Bozzato, L., Eiter, T., Serafini, L.: Defeasibility in Contextual Reasoning with CKR. In: Italian Conference in Computation Logic (CILC). pp. 132–146 (2014)
[17] Bozzato, L., Ghidini, C., Serafini, L.: Comparing contextual and flat representations of knowledge: A concrete case about football data. In: Proceedings of the Seventh International Conference on Knowledge Capture (K-CAP ’13). pp. 9–16. ACM (2013)
[18] Brachman, R.J., Levesque, H.J.: Knowledge Representation and Reasoning. Elsevier - Morgan Kaufmann (2004), http://www.elsevier.com/wps/find/bookdescription.cws_home/702602/description
[19] Buil-Aranda, C., Arenas, M., Corcho, O., Polleres, A.: Federating queries in SPARQL 1.1: Syntax, semantics and evaluation. Web Semant. 18(1), 1–17 (Jan 2013), http://dx.doi.org/10.1016/j.websem.2012.10.001
[20] Calì, A., Gottlob, G., Lukasiewicz, T., Marnette, B., Pieris, A.: Datalog+/-: A Family of Logical Knowledge Representation and Query Languages for New Applications. In: Logic in Computer Science (LICS), 2010 25th Annual IEEE Symposium on. pp. 228–242 (July 2010)
[21] Calì, A., Gottlob, G., Kifer, M.: Taming the Infinite Chase: Query Answering under Expressive Relational Constraints. In: KR. pp. 70–80 (2008)
[22] Calì, A., Gottlob, G., Pieris, A.: Towards more expressive ontology languages: The query answering problem. In: Artificial Intelligence, vol. 93, pp. 87–128. Elsevier (2012)
[23] Calì, A., Gottlob, G., Pieris, A.: Query Answering under Non-guarded Rules in Datalog+/-. In: RR. pp. 1–17 (2010)

[24] Calvanese, D.: Finite Model Reasoning in Description Logics. In: KR. pp. 292–303 (1996)
[25] Calvanese, D., Damaggio, E., De Giacomo, G., Lenzerini, M., Rosati, R.: Semantic data integration in P2P systems. In: Aberer, K., Koubarakis, M., Kalogeraki, V. (eds.) Databases, Information Systems, and Peer-to-Peer Computing, Lecture Notes in Computer Science, vol. 2944, pp. 77–90. Springer Berlin Heidelberg (2004), http://dx.doi.org/10.1007/978-3-540-24629-9_7
[26] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R., Ruzzi, M.: Using OWL in data integration. In: De Virgilio, R., Giunchiglia, F., Tanca, L. (eds.) Semantic Web Information Management – a Model Based Perspective, chap. 17, pp. 397–424. Springer Verlag (2009)
[27] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. J. Autom. Reason. 39(3), 385–429 (Oct 2007), http://dx.doi.org/10.1007/s10817-007-9078-x
[28] Carothers, G.: RDF 1.1 N-Quads. Tech. rep., W3C Recommendation (February 2014), http://www.w3.org/TR/n-quads/
[29] Carral Martínez, D., Hitzler, P.: Extending description logic rules. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 7295, pp. 345–359. Springer Berlin Heidelberg (2012), http://dx.doi.org/10.1007/978-3-642-30284-8_30
[30] Carroll, J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, provenance and trust. In: Proc. of the 14th Int'l. Conf. on WWW. pp. 613–622. ACM, New York, NY, USA (2005)

[31] Chekol, M.W.: Static analysis of semantic web queries. Ph.D. thesis (2012), ftp://ftp.inrialpes.fr/pub/exmo/thesis/thesis-chekol.pdf
[32] Cuenca Grau, B., Horrocks, I., Krötzsch, M., Kupke, C., Magka, D., Motik, B., Wang, Z.: Acyclicity Notions for Existential Rules and Their Application to Query Answering in Ontologies. In: Journal of Artificial Intelligence Research (JAIR), vol. 47. pp. 741–808. AI Access Foundation (2013)
[33] Dantsin, E., Eiter, T., Gottlob, G., Voronkov, A.: Complexity and expressive power of logic programming. Computing Surveys (CSUR) 33(3) (Sep 2001), http://portal.acm.org/citation.cfm?id=502807.502810
[34] Deutsch, A., Tannen, V.: Reformulation of XML Queries and Constraints. In: ICDT. pp. 225–241 (2003)
[35] Deutsch, A., Nash, A., Remmel, J.: The chase revisited. In: Proceedings of the Twenty-seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. pp. 149–158. PODS ’08, ACM, New York, NY, USA (2008), http://doi.acm.org/10.1145/1376916.1376938
[36] Lenat, D.: The Dimensions of Context Space. Tech. rep., CYCorp (1998), published online https://courses.csail.mit.edu/6.803/pdf/lenat2.pdf
[37] Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann (2012)
[38] Euzenat, J.: Brief overview of t-tree: The tropes taxonomy building tool. Proc. 4th ASIS SIG/CR workshop on classification research, Columbus (OH US), (rev. Philip Smith, Clare Beghtol, Raya Fidel, Barbara Kwasnik (eds), Advances in classification research 4, Information today 4(1) (1994), http://journals.lib.washington.edu/index.php/acro/article/view/12612
[39] Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data Exchange: Semantics and Query Answering. In: Theoretical Computer Science. pp. 28(1):89–124 (2005)
[40] Fuxman, A., Kolaitis, P.G., Miller, R.J., Tan, W.C.: Peer Data Exchange. ACM Trans. Database Syst. 31(4), 1454–1498 (Dec 2006), http://doi.acm.org/10.1145/1189769.1189778
[41] Ghidini, C., Giunchiglia, F.: Local Models Semantics, or Contextual Reasoning = Locality + Compatibility. Artificial Intelligence 127 (2001)
[42] Ghidini, C., Serafini, L.: Distributed first order logics. In: Frontiers Of Combining Systems 2, Studies in Logic and Computation. pp. 121–140. Research Studies Press (1998)
[43] Glimm, B., Lutz, C., Horrocks, I., Sattler, U.: Answering conjunctive queries in the SHIQ description logic. In: Proceedings of the IJCAI’07. pp. 299–404. AAAI Press (2007)
[44] Goldreich, O.: Computational Complexity: A Conceptual Perspective. Cambridge University Press, New York, NY, USA, 1st edn. (2008)
[45] Guha, R., McCool, R., Fikes, R.: Contexts for the semantic web. In: ISWC, volume 3298 of Lecture Notes in Computer Science. pp. 32–46. Springer (2004)
[46] Gutierrez, C., Hurtado, C.A., Vaisman, A.A.: Temporal RDF. In: ESWC. pp. 93–107 (2005)

[47] Halevy, A.Y., Ives, Z.G., Suciu, D., Tatarinov, I.: Schema Mediation in Peer Data Management Systems. In: ICDE. pp. 505–516 (2003)
[48] Harrison, M.A.: Introduction to Formal Language Theory. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edn. (1978)
[49] Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: A federated repository for querying graph structured data from the web. In: Proceedings of the ISWC/ASWC-2007 (2007)
[50] Hayes, P. (ed.): RDF Semantics. W3C Recommendation (Feb 2004), http://www.w3.org/TR/rdf-mt/
[51] Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S.: OWL 2 Web Ontology Language Primer. W3C Recommendation, World Wide Web Consortium (October 2009), http://www.w3.org/TR/owl2-primer/
[52] Homola, M., Serafini, L.: Augmenting Subsumption Propagation in distributed description logics. Applied Artificial Intelligence 24, 137–174 (2010)
[53] Horrocks, I., Kutz, O., Sattler, U.: The Even More Irresistible SROIQ. In: Doherty, P., Mylopoulos, J., Welty, C.A. (eds.) KR. pp. 57–67. AAAI Press (2006), http://dblp.uni-trier.de/db/conf/kr/kr2006.html#HorrocksKS06
[54] Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.: SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission, World Wide Web Consortium (2004), http://www.w3.org/Submission/SWRL
[55] McCarthy, J.: Generality in AI. Comm. of the ACM 30(12), 1029–1035 (1987)

[56] Johnson, D.S., Klug, A.C.: Testing Containment of Conjunctive Queries under Functional and Inclusion Dependencies. J. Comput. Syst. Sci. 28(1), 167–189 (1984)
[57] Joseph, M., Kuper, G., Serafini, L.: Query Answering over Contextualized RDF knowledge with Forall-Existential Bridge rules: Attaining Decidability using Acyclicity. In: Italian Conference in Computation Logic (CILC-2014), Turin, Italy. pp. 210–224 (2014)
[58] Joseph, M., Kuper, G., Serafini, L.: Query Answering over Contextualized RDF/OWL knowledge with Forall-Existential Bridge rules: Attaining Decidability using Acyclicity. In: International Conference in Web Reasoning and Rule Systems (RR-2014), Athens, Greece (2014)
[59] Klarman, S., Gutiérrez-Basulto, V.: Two-dimensional description logics for context-based semantic interoperability. In: Proceedings of AAAI-11 (2011)
[60] Krötzsch, M., Rudolph, S.: Extending decidable existential rules by joining acyclicity and guardedness. In: Walsh, T. (ed.) Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI’11). pp. 963–968. AAAI Press/IJCAI (2011)
[61] Krötzsch, M.: The not-so-easy task of computing class subsumptions in OWL RL. In: Proceedings of the 11th International Semantic Web Conference. LNCS, Springer (2012)
[62] Krötzsch, M., Rudolph, S., Hitzler, P.: Description Logic Rules. In: Proceedings of the 18th European Conference on Artificial Intelligence (ECAI’08). pp. 80–84. IOS Press (2008)

[63] Krötzsch, M., Rudolph, S., Hitzler, P.: Complexities of Horn description logics. ACM Trans. Comput. Log. 14(1), 2 (2013), http://doi.acm.org/10.1145/2422085.2422087
[64] Kutz, O., Lutz, C., Wolter, F., Zakharyaschev, M.: E-connections of abstract description systems. Artificial Intelligence 156(1), 1–73 (2004)
[65] Lenzerini, M.: Data integration: A theoretical perspective. In: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. pp. 233–246. PODS ’02, ACM, New York, NY, USA (2002), http://doi.acm.org/10.1145/543613.543644
[66] Leone, N., Manna, M., Terracina, G., Veltri, P.: Efficiently Computable Datalog Programs. International Conference in Knowledge Representation and Reasoning (KR 2012) (2012), http://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4521
[67] Serafini, L., Bouquet, P.: Comparing Formal Theories of Context in AI. Artificial Intelligence 155, 41–67 (2004)
[68] Lutz, C., Toman, D., Wolter, F.: Conjunctive Query Answering in the Description Logic EL using a Relational Database System. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09). AAAI Press (2009)
[69] Marnette, B.: Generalized Schema-Mappings: From Termination to Tractability. In: Proceedings of the Twenty-eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. pp. 13–22. PODS ’09, ACM, New York, NY, USA (2009)
[70] Marnette, B., Geerts, F.: Static analysis of schema-mappings ensuring oblivious termination. In: Segoufin, L. (ed.) ICDT. pp. 183–195. ACM International Conference Proceeding Series, ACM (2010), http://dblp.uni-trier.de/db/conf/icdt/icdt2010.html#MarnetteG10
[71] Benerecetti, M., Bouquet, P., Ghidini, C.: On the Dimensions of Context Dependence. In: Bouquet, P., Serafini, L., Thomason, R.H. (eds.) Perspectives on Contexts, chap. 1, pp. 1–18. CSLI Lecture Notes, Center for the Study of Language and Information/SRI (2007)
[72] Benerecetti, M., Bouquet, P., Ghidini, C.: Contextual Reasoning Distilled. Experimental and Theoretical AI 12(3), 279–305 (2000)
[73] McCarthy, J.: A logical AI approach to context (1996), published online http://www-formal.stanford.edu/jmc/logical/logical.html
[74] McCarthy, J., Buvac, S., Costello, T., Fikes, R., Genesereth, M., Giunchiglia, F.: Formalizing context (Expanded Notes) (1995)
[75] Joseph, M., Serafini, L.: Simple reasoning for contextualized RDF knowledge. In: Proc. of Workshop on Modular Ontologies (WOMO-2011) (2011)
[76] Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language – Profiles. Tech. rep., W3C (2009)
[77] Mylopoulos, J., Borgida, A., Jarke, M., Koubarakis, M.: Telos: Representing knowledge about information systems. Information Systems 8(4), 325–362 (1990), http://www.cs.toronto.edu/~nernst/papers/mylo-telos.pdf
[78] Parsia, B., Grau, B.C.: Generalized Link Properties for Expressive epsilon-Connections of Description Logics. In: Veloso, M.M., Kambhampati, S. (eds.) AAAI. pp. 657–662. AAAI Press / The MIT Press (2005)

[79] Patel-Schneider, P.F., Hayes, P., Horrocks, I.: OWL Web Ontology Language Semantics and Abstract Syntax, Section 5. RDF-Compatible Model-Theoretic Semantics. Tech. rep., W3C (Dec 2004), http://www.w3.org/TR/owl-semantics/rdfs.html#built_in_vocabulary
[80] Patel-Schneider, P.F., Motik, B.: OWL 2 Web Ontology Language: Mapping to RDF Graphs. World Wide Web Consortium, Working Draft WD-owl2-mapping-to-rdf-20081202 (December 2008)
[81] Bouquet, P., Serafini, L., Stoermer, H.: Introducing Context into RDF Knowledge Bases. In: Proceedings of the 2nd Italian Semantic Web Workshop (SWAP-2005). pp. 14–16 (2005)
[82] Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34, 16:1–16:45 (September 2009), http://doi.acm.org/10.1145/1567274.1567278
[83] Prud’hommeaux, E., Seaborne, A.: SPARQL query language for RDF. W3C Recommendation, W3C (Jan 2008), http://www.w3.org/TR/rdf-sparql-query/
[84] Guha, R.: Contexts: a Formalization and some Applications. Ph.D. thesis, Stanford (1992)
[85] Schueler, B., Sizov, S., Staab, S., Tran, D.T.: Querying for meta knowledge. In: WWW ’08: Proceeding of the 17th International Conference on World Wide Web. pp. 625–634. ACM, New York, NY, USA (2008), http://dx.doi.org/10.1145/1367497.1367582
[86] Serafini, L., Homola, M.: Contextualized Knowledge Repositories for the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web (Special Issue on Reasoning with context in the Semantic Web) (2012)
[87] Serafini, L., Tamilin, A.: DRAGO: Distributed Reasoning Architecture for the Semantic Web. In: European Semantic Web Conference (ESWC). pp. 361–376 (2005)
[88] Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A Practical OWL-DL Reasoner. Web Semant. 5(2), 51–53 (Jun 2007), http://dx.doi.org/10.1016/j.websem.2007.03.004
[89] Straccia, U., Lopes, N., Lukacsy, G., Polleres, A.: A general framework for representing and reasoning with annotated semantic web data. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), Special Track on Artificial Intelligence and the Web (July 2010), http://www.polleres.net/publications/stra-etal-2010AAAI.pdf
[90] ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Web Semantics: Science, Services and Agents on the WWW 3(2-3), 79–115 (2005), http://www.sciencedirect.com/science/article/B758F-4H16P4Y-1/2/d039e4784b224e95aafca856ecfb1edb, Selected Papers from the ISWC, 2004
[91] Thomazo, M., Baget, J.F., Mugnier, M.L., Rudolph, S.: A Generic Querying Algorithm for Greedy Sets of Existential Rules. In: KR’12: International Conference on Principles of Knowledge Representation and Reasoning. pp. 96–106. Italy (2012), http://hal-lirmm.ccsd.cnrs.fr/lirmm-00763518

[92] Udrea, O., Recupero, D.R., Subrahmanian, V.S.: Annotated RDF. ACM Transactions on Computational Logic 11(2), 1–41 (2010)
[93] Zimmermann, A., Lopes, N., Polleres, A., Straccia, U.: A general framework for representing, reasoning and querying with annotated semantic web data. Web Semantics 11, 72–95 (Mar 2012), http://dx.doi.org/10.1016/j.websem.2011.08.006


Appendix A

A.1 Appendix of Chapter 2

A.1.1 RDF and RDFS Inference Rules

Table A.1 lists the set of RDF inference rules. RDFS inference rules are composed of the set of RDF inference rules and the set of rules in Table A.2.

1: s p o ⇒ p rdf:type rdf:Property
2: s p o (o is a well-typed XML literal) ⇒ o rdf:type rdf:XMLLiteral

Table A.1: RDF rules
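To make the rule format concrete, the following is a minimal sketch of forward chaining over a handful of the RDFS rules of Table A.2 (rules 2, 3, 5, 7, 9 and 11). It is an illustration only: triples are plain Python tuples of strings, the function name `rdfs_closure` and the `ex:` identifiers in the usage note are hypothetical, and the remaining rules (literals, rdfs:Resource typing, reflexivity) are deliberately left out.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply rules 2, 3, 5, 7, 9 and 11 of Table A.2 until a fixpoint is reached.

    Only terms already present in the input are used, so the loop terminates.
    """
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                if p2 == DOMAIN and s2 == p:                      # rule 2
                    new.add((s, RDF_TYPE, o2))
                if p2 == RANGE and s2 == p:                       # rule 3
                    new.add((o, RDF_TYPE, o2))
                if p == SUBPROP and p2 == SUBPROP and o == s2:    # rule 5
                    new.add((s, SUBPROP, o2))
                if p2 == SUBPROP and s2 == p:                     # rule 7
                    new.add((s, o2, o))
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:  # rule 9
                    new.add((s, RDF_TYPE, o2))
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:  # rule 11
                    new.add((s, SUBCLASS, o2))
        if new <= closure:
            return closure
        closure |= new
```

For instance, from the triples (ex:shields, rdfs:domain, ex:Guard), (ex:Guard, rdfs:subClassOf, ex:Agent) and (ex:a, ex:shields, ex:b), the sketch derives (ex:a, rdf:type, ex:Guard) by rule 2 and then (ex:a, rdf:type, ex:Agent) by rule 9.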

A.1.2 Ontology with only Infinite Models

Example 1. The SROIQ ontology described below (adapted from Calvanese [24]) is an instance of an ontology that does not have a finite model:

$Guard \sqsubseteq \exists shields.Guard \sqcap {\leq} 1\, shields^{-}$   (A.1)

$FirstGuard \sqsubseteq Guard \sqcap {\leq} 0\, shields^{-}$   (A.2)

$FirstGuard(a)$   (A.3)

1: s p o (if o is a plain literal) ⇒ o rdf:type rdfs:Literal
2: p rdfs:domain x & s p o ⇒ s rdf:type x
3: p rdfs:range x & s p o ⇒ o rdf:type x
4a: s p o ⇒ s rdf:type rdfs:Resource
4b: s p o ⇒ o rdf:type rdfs:Resource
5: p rdfs:subPropertyOf q & q rdfs:subPropertyOf r ⇒ p rdfs:subPropertyOf r
6: p rdf:type rdf:Property ⇒ p rdfs:subPropertyOf p
7: s p o & p rdfs:subPropertyOf q ⇒ s q o
8: s rdf:type rdfs:Class ⇒ s rdfs:subClassOf rdfs:Resource
9: s rdf:type x & x rdfs:subClassOf y ⇒ s rdf:type y
10: s rdf:type rdfs:Class ⇒ s rdfs:subClassOf s
11: x rdfs:subClassOf y & y rdfs:subClassOf z ⇒ x rdfs:subClassOf z
12: p rdf:type rdfs:ContainerMembershipProperty ⇒ p rdfs:subPropertyOf rdfs:member
13: o rdf:type rdfs:Datatype ⇒ o rdfs:subClassOf rdfs:Literal

Table A.2: RDFS rules

In the above ontology, statement (A.1) constrains every object of type Guard to have at least one outgoing edge of type shields to an object of type Guard, and at most one incoming edge of type shields. Statement (A.2) constrains objects of type FirstGuard to be also of type Guard, and further restricts them to have no incoming edge of type shields. The model depicted in Figure A.1 satisfies the ontology.

[Figure: an infinite chain a → o2 → o3 → ... of shields edges, where a is of type FirstGuard and Guard, and o2, o3, ... are of type Guard.]

Figure A.1: An infinite model

One can note that the above ontology is only satisfiable in models that have infinite domains, and is hence not finitely satisfiable. The example is a proof of the existence of SROIQ ontologies that have only infinite models.
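The absence of finite models can be checked exhaustively for very small domains. The sketch below is an illustration under stated assumptions, not part of the original development: an interpretation is encoded as a set of Guard elements, a set of FirstGuard elements and a set of shields edges over the domain {0, ..., n-1}, the individual a is fixed to element 0, and the function names `satisfies` and `has_model_of_size` are hypothetical. The brute-force search is only practical for n up to about 3.

```python
from itertools import combinations, product


def satisfies(domain, guard, first_guard, shields, a):
    """Check statements (A.1)-(A.3) in a finite interpretation."""
    incoming = {x: 0 for x in domain}
    for (_, target) in shields:
        incoming[target] += 1
    for x in guard:
        # (A.1): some shields-successor of type Guard, at most one incoming edge
        if not any((x, y) in shields for y in guard):
            return False
        if incoming[x] > 1:
            return False
    for x in first_guard:
        # (A.2): a FirstGuard is a Guard without incoming shields edges
        if x not in guard or incoming[x] > 0:
            return False
    return a in first_guard  # (A.3)


def subsets(items):
    """Enumerate all subsets of a finite collection."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)


def has_model_of_size(n):
    """Brute-force search for a model with domain {0, ..., n-1} and a = 0."""
    domain = range(n)
    pairs = list(product(domain, repeat=2))
    for shields in subsets(pairs):
        for guard in subsets(domain):
            for first_guard in subsets(guard):
                if satisfies(domain, guard, first_guard, shields, a=0):
                    return True
    return False
```

The search fails for every size it can reach, which matches a simple counting argument: in a finite model with m guards, (A.1) forces at least m shields edges into guards, while (A.1) and (A.2) together allow at most m - 1 such edges, since a has none.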

A.2 Appendix of Chapter 4

Proof of Property 1. Note that a strict linear order is a relation that is irreflexive, transitive, and linear.

Irreflexivity: By contradiction, suppose $\prec_q$ is not irreflexive. Then there exists $Q \in \mathcal{Q}$ such that $Q \prec_q Q$ holds. However, neither of the conditions (i) and (ii) of the definition of $\prec_q$ holds for the pair $(Q, Q)$. Hence, due to condition (iii), $Q \not\prec_q Q$, which is a contradiction.

Linearity: Note that for any two distinct $Q, Q' \in \mathcal{Q}$, one of the following holds: (a) $Q \subset Q'$, (b) $Q' \subset Q$, or (c) $Q \setminus Q'$ and $Q' \setminus Q$ are non-empty and disjoint. If (a) is the case, then $Q \prec_q Q'$ holds. Similarly, if (b) is the case, then $Q' \prec_q Q$ holds. Otherwise, if (c) is the case, then by condition (ii), either $Q \prec_q Q'$ or $Q' \prec_q Q$ holds. Hence, $\prec_q$ is a linear order over $\mathcal{Q}$.

Transitivity: Suppose there exist $Q, Q', Q'' \in \mathcal{Q}$ such that $Q \prec_q Q'$ and $Q' \prec_q Q''$. Then one of the following four cases holds: (a) $Q \prec_q Q'$ due to (i) and $Q' \prec_q Q''$ due to (i), (b) $Q \prec_q Q'$ due to (i) and $Q' \prec_q Q''$ due to (ii), (c) $Q \prec_q Q'$ due to (ii) and $Q' \prec_q Q''$ due to (i), (d) $Q \prec_q Q'$ due to (ii) and $Q' \prec_q Q''$ due to (ii).

If (a) is the case, then trivially $Q \subset Q''$, and hence by condition (i), $Q \prec_q Q''$.

If (b) is the case, then either (1) $Q \subset Q''$ or (2) $Q \not\subset Q''$. If (1) is the case, then by (i), $Q \prec_q Q''$. Otherwise, if (2) is the case, then since $Q \subset Q'$, it cannot be the case that $greatestQuad_{\prec_l}(Q'' \setminus Q) \prec_l greatestQuad_{\prec_l}(Q'' \setminus Q')$, and it cannot be the case that $greatestQuad_{\prec_l}(Q' \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q \setminus Q'')$. Hence, it should be the case that $greatestQuad_{\prec_l}(Q'' \setminus Q') \preceq_l greatestQuad_{\prec_l}(Q'' \setminus Q)$ and $greatestQuad_{\prec_l}(Q \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q' \setminus Q'')$. But since $greatestQuad_{\prec_l}(Q' \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q'' \setminus Q')$, it follows that $greatestQuad_{\prec_l}(Q \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q'' \setminus Q)$, and hence by condition (ii), $Q \prec_q Q''$. Hence, if (b) is the case, then in both possible cases (1) and (2), $Q \prec_q Q''$.

If (c) is the case, then similarly to the arguments in (b), by condition (i) or (ii), it can easily be seen that $Q \prec_q Q''$.

If (d) is the case, then it must be the case that $greatestQuad_{\prec_l}(Q \setminus Q') \prec_l greatestQuad_{\prec_l}(Q' \setminus Q)$ (†), and $greatestQuad_{\prec_l}(Q' \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q'' \setminus Q')$ (‡). Suppose by contradiction that $Q'' \prec_q Q$. Then one of the following holds: (1) $Q'' \prec_q Q$ by condition (i), or (2) $Q'' \prec_q Q$ by condition (ii).

Suppose (1) is the case. Then it should be the case that $Q'' \subset Q$. Hence, it should not be the case that $greatestQuad_{\prec_l}(Q \setminus Q') \prec_l greatestQuad_{\prec_l}(Q'' \setminus Q')$, and it should not be the case that $greatestQuad_{\prec_l}(Q' \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q' \setminus Q)$. Hence, it should be the case that $greatestQuad_{\prec_l}(Q'' \setminus Q') \preceq_l greatestQuad_{\prec_l}(Q \setminus Q')$ (♥), and $greatestQuad_{\prec_l}(Q' \setminus Q) \preceq_l greatestQuad_{\prec_l}(Q' \setminus Q'')$ (♠). Applying (‡) in (♥), we get $greatestQuad_{\prec_l}(Q' \setminus Q'') \prec_l greatestQuad_{\prec_l}(Q \setminus Q')$, and applying (†) in (♠), we get $greatestQuad_{\prec_l}(Q \setminus Q') \prec_l greatestQuad_{\prec_l}(Q' \setminus Q'')$, which is a contradiction.

Suppose (2) is the case. Then $greatestQuad_{\prec_l}(Q'' \setminus Q) \prec_l greatestQuad_{\prec_l}(Q \setminus Q'')$. The last statement can also be written as $greatestQuad_{\prec_l}(Q'' \setminus (Q \cap Q'')) \prec_l greatestQuad_{\prec_l}(Q \setminus (Q \cap Q''))$. Using $Q \cap Q' \cap Q'' \subseteq Q \cap Q'$, it follows that $greatestQuad_{\prec_l}(Q'' \setminus (Q \cap Q' \cap Q'')) \preceq_l greatestQuad_{\prec_l}(Q \setminus (Q \cap Q' \cap Q''))$ (♣). Applying a similar transformation to (†) and (‡), we get $greatestQuad_{\prec_l}(Q \setminus (Q \cap Q' \cap Q'')) \preceq_l greatestQuad_{\prec_l}(Q' \setminus (Q \cap Q' \cap Q''))$ and $greatestQuad_{\prec_l}(Q' \setminus (Q \cap Q' \cap Q'')) \preceq_l greatestQuad_{\prec_l}(Q'' \setminus (Q \cap Q' \cap Q''))$, from which it follows that $greatestQuad_{\prec_l}(Q \setminus (Q \cap Q' \cap Q'')) \preceq_l greatestQuad_{\prec_l}(Q'' \setminus (Q \cap Q' \cap Q''))$. Using (♣) in the above, we get $greatestQuad_{\prec_l}(Q \setminus (Q \cap Q' \cap Q'')) = greatestQuad_{\prec_l}(Q' \setminus (Q \cap Q' \cap Q'')) = greatestQuad_{\prec_l}(Q'' \setminus (Q \cap Q' \cap Q''))$, which is a contradiction.

Hence, it should be the case that $Q \prec_q Q''$.
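The property can also be checked mechanically on a small universe of quads. The sketch below reconstructs $\prec_q$ from the way conditions (i)-(iii) are used in the proof above, which is an assumption, since the definition itself is given in Chapter 4: (i) $Q \subset Q'$; (ii) neither set contains the other and the greatest quad of $Q \setminus Q'$ precedes the greatest quad of $Q' \setminus Q$; (iii) otherwise the relation does not hold. Python's built-in tuple ordering stands in for $\prec_l$, and all function names are hypothetical.

```python
from itertools import combinations


def greatest_quad(quads):
    """Greatest quad of a non-empty set; tuple order stands in for the order on quads."""
    return max(quads)


def precedes_q(q1, q2):
    """Assumed reconstruction of the order on sets of quads (conditions (i)-(iii))."""
    q1, q2 = frozenset(q1), frozenset(q2)
    if q1 < q2:                      # condition (i): proper subset
        return True
    if q1 == q2 or q2 < q1:          # condition (iii): no other way to precede
        return False
    # condition (ii): both differences are non-empty here
    return greatest_quad(q1 - q2) < greatest_quad(q2 - q1)


def all_subsets(quads):
    """All subsets of a finite collection of quads, as frozensets."""
    quads = list(quads)
    return [frozenset(c) for k in range(len(quads) + 1)
            for c in combinations(quads, k)]


def is_strict_linear_order(sets):
    """Exhaustively check irreflexivity, linearity and transitivity."""
    for x in sets:
        if precedes_q(x, x):
            return False
        for y in sets:
            if x != y and precedes_q(x, y) == precedes_q(y, x):
                return False
            for z in sets:
                if precedes_q(x, y) and precedes_q(y, z) and not precedes_q(x, z):
                    return False
    return True
```

Under this reading, $Q \prec_q Q'$ holds exactly when the greatest quad of the symmetric difference of $Q$ and $Q'$ belongs to $Q'$, which is the usual lexicographic comparison of characteristic vectors.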

Proof of Theorem 7. We show that the CCQ entailment problem is undecidable for unrestricted quad-systems by showing that the well-known undecidable problem of “non-emptiness of intersection of context-free grammars” is reducible to the CCQ entailment problem.

Given an alphabet $\Sigma$, a string $\vec{w}$ is a sequence of symbols from $\Sigma$. A language $L$ is a subset of $\Sigma^*$, where $\Sigma^*$ is the set of all strings that can be constructed from the alphabet $\Sigma$, including the empty string $\epsilon$. Grammars are machineries that generate a particular language. A grammar $G$ is a quadruple $\langle V, T, S, P \rangle$, where $V$ is the set of variables, $T$ is the set of terminals, $S \in V$ is the start symbol, and $P$ is a set of production rules (PRs), in which each PR $r \in P$ is of the form

$\vec{w} \to \vec{w}'$

where $\vec{w}, \vec{w}' \in \{T \cup V\}^*$. Intuitively, application of a PR $r$ of the form above on a string $\vec{w}_1$ replaces every occurrence of the sequence $\vec{w}$ in $\vec{w}_1$ with $\vec{w}'$. PRs are applied starting from the start symbol $S$ until the result is a string $\vec{w}$ with $\vec{w} \in \Sigma^*$, or no more production rules can be applied on $\vec{w}$. In the former case, we say that $\vec{w} \in L(G)$, the language generated by grammar $G$. For a detailed review of grammars, we refer the reader to Harrison [48]. A context-free grammar (CFG) is a grammar whose set of PRs $P$ has the following property:

Property 2. For a CFG, every PR is of the form $v \to \vec{w}$, where $v \in V$, $\vec{w} \in \{T \cup V\}^*$.

Let $G_1 = \langle V_1, T, S_1, P_1 \rangle$ and $G_2 = \langle V_2, T, S_2, P_2 \rangle$ be two CFGs, where $V_1$, $V_2$, with $V_1 \cap V_2 = \emptyset$, are the sets of variables, and $T$, with $T \cap (V_1 \cup V_2) = \emptyset$, is the set of terminals. $S_1 \in V_1$ is the start symbol of $G_1$, and $P_1$ is the set of PRs of the form $v \to \vec{w}$, where $v \in V_1$ and $\vec{w}$ is a sequence of the form $w_1 \ldots w_n$ with $w_i \in V_1 \cup T$. $S_2$ and $P_2$ are defined similarly. Deciding whether the languages $L(G_1)$ and $L(G_2)$ generated by the grammars have a non-empty intersection is known to be undecidable [48]. Since the above problem can be Turing-reduced to the problem of checking non-emptiness of the intersection of the languages generated by two CFGs $G_1'$ and $G_2'$ such that $\epsilon \notin L(G_1') \cup L(G_2')$, w.l.o.g. we assume that neither $L(G_1)$ nor $L(G_2)$ contains the empty string $\epsilon$.

Given two CFGs $G_1 = \langle V_1, T, S_1, P_1 \rangle$ and $G_2 = \langle V_2, T, S_2, P_2 \rangle$, we encode the grammars $G_1$, $G_2$ into a quad-system of the form $QS_c = \langle Q_c, R \rangle$, with a single context identifier $c$. Each PR $r = v \to \vec{w} \in P_1 \cup P_2$, with $\vec{w} = w_1 w_2 w_3 \ldots w_n$, is encoded as a BR of the form:

$c : (x_1, w_1, x_2), c : (x_2, w_2, x_3), \ldots, c : (x_n, w_n, x_{n+1}) \to c : (x_1, v, x_{n+1})$   (A.4)

where $x_1, \ldots, x_{n+1}$ are variables. W.l.o.g. we assume that the set of terminal symbols $T$ is equal to the set of terminal symbols occurring in $P_1 \cup P_2$. For each terminal symbol $t_i \in T$, $R$ contains a BR of the form:

$c : (x, \texttt{rdf:type}, C) \to \exists y\; c : (x, t_i, y), c : (y, \texttt{rdf:type}, C)$   (A.5)

and $Q_c$ is the singleton with the quad $c : (a, \texttt{rdf:type}, C)$. In the following, we show that:

$QS_c \models \exists y\; c : (a, S_1, y) \wedge c : (a, S_2, y) \leftrightarrow L(G_1) \cap L(G_2) \neq \emptyset$   (A.6)

Claim (1) For any $\vec{w} = t_1 \ldots t_p \in T^*$, there exist $b_1, \ldots, b_p$ such that $c : (a, t_1, b_1), c : (b_1, t_2, b_2), \ldots, c : (b_{p-1}, t_p, b_p), c : (b_p, \texttt{rdf:type}, C) \in dChase(QS_c)$.

We proceed by induction on $|\vec{w}|$.

Base case: Suppose $|\vec{w}| = 1$. Then $\vec{w} = t_i$, for some $t_i \in T$. By construction, $c : (a, \texttt{rdf:type}, C) \in dChase_0(QS_c)$, on which rules of the form (A.5) are applicable. Hence, there exists an $i$ such that $dChase_i(QS_c)$ contains $c : (a, t_i, b_i), c : (b_i, \texttt{rdf:type}, C)$, for each $t_i \in T$. Hence, the base case holds.

Hypothesis: For any $\vec{w} = t_1 \ldots t_p$, if $|\vec{w}| \leq p'$, then there exist $b_1, \ldots, b_p$ such that $c : (a, t_1, b_1), c : (b_1, t_2, b_2), \ldots, c : (b_{p-1}, t_p, b_p), c : (b_p, \texttt{rdf:type}, C) \in dChase(QS_c)$.

Inductive step: Suppose $\vec{w} = t_1 \ldots t_{p+1}$, with $|\vec{w}| \leq p' + 1$. Since $\vec{w}$ can be written as $\vec{w}' t_{p+1}$, where $\vec{w}' = t_1 \ldots t_p$, by hypothesis there exist $b_1, \ldots, b_p$ such that $c : (a, t_1, b_1), c : (b_1, t_2, b_2), \ldots, c : (b_{p-1}, t_p, b_p), c : (b_p, \texttt{rdf:type}, C) \in dChase(QS_c)$. Also, since rules of the form (A.5) are applicable on $c : (b_p, \texttt{rdf:type}, C)$, triples of the form $c : (b_p, t_i, b^i_{p+1}), c : (b^i_{p+1}, \texttt{rdf:type}, C)$ are produced, for each $t_i \in T$. Since $t_{p+1} \in T$, the claim follows.

For a grammar $G = \langle V, T, S, P \rangle$, whose start symbol is $S$, and for any $\vec{w} \in \{V \cup T\}^*$ and $V_j \in V$, we denote by $V_j \to^i \vec{w}$ the fact that $\vec{w}$ was derived from $V_j$ by $i$ production steps, i.e. there exist steps $V_j \to r_1, \ldots, r_i \to \vec{w}$ which lead to the production of $\vec{w}$. For any $\vec{w}$, $\vec{w} \in L(G)$ iff there exists an $i$ such that $S \to^i \vec{w}$. For any $V_j \in V$, we use $V_j \to^* \vec{w}$ to denote the fact that there exists an arbitrary $i$ such that $V_j \to^i \vec{w}$.

Claim (2) For any $\vec{w} = t_1 \ldots t_p \in \{V \cup T\}^*$, and for any $V_j \in V$, if $V_j \to^* \vec{w}$ and there exist $b_1, \ldots, b_{p+1}$ with $c : (b_1, t_1, b_2), \ldots, c : (b_p, t_p, b_{p+1}) \in dChase(QS_c)$, then $c : (b_1, V_j, b_{p+1}) \in dChase(QS_c)$.

We prove this by induction on the size of $\vec{w}$.

Base case: Suppose $|\vec{w}| = 1$. Then $\vec{w} = t_k$, for some $t_k \in T$. Suppose there exist $b_1, b_2$ such that $c : (b_1, t_k, b_2) \in dChase(QS_c)$. Since there exists a PR $V_j \to t_k$, by the transformation given in (A.4), there exists a BR $c : (x_1, t_k, x_2) \to c : (x_1, V_j, x_2) \in R$, which is applicable on $c : (b_1, t_k, b_2)$, and hence the quad $c : (b_1, V_j, b_2) \in dChase(QS_c)$.

Hypothesis: For any $\vec{w} = t_1 \ldots t_p$, with $|\vec{w}| \leq p'$, and for any $V_j \in V$, if $V_j \to^* \vec{w}$ and there exist $b_1, \ldots, b_p, b_{p+1}$ such that $c : (b_1, t_1, b_2), \ldots, c : (b_p, t_p, b_{p+1}) \in dChase(QS_c)$, then $c : (b_1, V_j, b_{p+1}) \in dChase(QS_c)$.

Inductive step: Suppose $\vec{w} = t_1 \ldots t_{p+1}$, with $|\vec{w}| \leq p' + 1$, and $V_j \to^i \vec{w}$, and there exist $b_1, \ldots, b_{p+1}, b_{p+2}$ such that $c : (b_1, t_1, b_2), \ldots, c : (b_{p+1}, t_{p+1}, b_{p+2}) \in dChase(QS_c)$. Then one of the following holds: (i) $i = 1$, or (ii) $i > 1$. If (i) is the case, then it is trivially the case that $c : (b_1, V_j, b_{p+2}) \in dChase(QS_c)$. If (ii) is the case, then one of two sub-cases holds: (a) $V_j \to^{i-1} V_k$, for some $V_k \in V$, and $V_k \to^1 \vec{w}$, or (b) there exists a $V_k \in V$ such that $V_k \to^* t_{q+1} \ldots t_{q+l}$, with $2 \leq l \leq p$, where $V_j \to^* t_1 \ldots t_q V_k t_{p-l+1} \ldots t_{p+1}$. If (a) is the case, then trivially $c : (b_1, V_k, b_{q+2}) \in dChase(QS_c)$, and since by construction there exist $c : (x_0, V_k, x_1) \to c : (x_0, V_{k+1}, x_1), \ldots, c : (x_0, V_{k+i}, x_1) \to c : (x_0, V_j, x_1) \in R$, we get $c : (b_1, V_j, b_{q+2}) \in dChase(QS_c)$. If (b) is the case, then since $|t_{q+1} \ldots t_{q+l}| \geq 2$, $|t_1 \ldots t_q V_k t_{p-l+1} \ldots t_{p+1}| \leq p'$. This implies that $c : (b_1, V_j, b_{p+2}) \in dChase(QS_c)$.

Similarly, by construction of $dChase(QS_c)$, the following claim can straightforwardly be shown to hold:

Claim (3) For any $\vec{w} = t_1 \ldots t_p \in \{V \cup T\}^*$, and for any $V_j \in V$, if there exist $b_1, \ldots, b_p, b_{p+1}$ with $c : (b_1, t_1, b_2), \ldots, c : (b_p, t_p, b_{p+1}) \in dChase(QS_c)$ and $c : (b_1, V_j, b_{p+1}) \in dChase(QS_c)$, then $V_j \to^* \vec{w}$.

(a) For any $\vec{w} = t_1 \ldots t_p \in T^*$, if $\vec{w} \in L(G_1) \cap L(G_2)$, then by Claim 1 there exist $b_1, \ldots, b_p$ such that $c : (a, t_1, b_1), \ldots, c : (b_{p-1}, t_p, b_p) \in dChase(QS_c)$. Since $\vec{w} \in L(G_1)$ and $\vec{w} \in L(G_2)$, we have $S_1 \to^* \vec{w}$ and $S_2 \to^* \vec{w}$, and by Claim 2, $c : (a, S_1, b_p), c : (a, S_2, b_p) \in dChase(QS_c)$. It follows that $dChase(QS_c) \models \exists y\; c : (a, S_1, y) \wedge c : (a, S_2, y)$. Hence, by Theorem 2, $QS_c \models \exists y\; c : (a, S_1, y) \wedge c : (a, S_2, y)$.

(b) Suppose $QS_c \models \exists y\; c : (a, S_1, y) \wedge c : (a, S_2, y)$. Then, applying Theorem 2, it follows that there exists $b_p$ such that $c : (a, S_1, b_p), c : (a, S_2, b_p) \in dChase(QS_c)$. Then it must be the case that there exist $\vec{w} = t_1 \ldots t_p \in T^*$ and $b_1, \ldots, b_p$ such that $c : (a, t_1, b_1), \ldots, c : (b_{p-1}, t_p, b_p), c : (a, S_1, b_p), c : (a, S_2, b_p) \in dChase(QS_c)$. Then by Claim 3, $S_1 \to^* \vec{w}$ and $S_2 \to^* \vec{w}$. Hence, $\vec{w} \in L(G_1) \cap L(G_2)$.

By (a) and (b), it follows that there exists $\vec{w} \in L(G_1) \cap L(G_2)$ iff $QS_c \models \exists y\; c : (a, S_1, y) \wedge c : (a, S_2, y)$. As we have shown that non-emptiness of the intersection of CFGs, which is an undecidable problem, is reducible to the problem of query entailment on unrestricted quad-systems, the latter is undecidable.
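The reduction can be illustrated on a single branch of the chase. The sketch below is a simplified illustration under stated assumptions, not the construction itself: instead of the infinite tree generated by rules of the form (A.5), it materializes only the path of terminal edges spelling one given word (nodes are the integers 0..p, with 0 playing the role of $a$), and then saturates it with path rules in the spirit of (A.4). The context identifier is dropped since there is only one, productions are (variable, list of symbols) pairs with non-empty right-hand sides, and all function names are hypothetical.

```python
def encode_productions(productions):
    """Turn each production v -> w1...wn into a path rule ((w1, ..., wn), v), as in (A.4)."""
    return [(tuple(rhs), lhs) for lhs, rhs in productions]


def saturate(edges, rules):
    """Apply the path rules to labelled edges (x, label, y) until nothing new is derived."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # (x, y) pairs connected by a path spelling the first label of the body
            paths = {(x, y) for (x, label, y) in edges if label == body[0]}
            # extend the partial matches one body label at a time
            for wanted in body[1:]:
                paths = {(x, z) for (x, y) in paths
                         for (y2, label, z) in edges
                         if y2 == y and label == wanted}
            for (x, y) in paths:
                if (x, head, y) not in edges:
                    edges.add((x, head, y))
                    changed = True
    return edges


def derives(productions, start, word):
    """Check whether the start symbol labels an edge spanning the whole word path."""
    path = {(i, t, i + 1) for i, t in enumerate(word)}
    closed = saturate(path, encode_productions(productions))
    return (0, start, len(word)) in closed


def word_in_both(p1, s1, p2, s2, word):
    """Analogue of query (A.6) restricted to the branch of the chase spelling `word`."""
    path = {(i, t, i + 1) for i, t in enumerate(word)}
    rules = encode_productions(p1) + encode_productions(p2)
    closed = saturate(path, rules)
    return (0, s1, len(word)) in closed and (0, s2, len(word)) in closed
```

On a fixed word the saturation always terminates, which is why this is only an illustration of Claims 2 and 3: the undecidability stems from the chase having to explore the branches for all words in $T^*$ at once.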

A.3 Appendix of Chapter 6

Proof of Theorem 19. In the following, we show the case of $dChase^{csafe}(QS_{\mathcal{C}})$, i.e. $unCSafe \in dChase^{csafe}(QS_{\mathcal{C}})$ iff $QS_{\mathcal{C}}$ is uncsafe. The proof follows from Lemma 3 and Lemma 4 below. The proofs for the cases of $dChase^{safe}(QS_{\mathcal{C}})$ and $dChase^{msafe}(QS_{\mathcal{C}})$ are similar, and are omitted.

Lemma 3 (Soundness). For any quad-system $QS_{\mathcal{C}} = \langle Q_{\mathcal{C}}, R \rangle$, if the quad $unCSafe \in dChase^{csafe}(QS_{\mathcal{C}})$, then $QS_{\mathcal{C}}$ is uncsafe.

Proof. Note that $augC(R) = \bigcup_{r \in R} augC(r) \cup \{brTR\}$, where $brTR$ is the range restricted BR $c_c : (x_1, descendantOf, z), c_c : (z, descendantOf, x_2) \to c_c : (x_1, descendantOf, x_2)$. Also, for each $r \in R$, $body(r) = body(augC(r))$, and for any $c \in \mathcal{C}$, $c : (s, p, o) \in head(r)$ iff $c : (s, p, o) \in head(augC(r))$. That is, $head(r) = head(augC(r))(\mathcal{C})$, where $head(r)(\mathcal{C})$ denotes the quad-patterns in $head(r)$ whose context identifiers are in $\mathcal{C}$. Also, $head(augC(r)) = head(augC(r))(\mathcal{C}) \cup head(augC(r))(c_c)$, and the set of existentially quantified variables in $head(augC(r))(c_c)$ is contained in the set of existentially quantified variables in $head(augC(r))(\mathcal{C})$ (†). We first prove the following claim:

Claim (0) For any quad-system $QS_C = \langle Q_C, R \rangle$, let $i$ be a csafe dChase iteration and let $j$ be the number of csafe dChase iterations before $i$ in which $br_{TR}$ was applied. Then $dChase_{i-j}(QS_C) = dChase^{csafe}_{i}(QS_C)(C)$.

We approach the proof of the above claim by induction on $i$.

Base case. If $i = 1$, then $dChase^{csafe}_{0}(QS_C)(c_c) = \emptyset$ and $dChase^{csafe}_{0}(QS_C)(C) = dChase^{csafe}_{0}(QS_C) = dChase_{0}(QS_C)$. Hence, $applicable_{augC(R)}(br_{TR}, \mu, dChase^{csafe}_{0}(QS_C))$ does not hold, for any $\mu$. Hence, $applicable_{R}(r, \mu, dChase_{0}(QS_C))$ iff $applicable_{augC(R)}(augC(r), \mu, dChase^{csafe}_{0}(QS_C))$, for any $r \in R$ and assignment $\mu$. Also using (†), it follows that $dChase_{1}(QS_C) = dChase^{csafe}_{1-0}(QS_C)(C)$.

Hypothesis. For any $i \leq k$, if $i$ is a csafe dChase iteration and $j$ is the number of csafe dChase iterations before $i$ in which $br_{TR}$ was applied, then $dChase_{i-j}(QS_C) = dChase^{csafe}_{i}(QS_C)(C)$.

Step case. Suppose $i = k + 1$. Then one of the following three cases holds: (a) $applicable_{augC(R)}(r, \mu, dChase^{csafe}_{k}(QS_C))$ does not hold for any $r \in augC(R)$ and assignment $\mu$, and $dChase^{csafe}_{k+1}(QS_C) = dChase^{csafe}_{k}(QS_C)$, or (b) $applicable_{augC(R)}(br_{TR}, \mu, dChase^{csafe}_{k}(QS_C))$ holds, for some assignment $\mu$, or (c) $applicable_{augC(R)}(r, \mu, dChase^{csafe}_{k}(QS_C))$ holds, for some $r \in augC(R) \setminus \{br_{TR}\}$ and some assignment $\mu$.

If (a) is the case, then $applicable_{R}(r', \mu, dChase_{k-j}(QS_C))$ does not hold, for any $r' \in R$ and assignment $\mu$. As a result $dChase_{k+1-j}(QS_C) = dChase_{k-j}(QS_C)$, and hence $dChase_{k+1-j}(QS_C) = dChase^{csafe}_{k+1}(QS_C)(C)$.

If (b) is the case, then since $dChase^{csafe}_{k+1}(QS_C)(C) = dChase^{csafe}_{k}(QS_C)(C)$, we get $dChase^{csafe}_{k+1}(QS_C)(C) = dChase_{k+1-j-1}(QS_C) = dChase_{k-j}(QS_C)$.

If (c) is the case, then $applicable_{R}(r', \mu, dChase_{k-j}(QS_C))$ should hold, where $r = augC(r')$ and $head(r)(C) = head(r')$. Hence, $dChase^{csafe}_{k+1}(QS_C)(C) = dChase_{k+1-j}(QS_C)$.

The following claim, which straightforwardly follows from Claim 0, shows that any quad $c : (s, p, o)$, with $c \in C$, derived in the csafe dChase is also derived in the standard dChase. In this way, the csafe dChase does not generate any unsound triples in any context $c \in C$.

Claim (1) For any quad $c : (s, p, o)$, with $c \in C$, if $c : (s, p, o) \in dChase^{csafe}(QS_C)$, then $c : (s, p, o) \in dChase(QS_C)$.

The following claim shows that the set of origin context quads is also sound.

Claim (2) If there exists a quad $c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)$, then $c \in originContexts(b)$.

If $c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)$, there exists $i \in \mathbb{N}$ such that $c_c : (b, originContext, c) \in dChase^{csafe}_{i}(QS_C)$ and there exists no $j < i$ with $c_c : (b, originContext, c) \in dChase^{csafe}_{j}(QS_C)$. But $c_c : (b, originContext, c) \in dChase^{csafe}_{i}(QS_C)$ implies that there exists an $augC(r) = body(\vec{x}, \vec{z}) \rightarrow head(\vec{x}, \vec{y}) \in augC(R)$, with $c_c : (y_j, originContext, c) \in head(\vec{x}, \vec{y})$, $y_j \in \{\vec{y}\}$, such that $c_c : (b, originContext, c)$ was generated due to the application of an assignment $\mu$ on $augC(r)$, with $b = y_j[\mu^{ext(\vec{y})}]$. This implies that there exists $c : (s, p, o) \in head(\vec{x}, \vec{y})$, with $s = y_j$ or $p = y_j$ or $o = y_j$, and $c \in C$. Since, according to our assumption, $i$ is the first iteration in which $c_c : (b, originContext, c)$ is generated, it follows that $i$ is the first iteration in which $c : (s, p, o)[\mu^{ext(\vec{y})}]$ is also generated. Let $k$ be the number of iterations before $i$ in which $br_{TR}$ was applied. By applying Claim 0, $c : (s, p, o)[\mu^{ext(\vec{y})}] \in dChase_{i-k}(QS_C)$, and $i - k$ should be the first such dChase iteration. Hence, $c \in originContexts(b)$.

In the following claim, we prove the soundness of the descendant quads generated in a csafe dChase.

Claim (3) For any two distinct blank nodes $b, b'$ in $dChase^{csafe}(QS_C)$, if $c_c : (b', descendantOf, b) \in dChase^{csafe}(QS_C)$, then $b'$ is a descendant of $b$.

Since any quad of the form $c_c : (b', descendantOf, b) \in dChase^{csafe}(QS_C)$ is not an element of $Q_C$, and can only be introduced by an application of a BR $r \in augC(R)$, any quad of this form can be introduced, at the earliest, in the first iteration of $dChase^{csafe}(QS_C)$. Suppose $c_c : (b', descendantOf, b) \in dChase^{csafe}(QS_C)$. Then there exists an iteration $i \geq 1$ such that $c_c : (b', descendantOf, b) \in dChase^{csafe}_{j}(QS_C)$, for any $j \geq i$, and $c_c : (b', descendantOf, b) \notin dChase^{csafe}_{j'}(QS_C)$, for any $j' < i$. We apply induction on $i$ for the proof.

Base case. Suppose $c_c : (b', descendantOf, b) \in dChase^{csafe}_{1}(QS_C)$. Since $b \neq b'$, there exist a BR $r \in augC(R)$ and an assignment $\mu$ such that $applicable_{augC(R)}(r, \mu, dChase^{csafe}_{0}(QS_C))$, i.e. $body(r)(\vec{x}, \vec{z})[\mu] \subseteq dChase^{csafe}_{0}(QS_C)$ and $c_c : (b', descendantOf, b) \in head(r)(\vec{x}, \vec{y})[\mu^{ext(\vec{y})}]$. Then by construction of $augC(r)$, it follows that $b = y_j[\mu^{ext(\vec{y})}]$, for some $y_j \in \{\vec{y}\}$, and $b' = \mu(x_i)$, for some $x_i \in \{\vec{x}\}$. Since $dChase_{0}(QS_C) = dChase^{csafe}_{0}(QS_C)$, it follows using (†) that $applicable_{R}(r', \mu, dChase_{0}(QS_C))$ holds, for $r' = body(r')(\vec{x}, \vec{z}) \rightarrow head(r')(\vec{x}, \vec{y})$, with $augC(r') = r$. Hence, by construction, $b = y_j[\mu^{ext(\vec{y})}] \in C(dChase_{1}(QS_C))$, for $y_j \in \{\vec{y}\}$, and $b' = \mu(x_i)$, for $x_i \in \{\vec{x}\}$. Hence $b'$ is a descendant of $b$ (by definition).

Hypothesis. If $c_c : (b', descendantOf, b) \in dChase^{csafe}_{i}(QS_C)$, for $1 \leq i \leq k$, then $b'$ is a descendant of $b$.

Inductive step. Suppose $c_c : (b', descendantOf, b) \in dChase^{csafe}_{k+1}(QS_C)$. Then either (i) $c_c : (b', descendantOf, b) \in dChase^{csafe}_{k}(QS_C)$ or (ii) $c_c : (b', descendantOf, b) \notin dChase^{csafe}_{k}(QS_C)$. If (i) is the case, then by hypothesis $b'$ is a descendant of $b$. If (ii) is the case, then either (a) $c_c : (b', descendantOf, b)$ is the result of the application of $br_{TR} \in augC(R)$ on $dChase^{csafe}_{k}(QS_C)$, or (b) $c_c : (b', descendantOf, b)$ is the result of the application of some $r \in augC(R) \setminus \{br_{TR}\}$ on $dChase^{csafe}_{k}(QS_C)$. If (a) is the case, then there exists a $b'' \in C(dChase^{csafe}_{k}(QS_C))$ such that $c_c : (b', descendantOf, b'') \in dChase^{csafe}_{k}(QS_C)$ and $c_c : (b'', descendantOf, b) \in dChase^{csafe}_{k}(QS_C)$. Hence, by hypothesis, $b'$ is a descendant of $b''$ and $b''$ is a descendant of $b$. Since the descendant relation is transitive, $b'$ is a descendant of $b$. Otherwise, if (b) is the case, then by arguments similar to those used in the base case, $b'$ is a descendant of $b$.

Suppose the quad $unCSafe \in dChase^{csafe}(QS_C)$. This implies that there exists an iteration $i$ such that the function unCSafeTest on $augC(r)$, with $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in R$, assignment $\mu$, and $dChase^{csafe}_{i}(QS_C)$ returns True. This implies that there exist $b, b' \in B$ and $y_j \in \{\vec{y}\}$ such that $body(r)(\vec{x}, \vec{z})[\mu] \subseteq dChase^{csafe}_{i}(QS_C)$, $b \in \{\mu(\vec{x})\}$, $c_c : (b', descendantOf, b) \in dChase^{csafe}_{i}(QS_C)$ and $\{c \mid c_c : (b', originContext, c) \in dChase^{csafe}_{i}(QS_C)\} = cScope(y_j, head(r)(\vec{x}, \vec{y}))$. Let $k$ be the number of csafe dChase iterations before $i$ in which $br_{TR}$ was applied. By Claim 0, $dChase_{i-k-1}(QS_C) = dChase^{csafe}_{i-1}(QS_C)(C)$, and consequently $applicable_{R}(r, \mu, dChase_{i-k-1}(QS_C))$ holds. Hence, as a result of $\mu$ being applied on $r$, there exists $b'' = y_j[\mu^{ext(\vec{y})}] \in B(dChase_{i-k}(QS_C))$, with $b \in \{\mu(\vec{x})\}$. Hence, by definition, $originContexts(b'') = cScope(y_j, head(r))$, and $b$ is a descendant of $b''$. If $b \neq b'$, then by Claim 3, $b'$ is a descendant of $b$; otherwise $b' = b$. In both cases, $b'$ is a descendant of $b''$. Also, applying Claim 2, we get that $originContexts(b') = originContexts(b'')$, which means that the prerequisites of uncsafety are satisfied, and hence $QS_C$ is uncsafe.

Lemma 4 (Completeness). For any quad-system $QS_C = \langle Q_C, R \rangle$, if $QS_C$ is uncsafe, then $unCSafe \in dChase^{csafe}(QS_C)$.

Proof. We first prove a few supporting claims in order to prove the lemma.
Claim (0) For any quad-system $QS_C = \langle Q_C, R \rangle$, suppose $unCSafe \notin dChase^{csafe}(QS_C)$. Then for any dChase iteration $i$, there exists a $j \geq 0$ such that $dChase_{i}(QS_C) = dChase^{csafe}_{i+j}(QS_C)(C)$.

We approach the proof by induction on $i$.

Base case. For $i = 0$, we know that $dChase_{0}(QS_C) = dChase^{csafe}_{0}(QS_C) = Q_C$. Hence, the base case trivially holds.

Hypothesis. For $i \leq k \in \mathbb{N}$, there exists $j \geq 0$ such that $dChase_{i}(QS_C) = dChase^{csafe}_{i+j}(QS_C)(C)$.

Step case. For $i = k + 1$, one of the following holds: (a) $dChase_{k+1}(QS_C) = dChase_{k}(QS_C)$, or (b) $dChase_{k+1}(QS_C) = dChase_{k}(QS_C) \cup head(r)(\vec{x}, \vec{y})[\mu^{ext(\vec{y})}]$ and $applicable_{R}(r, \mu, dChase_{k}(QS_C))$ holds, for some $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$ and assignment $\mu$. If (a) is the case, then the claim trivially holds. Otherwise, if (b) is the case, let $j \in \mathbb{N}$ be such that $dChase_{k}(QS_C) = dChase^{csafe}_{k+j}(QS_C)(C)$. Let $j' \geq j$, $l \in \mathbb{N}$ be such that $applicable_{augC(R)}(br_{TR}, \mu, dChase^{csafe}_{k+l}(QS_C))$ holds, for any $j' \geq l \geq j$, and $applicable_{augC(R)}(br_{TR}, \mu, dChase^{csafe}_{k+j'+1}(QS_C))$ does not hold. By construction, $applicable(r', \mu, dChase^{csafe}_{k+j'+1}(QS_C))$ holds, where $r' = augC(r)$. Also, since no new Skolem blank node was introduced in any csafe dChase iteration $k + l$, for any $j \leq l \leq j'$, it should be the case that $head(r)[\mu^{ext(\vec{y})}] = head(r')[\mu^{ext(\vec{y})}](C)$. Since $dChase^{csafe}_{k+l}(QS_C)(C) = dChase_{k}(QS_C)$, for any $j \leq l \leq j'$, and $dChase^{csafe}_{k+j'+1}(QS_C) = dChase^{csafe}_{k+j'}(QS_C) \cup head(r')[\mu^{ext(\vec{y})}]$, we get $dChase^{csafe}_{k+j'+1}(QS_C)(C) = dChase_{k+1}(QS_C)$. Hence, the claim follows.

The following claim, which straightforwardly follows from Claim 0, shows that, for a csafe quad-system, its standard dChase is contained in its csafe dChase.

Claim (1) Suppose $unCSafe \notin dChase^{csafe}(QS_C)$. Then $dChase(QS_C) \subseteq dChase^{csafe}(QS_C)$.

The claim below shows that the generation of originContext quads in the csafe dChase is complete.

Claim (2) For any quad-system $QS_C$, if $unCSafe \notin dChase^{csafe}(QS_C)$, then for any Skolem blank node $b$ generated in $dChase(QS_C)$, and for any $c \in C$, if $c \in originContexts(b)$, then there exists a quad $c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)$.

The only way a Skolem blank node $b$ gets generated in an iteration $i$ of $dChase(QS_C)$ is by the application of a BR $r \in R$, i.e. when there exist $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in R$ and an assignment $\mu$ such that $applicable_{R}(r, \mu, dChase_{i-1}(QS_C))$, $b = y_j[\mu^{ext(\vec{y})}]$, for some $y_j \in \{\vec{y}\}$, and $dChase_{i}(QS_C) = dChase_{i-1}(QS_C) \cup head(r)(\vec{x}, \vec{y})[\mu^{ext(\vec{y})}]$. Also, since $c \in originContexts(b)$, it should be the case that $c \in cScope(y_j, head(r))$. From Claim 0, we know that there exists $j \geq 0$ such that $dChase_{i}(QS_C) = dChase^{csafe}_{i+j}(QS_C)(C)$. W.l.o.g., assume that $i + j$ is the first such csafe dChase iteration. Hence, it follows that $applicable_{augC(R)}(r', \mu, dChase^{csafe}_{i+j-1}(QS_C))$, where $r' = augC(r)$. Since $head(r) \subseteq head(r')$, it should be the case that $c \in cScope(y_j, head(r'))$. Hence, by construction of $augC$, $c_c : (y_j, originContext, c) \in head(r')$, and as a result of the application of $\mu$ on $r'$ in iteration $i + j$, $c_c : (b, originContext, c)$ gets generated in $dChase^{csafe}_{i+j}(QS_C)$. Hence, the claim holds.

For the claim below, we introduce the concept of sub-distance.

Definition 5. For any two blank nodes $b, b'$, sub-distance$(b, b')$ is defined inductively as:
• sub-distance$(b, b') = 0$, if $b' = b$;
• sub-distance$(b, b') = \infty$, if $b \neq b'$ and $b$ is not a descendant of $b'$;
• sub-distance$(b, b') = \min_{t \in \{\vec{x}[\mu]\}} \{$sub-distance$(b, t)\} + 1$, if $b'$ was generated by the application of $\mu$ on $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$, i.e. $b' = y_j[\mu^{ext(\vec{y})}]$, for some $y_j \in \{\vec{y}\}$, and $b$ is a descendant of $b'$.
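Definition 5 can be read as a recursive function. The sketch below is a minimal illustration under one assumption about the data layout: each Skolem blank node is mapped to the set of terms assigned to the universally quantified variables of the rule application that generated it (the set $\{\vec{x}[\mu]\}$). The dictionary `frontier_of` and the function name `sub_distance` are hypothetical. Nodes without an entry are treated as not generated by any rule application.

```python
import math


def sub_distance(b, b_prime, frontier_of):
    """Sub-distance from b to b_prime in the sense of Definition 5.

    frontier_of maps a generated blank node to the set of terms that were
    assigned to the frontier variables when the node was created.
    """
    if b == b_prime:
        return 0
    parents = frontier_of.get(b_prime)
    if not parents:
        # b_prime was not generated by a rule application, so b cannot be
        # one of its descendants.
        return math.inf
    # If b is not a descendant of b_prime, every recursive call yields
    # infinity, and infinity + 1 is still infinity.
    return min(sub_distance(b, t, frontier_of) for t in parents) + 1
```

For example, with `frontier_of = {"b2": {"b1", "k"}, "b3": {"b2"}, "b4": {"b3", "b1"}}`, the node `b1` has sub-distance 2 to `b3` and sub-distance 1 to `b4`, since `b1` occurs directly in the frontier of `b4`. The recursion terminates because a blank node can only be generated from terms that already exist.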

Claim (3) For any quad-system $QS_C = \langle Q_C, R \rangle$, if $unCSafe \notin dChase^{csafe}(QS_C)$, then for any two Skolem blank nodes $b, b'$ in $dChase(QS_C)$, if $b$ is a descendant of $b'$, then there must be a quad of the form $c_c : (b, descendantOf, b') \in dChase^{csafe}(QS_C)$.

Note by the definition of sub-distance that if $b$ is a descendant of $b'$, then sub-distance$(b, b') \in \mathbb{N}$. Assuming $unCSafe \notin dChase^{csafe}(QS_C)$ and that $b$ is a descendant of $b'$, we approach the proof by induction on sub-distance$(b, b')$.

Base case. Suppose sub-distance$(b, b') = 1$. This implies that there exist $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y})$ and an assignment $\mu$ such that $b'$ was generated due to the application of $\mu$ on $r$, i.e. $b' = y_j[\mu^{ext(\vec{y})}]$, for some $y_j \in \{\vec{y}\}$, and $b \in \{\vec{x}[\mu]\}$. This implies that there exists a dChase iteration $i$ such that $applicable_{R}(r, \mu, dChase_{i}(QS_C))$ and $dChase_{i+1}(QS_C) = dChase_{i}(QS_C) \cup apply(r, \mu)$. Since $unCSafe \notin dChase^{csafe}(QS_C)$, using Claim 0, there exists $k \geq i$ such that $dChase_{i}(QS_C) = dChase^{csafe}_{k}(QS_C)(C)$. W.l.o.g., let $k$ be the first such csafe dChase iteration. This means that $applicable_{augC(R)}(r', \mu, dChase^{csafe}_{k}(QS_C))$, where $r' = augC(r)$, and $dChase^{csafe}_{k+1}(QS_C) = dChase^{csafe}_{k}(QS_C) \cup head(r')[\mu^{ext(\vec{y})}]$, with blank nodes $b, b' \in head(r')[\mu^{ext(\vec{y})}]$, $b \in \{\vec{x}[\mu]\}$, $b' = y_j[\mu^{ext(\vec{y})}]$. By construction of $augC()$, since there exists a quad-pattern $c_c : (x_l, descendantOf, y_j) \in head(r')$, for any $x_l \in \{\vec{x}\}$, $y_j \in \{\vec{y}\}$, it follows that $c_c : (b, descendantOf, b') \in dChase^{csafe}_{k+1}(QS_C)$.

Hypothesis. Suppose sub-distance$(b, b') \leq k$, $k \in \mathbb{N}$. Then $c_c : (b, descendantOf, b') \in dChase^{csafe}(QS_C)$.

Inductive step. Suppose sub-distance$(b, b') = k + 1$. Then there exist a $b'' \neq b$, an assignment $\mu$, and a BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in R$ such that $b'$ was generated due to the application of $\mu$ on $r$ with $b'' \in \{\vec{x}[\mu]\}$, i.e. $b' = y_j[\mu^{ext(\vec{y})}]$, for $y_j \in \{\vec{y}\}$, and $b$ is a descendant of $b''$.
This implies that sub-distance$(b'', b') = 1$ and sub-distance$(b, b'') = k$, and hence by hypothesis $c_c : (b, descendantOf, b'') \in dChase^{csafe}(QS_C)$ and $c_c : (b'', descendantOf, b') \in dChase^{csafe}(QS_C)$. Hence, by construction of the csafe dChase, $c_c : (b, descendantOf, b') \in dChase^{csafe}(QS_C)$.

Suppose $QS_C$ is uncsafe. Then by definition, there exist blank nodes $b, b'$ in $B_{sk}(dChase(QS_C))$ such that $b$ is a descendant of $b'$ and $originContexts(b) = originContexts(b')$. For contradiction, suppose $unCSafe \notin dChase^{csafe}(QS_C)$. Then by Claim 1, $dChase(QS_C) \subseteq dChase^{csafe}(QS_C)$. By Claim 2, for any $c \in originContexts(b)$, there exists a quad of the form $c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)$, and for every $c' \in originContexts(b')$, there exists $c_c : (b', originContext, c') \in dChase^{csafe}(QS_C)$. Since $originContexts(b) = originContexts(b')$, it follows that the sets $\{c \mid c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)\}$ and $\{c' \mid c_c : (b', originContext, c') \in dChase^{csafe}(QS_C)\}$ are equal. Also by Claim 3, since $b$ is a descendant of $b'$, there exists a quad of the form $c_c : (b, descendantOf, b')$ in $dChase^{csafe}(QS_C)$. But, by construction of $dChase^{csafe}(QS_C)$, there must exist a blank node $b'' \in B_{sk}(dChase^{csafe}(QS_C))$, a BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in augC(R)$ and an assignment $\mu$ such that $b'$ was generated due to the application of $\mu$ on $r$, i.e. $b' = y_j[\mu^{ext(\vec{y})}]$ with $b'' \in \{\vec{x}[\mu]\}$, and $c_c : (b, descendantOf, b'') \in dChase^{csafe}(QS_C)$. But since $\{c \mid c_c : (b, originContext, c) \in dChase^{csafe}(QS_C)\} = cScope(y_j, head(r))$, the method unCSafeTest$(r, \mu, dChase^{csafe}_{l}(QS_C))$ should return True, for some $l \in \mathbb{N}$. Consequently, $unCSafe \in dChase^{csafe}(QS_C)$, which contradicts our assumption. Hence $unCSafe \in dChase^{csafe}(QS_C)$, if $QS_C$ is uncsafe.

Property 21. (Only If) By definition, $R$ is universally safe (resp. msafe, resp. csafe) iff $\langle Q_C, R \rangle$ is safe (resp. msafe, resp. csafe), for any quad-graph $Q_C$. Hence, $\langle Q^{crit}_C, R \rangle$ is safe (resp. msafe, resp. csafe).
(If part) We give the proof for the case of safe quad-systems. The proofs for the msafe and csafe cases can be obtained by slight modification. In order to show that if $\langle Q^{crit}_C, R \rangle$ is safe, then $R$ is universally safe, we prove the contrapositive. That is, we show that if there exists $Q_C$ such that $\langle Q_C, R \rangle$ is unsafe, then $QS^{crit}_C = \langle Q^{crit}_C, R \rangle$ is unsafe. Suppose there exists such an unsafe quad-system $QS_C = \langle Q_C, R \rangle$. We show how to incrementally construct a homomorphism $h$ from the constants in $dChase(QS_C)$ to the constants in $dChase(QS^{crit}_C)$ such that for any Skolem blank node $\_{:}b$ in $dChase(QS_C)$, there exists a homomorphism from the descendance graph of $\_{:}b$ to the descendance graph of $h(\_{:}b)$ in $dChase(QS^{crit}_C)$.

Suppose $h$ is initialized as: for any constant $c \in C(QS_C)$, $h(c) = \_{:}b^{crit}$, if $c \in C(QS_C) \setminus C(QS^{crit}_C)$; and $h(c) = c$ otherwise. It can be noted that for any BR $r = body(r)(\vec{x}, \vec{z}) \rightarrow head(r)(\vec{x}, \vec{y}) \in R$, if $body(r)[\mu] \subseteq dChase_{0}(QS_C)$, then $body(r)[\mu][h] \subseteq dChase_{0}(QS^{crit}_C)$. Now it follows that for any $i \in \mathbb{N}$ with $level(body(r)[\mu]) = 0$, if $applicable(r, \mu, dChase_{i}(QS_C))$, then there exists $j \leq i$ such that $applicable(r, h \circ \mu, dChase_{j}(QS^{crit}_C))$. Let $h$ be extended so that for any $i \in \mathbb{N}$ and any Skolem blank node $\_{:}b$ introduced in $dChase_{i+1}(QS_C)$ while applying $\mu$ on $r$, for an existential variable $y \in \{\vec{y}\}$, $h(\_{:}b)$ is the blank node introduced in $dChase_{j+1}(QS^{crit}_C)$ for the existential variable $y$ while applying $h \circ \mu$ on $r$. Hence, it follows that, for any $i \in \mathbb{N}$, $applicable_{R}(r, \mu, dChase_{i}(QS_C))$ implies that there exists $j \leq i$ such that $applicable(r, h \circ \mu, dChase_{j}(QS^{crit}_C))$, for any $r$, $\mu$.

Also note that, for any Skolem blank node $\_{:}b$ generated in $dChase_{i}(QS_C)$, $\lambda_r(\_{:}b) = \lambda_r(h(\_{:}b))$, $\lambda_c(\_{:}b) = \lambda_c(h(\_{:}b))$ and $\lambda_v(\_{:}b)[h] = \lambda_v(h(\_{:}b))$. Hence, it follows that for any Skolem blank node $\_{:}b$ in $dChase(QS_C)$, $h$ is a homomorphism from the descendance graph of $\_{:}b$ to the descendance graph of $h(\_{:}b)$ in $dChase(QS^{crit}_C)$. Hence, if there exist two Skolem blank nodes $\_{:}b, \_{:}b'$ in $dChase(QS_C)$, with $\_{:}b'$ a descendant of $\_{:}b$, $originRuleId(\_{:}b) = originRuleId(\_{:}b')$ and $originVector(\_{:}b) \cong originVector(\_{:}b')$, then it follows that there exist $h(\_{:}b), h(\_{:}b')$ in $dChase(QS^{crit}_C)$, with $h(\_{:}b')$ a descendant of $h(\_{:}b)$, $originRuleId(h(\_{:}b)) = originRuleId(h(\_{:}b'))$ and $originVector(h(\_{:}b)) \cong originVector(h(\_{:}b'))$. Hence, it follows from the definition that $QS^{crit}_C$ is unsafe.
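The initialization of $h$ in the proof above can be sketched as follows. This is an illustrative fragment only: quads are modelled as 4-tuples `(context, s, p, o)`, the label `"_:b_crit"` stands in for the critical blank node, and the names `init_homomorphism` and `apply_h` are hypothetical. The extension of $h$ to Skolem blank nodes generated during the chase is not covered.

```python
CRIT = "_:b_crit"  # assumed label for the critical blank node


def init_homomorphism(constants, crit_constants):
    """Map every constant that does not occur in the critical quad-system
    to the critical blank node, and every other constant to itself."""
    return {c: (c if c in crit_constants else CRIT) for c in constants}


def apply_h(quads, h):
    """Apply h to subject, predicate and object of each quad; context
    identifiers and terms outside the domain of h are left unchanged."""
    return {(ctx, h.get(s, s), h.get(p, p), h.get(o, o))
            for (ctx, s, p, o) in quads}
```

For example, with constants `{"ex:a", "rdf:type", "ex:C"}` of which only `rdf:type` and `ex:C` occur in the critical quad-system, the quad `("c1", "ex:a", "rdf:type", "ex:C")` is mapped to `("c1", "_:b_crit", "rdf:type", "ex:C")`.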
